Calculus
Matrix Calculus
The Gradient
The gradient of a function $f : \mathbb{R}^{m \times n} \to \mathbb{R}$ with respect to a matrix $A \in \mathbb{R}^{m \times n}$ is the following matrix of partial derivatives:
$$\nabla_A f(A) = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \end{bmatrix}, \qquad \left( \nabla_A f(A) \right)_{ij} = \frac{\partial f(A)}{\partial A_{ij}}$$
The size of $\nabla_A f(A)$ is the same as the size of $A$. The gradient exists only if the function is real-valued, that is, it returns a scalar value.
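As a sanity check on this entrywise definition, here is a minimal NumPy sketch (the helper name `numerical_gradient_matrix` and the choice $f(A) = \sum_{i,j} A_{ij}^2$ are illustrative, not from the text above) that approximates each $\partial f / \partial A_{ij}$ by central differences and compares the result with the exact gradient $2A$.

```python
import numpy as np

# f(A) = sum of squared entries; its exact gradient with respect to A is 2A.
def f(A):
    return np.sum(A ** 2)

def numerical_gradient_matrix(f, A, eps=1e-6):
    grad = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            E = np.zeros_like(A)
            E[i, j] = eps
            # central-difference approximation of df / dA_ij
            grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)
    return grad

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
print(np.allclose(numerical_gradient_matrix(f, A), 2 * A))  # True
```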
Gradients have the following very important property, which is crucial in machine learning: $\nabla f ( \mathbf { x } )$ points in the direction of steepest ascent from $\mathbf{x}$. This can be used to iteratively minimize a function through gradient descent.
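Below is a minimal gradient descent sketch, assuming a hand-coded gradient and a fixed step size (the example $f(x) = \lVert x - c \rVert^2$ and the names `grad_f`, `step_size` are illustrative): each iteration steps against the gradient, i.e. in the direction of steepest descent.

```python
import numpy as np

# Gradient descent on f(x) = ||x - c||^2, whose gradient is 2(x - c).
c = np.array([3.0, -1.0])

def grad_f(x):
    return 2 * (x - c)

x = np.zeros(2)          # starting point
step_size = 0.1          # fixed learning rate (illustrative choice)
for _ in range(200):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-8:   # stop near a stationary point
        break
    x = x - step_size * g          # move against the gradient

print(x)  # approaches the minimizer c = [3, -1]
```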
The gradient has the following properties (checked numerically in the sketch after this list):
- $\nabla _ { x } ( f ( x ) + g ( x ) ) = \nabla _ { x } f ( x ) + \nabla _ { x } g ( x )$
- For $t \in \mathbb { R } , \nabla _ { x } ( t f ( x ) ) = t \nabla _ { x } f ( x )$
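Both properties can be verified with finite differences; the sketch below uses arbitrary example functions and an illustrative `numerical_gradient` helper.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # central-difference approximation of the gradient at x
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

f = lambda x: np.sin(x[0]) + x[1] ** 2      # arbitrary example functions
g = lambda x: np.exp(x[0] * x[1])
x = np.array([0.3, -0.7])

lhs = numerical_gradient(lambda v: f(v) + g(v), x)
rhs = numerical_gradient(f, x) + numerical_gradient(g, x)
print(np.allclose(lhs, rhs))                # True: gradients add
print(np.allclose(numerical_gradient(lambda v: 5.0 * f(v), x),
                  5.0 * numerical_gradient(f, x)))  # True: scalars factor out
```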
The Jacobian
The Jacobian of a function $f : \mathbb{R}^n \to \mathbb{R}^m$ is the $m \times n$ matrix of first-order partial derivatives:
$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}, \qquad J_{ij} = \frac{\partial f_i}{\partial x_j}$$
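A small sketch of this definition (the example map $f : \mathbb{R}^2 \to \mathbb{R}^3$ and the helper name are illustrative): approximate each column $\partial f / \partial x_j$ by central differences and assemble the $m \times n$ Jacobian.

```python
import numpy as np

def f(x):
    # example map from R^2 to R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def numerical_jacobian(f, x, eps=1e-6):
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)  # column j = df/dx_j
    return J

x = np.array([0.5, 2.0])
print(numerical_jacobian(f, x))
# analytic Jacobian for comparison:
# [[x1, x0], [cos(x0), 0], [0, 2*x1]] = [[2.0, 0.5], [0.8776, 0.0], [0.0, 4.0]]
```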
The Hessian
The Hessian matrix, written $\nabla_x^2 f(x)$ or $H$, is the following matrix of second-order partial derivatives:
$$\nabla_x^2 f(x) = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \end{bmatrix}, \qquad \left( \nabla_x^2 f(x) \right)_{ij} = \frac{\partial^2 f(x)}{\partial x_i \partial x_j}$$
Similar to the gradient, the Hessian is defined only when $f(x)$ is real-valued. The gradient can be thought of as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative.
The Hessian is used in some optimization algorithms such as Newton’s method. It is expensive to calculate but can drastically reduce the number of iterations needed to converge to a local minimum by providing information about the curvature of the function.
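Here is a minimal sketch of Newton's method, assuming the gradient and Hessian are available in closed form and the Hessian is invertible at every iterate (the example function is arbitrary): each step solves $H \, \Delta x = -\nabla f(x)$ and updates $x \leftarrow x + \Delta x$.

```python
import numpy as np

# Example function f(x) = x0^4 + x0*x1 + (1 + x1)^2
def grad(x):
    return np.array([4 * x[0] ** 3 + x[1],
                     x[0] + 2 * (1 + x[1])])

def hess(x):
    return np.array([[12 * x[0] ** 2, 1.0],
                     [1.0,            2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Newton step: solve H dx = -g rather than inverting H explicitly
    x = x + np.linalg.solve(hess(x), -g)

print(x, grad(x))  # gradient is ~0 at the computed stationary point
```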
Examples of Gradient Calculation
Let $f(x) = b^T x$ for some known vector $b \in \mathbb{R}^n$
Then $f ( x ) = \sum _ { i = 1 } ^ { n } b _ { i } x _ { i }$
So $\frac { \partial f ( x ) } { \partial x _ { k } } = \frac { \partial } { \partial x _ { k } } \sum _ { i = 1 } ^ { n } b _ { i } x _ { i } = b _ { k }$
Hence $\nabla _ { x } b ^ { T } x = b$
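This result can be confirmed numerically; the sketch below (with an illustrative central-difference helper and randomly drawn $b$ and $x$) checks that the finite-difference gradient of $b^T x$ matches $b$.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # central-difference gradient, one coordinate at a time
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
b = rng.standard_normal(5)
x = rng.standard_normal(5)

print(np.allclose(numerical_gradient(lambda v: b @ v, x), b))  # True
```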
Examples of Hessian Calculation
Let $f(x) = x^T A x$ for a symmetric matrix $A \in \mathbb{R}^{n \times n}$
Then $f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j$
So $\frac{\partial f(x)}{\partial x_k} = \sum_{i=1}^{n} A_{ik} x_i + \sum_{j=1}^{n} A_{kj} x_j = 2 \sum_{i=1}^{n} A_{ki} x_i$, i.e. $\nabla_x x^T A x = 2 A x$
Differentiating once more, $\frac{\partial^2 f(x)}{\partial x_l \partial x_k} = 2 A_{lk}$
Hence $\nabla_x^2 x^T A x = 2A$ (for a general, non-symmetric $A$, the Hessian is $A + A^T$)
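The sketch below (helper name and test values are illustrative) builds the Hessian of $x^T A x$ from second-order central differences of $f$ itself and confirms it equals $A + A^T$ for a deliberately non-symmetric $A$.

```python
import numpy as np

def numerical_hessian(f, x, h=1e-3):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            # second-order central difference for d^2 f / dx_i dx_j
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h * h)
    return H

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # deliberately non-symmetric
x = rng.standard_normal(4)
f = lambda v: v @ A @ v                  # f(x) = x^T A x

print(np.allclose(numerical_hessian(f, x), A + A.T))  # True
```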
Commonly used rules in Matrix Calculus:
- $\nabla _ { \mathbf { x } } \left( \mathbf { a } ^ { \top } \mathbf { x } \right) = \mathbf { a }$
- $\nabla _ { \mathbf { x } } \left( \mathbf { x } ^ { \top } \mathbf { A } \mathbf { x } \right) = \left( \mathbf { A } + \mathbf { A } ^ { \top } \right) \mathbf { x }$
- Chain rule for multivariate functions: let $z = f(x, y)$, where $x$ and $y$ are differentiable functions of $t$ and $f$ is differentiable. Then
$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt}$$
(When $x$ and $y$ depend on more than one variable, the same formula holds with $\partial/\partial t$ in place of $d/dt$.)
Example:
Let $z = x^2 y - y^2$ where $x$ and $y$ are parametrized as $x = t^2$ and $y = 2t$. Then
$$\frac{dz}{dt} = \frac{\partial z}{\partial x}\frac{dx}{dt} + \frac{\partial z}{\partial y}\frac{dy}{dt} = (2xy)(2t) + (x^2 - 2y)(2) = 8t^4 + 2t^4 - 8t = 10t^4 - 8t$$
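As a quick numerical check of this example (evaluation points chosen arbitrarily), the code below compares a central-difference derivative of $z(t) = 2t^5 - 4t^2$ with the chain-rule result $10t^4 - 8t$.

```python
import numpy as np

def z(t):
    x, y = t ** 2, 2 * t          # the parametrization from the example
    return x ** 2 * y - y ** 2    # z = x^2 y - y^2

def dz_dt_chain_rule(t):
    return 10 * t ** 4 - 8 * t    # result derived via the chain rule

eps = 1e-6
for t in [0.5, 1.0, 2.0]:
    fd = (z(t + eps) - z(t - eps)) / (2 * eps)   # central difference in t
    print(t, fd, dz_dt_chain_rule(t))            # the two values agree
```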
- If $x^*$ is a local minimum of $f$ and $f$ is continuously differentiable in a neighborhood of $x^*$, then $\nabla f(x^*) = 0$. Points where the gradient vanishes are called stationary points. Note that not all stationary points are extrema.
- First-order information (i.e. the gradient) alone is insufficient to characterize local minima. But we can say more with second-order information (i.e. the Hessian).
- If $f$ is twice continuously differentiable with $\nabla^2 f$ positive semi-definite in a neighborhood of $x^*$, and $\nabla f(x^*) = 0$, then $x^*$ is a local minimum of $f$.
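A small sketch of these two conditions, using the illustrative quadratic $f(x) = \frac{1}{2} x^T A x + b^T x$ with $A$ symmetric positive definite: the stationary point solves $A x^* + b = 0$, and the eigenvalues of the Hessian (which is $A$) confirm it is a minimum.

```python
import numpy as np

# f(x) = 0.5 x^T A x + b^T x with A symmetric positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([-1.0, 4.0])

# First-order condition: grad f(x*) = A x* + b = 0
x_star = np.linalg.solve(A, -b)
print("stationary point:", x_star)
print("gradient there:  ", A @ x_star + b)        # ~ [0, 0]

# Second-order condition: the Hessian of f is A; its eigenvalues are
# positive, so x* is a (here, global) minimum.
print("Hessian eigenvalues:", np.linalg.eigvalsh(A))
```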