Calculus

3 minute read

Matrix Calculus

The Gradient

The gradient of function f wrt matrix $A \in \mathbb { R } ^ { m \times n }$ ) is the following matrix of partial derivatives:

$\nabla _ { A } f ( A ) \in \mathbb { R } ^ { m \times n } = \left[ \begin{array} { c c c c } { \frac { \partial f ( A ) } { \partial A _ { 11 } } } & { \frac { \partial f ( A ) } { \partial A _ { 12 } } } & { \cdots } & { \frac { \partial f ( A ) } { \partial A _ { 11 } } } \\ { \frac { \partial f ( A ) } { \partial A _ { 21 } } } & { \frac { \partial f ( A ) } { \partial A _ { 22 } } } & { \cdots } & { \frac { \partial f ( A ) } { \partial A _ { 2 n } } } \\ { \vdots } & { \vdots } & { \ddots } & { \vdots } \\ { \frac { \partial f ( A ) } { \partial A _ { m 1 } } } & { \frac { \partial f ( A ) } { \partial A _ { m 2 } } } & { \cdots } & { \frac { \partial f ( A ) } { \partial A _ { m n } } } \end{array} \right]$

The size of $\nabla _ { A } f ( A )$ is the same as the size of A. The gradient esixts only if the function in real valued , that is it returns a scalar value.

Gradients have the following very important property is crucial in machine learning. $\nabla f ( \mathbf { x } )$ points in the direction of steepest ascent from x. This can be used to iteratively minimize a function thru gradient descent.

It has the following properties :

$\nabla _ { x } ( f ( x ) + g ( x ) ) = \nabla _ { x } f ( x ) + \nabla _ { x } g ( x )$
For $t \in \mathbb { R } , \nabla _ { x } ( t f ( x ) ) = t \nabla _ { x } f ( x )$

The Jacobian

The Jacobian is a matrix of the following first order partial derivatives

$\mathbf { J } _ { f } = \left[ \begin{array} { c c c } { \frac { \partial f _ { 1 } } { \partial x _ { 1 } } } & { \cdots } & { \frac { \partial f _ { 1 } } { \partial x _ { n } } } \\ { \vdots } & { \ddots } & { \vdots } \\ { \frac { \partial f _ { m } } { \partial x _ { 1 } } } & { \cdots } & { \frac { \partial f _ { m } } { \partial x _ { n } } } \end{array} \right] \quad \text { i.e. } \quad \left[ \mathbf { J } _ { f } \right] _ { i j } = \frac { \partial f _ { i } } { \partial x _ { j } }$

The Hessian

Hessian matrix $\nabla _ { x } ^ { 2 } f ( x )$ or H is the below matrix of partial derivatives.

$\nabla _ { x } ^ { 2 } f ( x ) \in \mathbb { R } ^ { n \times n } = \left[ \begin{array} { c c c c } { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 1 } ^ { 2 } } } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 1 } \partial x _ { 2 } } } & { \dots } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 1 } \partial x _ { n } } } \\ { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 2 } \partial x _ { 1 } } } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 2 } ^ { 2 } } } & { \cdots } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { 2 } \partial x _ { n } } } \\ { \vdots } & { \vdots } & { \ddots } & { \vdots } \\ { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { n } \partial x _ { 1 } } } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { n } \partial x _ { 2 } } } & { \cdots } & { \frac { \partial ^ { 2 } f ( x ) } { \partial x _ { n } ^ { 2 } } } \end{array} \right]$

Similar to the gradient, the Hessian is defined only when f(x) is real-valued. The gradient as the analogue of the first derivative for functions of vectors, and the Hessian as the analogue of the second derivative.

The Hessian is used in some optimization algorithms such as Newton’s method. It is expensive to calculate but can drastically reduce the number of iterations needed to converge to a local minimum by providing information about the curvature of the function.

Examples of Gradient Calculation

Let $f ( x ) = b ^ { T } x$

Then $f ( x ) = \sum _ { i = 1 } ^ { n } b _ { i } x _ { i }$

So $\frac { \partial f ( x ) } { \partial x _ { k } } = \frac { \partial } { \partial x _ { k } } \sum _ { i = 1 } ^ { n } b _ { i } x _ { i } = b _ { k }$

Hence $\nabla _ { x } b ^ { T } x = b$

Examples of Hessian Calculation

Let $f ( x ) = x ^ { T } A x$

$\frac { \partial ^ { 2 } f ( x ) } { \partial x _ { k } \partial x _ { \ell } } = \frac { \partial } { \partial x _ { k } } \left[ \frac { \partial f ( x ) } { \partial x _ { \ell } } \right] = \frac { \partial } { \partial x _ { k } } \left[ 2 \sum _ { i = 1 } ^ { n } A _ { \ell i } x _ { i } \right] = 2 A _ { \ell k } = 2 A _ { k \ell }$

Hence $\nabla _ { x } ^ { 2 } x ^ { T } A x = 2 A$

Commonly used rules in Matrix Calculus:

$\nabla _ { \mathbf { x } } \left( \mathbf { a } ^ { \top } \mathbf { x } \right) = \mathbf { a }$
$\nabla _ { \mathbf { x } } \left( \mathbf { x } ^ { \top } \mathbf { A } \mathbf { x } \right) = \left( \mathbf { A } + \mathbf { A } ^ { \top } \right) \mathbf { x }$
Chain rule for multivariate functions : Let $z = f ( x , y )$ where x and y depend on one or more variables and differentiable at t. Then

$\frac { d z } { d t } = \frac { \partial z } { \partial x } \frac { d x } { d t } + \frac { \partial z } { \partial y } \frac { d y } { d t }$

Example:

Let z=x2y−y2 where x and y are parametrized as x=t2 and y=2t.

$\begin{aligned} \frac { d z } { d t } & = \frac { \partial z } { \partial x } \frac { d x } { d t } + \frac { \partial z } { \partial y } \frac { d y } { d t } \\ & = ( 2 x y ) ( 2 t ) + \left( x ^ { 2 } - 2 y \right) ( 2 ) \\ & = \left( 2 t ^ { 2 } \cdot 2 t \right) ( 2 t ) + \left( \left( t ^ { 2 } \right) ^ { 2 } - 2 ( 2 t ) \right) ( 2 ) \\ & = 8 t ^ { 4 } + 2 t ^ { 4 } - 8 t \\ & = 10 t ^ { 4 } - 8 t \end{aligned}$

if $x ^ { * }$ is a local minimum of f and f is continuously differentiable in a neighborhood of $x ^ { * }$ ,then $\nabla f \left( \mathrm { x } ^ { * } \right) = 0$ . Points where the gradient vanishes are called stationary points. Note that not all stationary points are extrema.
First-order information (i.e. the gradient) alone is insufficient to characterize local minima. But we can say more with second-order information (i.e. the Hessian).
If f is twice continuously differentiable with $\nabla ^ { 2 } f$ positive semi-definite of $\mathbf { X } ^ { * }$, and $\nabla f \left( \mathbf { x } ^ { * } \right) = 0$.Then $\mathbf { X } ^ { * }$ is a local minimum of f.

Share on

Twitter Facebook Google+ LinkedIn

Ganesh Santhanam

Calculus

Matrix Calculus

The Gradient

The Jacobian

The Hessian

Examples of Gradient Calculation

Examples of Hessian Calculation

Commonly used rules in Matrix Calculus:

Example:

Share on

You May Also Enjoy

Linear Algebra

Chat bot Using seq2seq model with Attention

Bayesian Deep Learning

Word2vec for Embedding