Bayesian Deep Learning

A Bayesian deep network is a standard feed-forward neural network with a prior over each weight. We can then use Bayes' rule to obtain the posterior over the weights. The advantage of this is that we get uncertainty information as well as the parameter estimates.

Normal Prior

Posterior estimation
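As a rough sketch of the two equations above (notation assumed, since the originals are rendered as images): each weight gets a standard normal prior, and Bayes' rule gives the posterior over the weights w given the data D,

w_{ij} \sim \mathcal{N}(0, 1), \qquad
p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{\int p(\mathcal{D} \mid w)\, p(w)\, dw}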

Automatic Differentiation Variational Inference

We maximize the evidence lower bound (ELBO) because we cannot directly minimize the KL divergence: it lacks an analytic form, since it involves the posterior.

The first term is the expectation of the joint density under the approximation, and the second is the entropy of the variational density. The ELBO equals the negative KL divergence up to the constant log p(x), so maximizing the ELBO minimizes the KL divergence (Jordan et al., 1999; Bishop, 2006). We would like to maximize the ELBO with stochastic gradient ascent, using automatic differentiation to compute the gradients.
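Written out (a sketch, with q(θ; φ) denoting the variational approximation and x the data):

\mathrm{ELBO}(\phi) = \mathbb{E}_{q(\theta;\phi)}\big[\log p(x, \theta)\big] - \mathbb{E}_{q(\theta;\phi)}\big[\log q(\theta;\phi)\big]
                    = \log p(x) - \mathrm{KL}\big(q(\theta;\phi) \,\|\, p(\theta \mid x)\big)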

However, we cannot apply automatic differentiation directly to the ELBO, because it involves an intractable expectation. Hence we begin by transforming the support of the latent variables θ so that they live in the real coordinate space.

Transforming latent variables to the real coordinate space
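A sketch of this step, following Kucukelbir et al. (2017): a one-to-one map T sends θ to the unconstrained variable ζ, and the joint density picks up a Jacobian term,

\zeta = T(\theta), \qquad p(x, \zeta) = p\big(x, T^{-1}(\zeta)\big)\, \big|\det J_{T^{-1}}(\zeta)\big|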

Then we employ one final transformation: elliptical standardization.

The standardization transforms the variational problem into the following:

Real coordinate to standardized space
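A sketch of the standardization and the resulting objective in the mean-field case (notation follows Kucukelbir et al. (2017); additive constants from the Gaussian entropy are dropped):

\eta = S_{\mu,\omega}(\zeta) = \mathrm{diag}\big(\exp(\omega)\big)^{-1}(\zeta - \mu)

\mathcal{L}(\mu,\omega) = \mathbb{E}_{\mathcal{N}(\eta;\,0,\,I)}\Big[\log p\big(x, T^{-1}(S^{-1}_{\mu,\omega}(\eta))\big) + \log\big|\det J_{T^{-1}}\big(S^{-1}_{\mu,\omega}(\eta)\big)\big|\Big] + \textstyle\sum_k \omega_k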

We now reach the final step: stochastic optimization of the variational objective function. Since the expectation no longer depends on the variational parameters φ, we can push the gradient inside the expectation and estimate it with Monte Carlo samples.
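For example, the gradient with respect to the mean μ can be sketched as follows (after Kucukelbir et al. (2017), with θ = T^{-1}(ζ) and ζ = S^{-1}_{μ,ω}(η); the expectation is estimated with Monte Carlo draws of η):

\nabla_\mu \mathcal{L} = \mathbb{E}_{\mathcal{N}(\eta;\,0,\,I)}\Big[\nabla_\theta \log p(x, \theta)\, \nabla_\zeta T^{-1}(\zeta) + \nabla_\zeta \log\big|\det J_{T^{-1}}(\zeta)\big|\Big]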

We obtain gradients with respect to ω (mean-field) and L (full-rank) in a similar fashion.

Example in Python - Forest Cover Dataset and PyMC3

The data used in this report was taken from the UC Irvine Machine Learning Repository. This “Covertype” dataset has 66 cartographic variables as input and one of 7 forest cover types as output. The dataset consists of 15,120 samples of 30 m x 30 m patches of forest located in northern Colorado’s Roosevelt National Forest, each labelled with one of the seven cover types.
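Before defining the model below, the inputs, targets and weight initializations need to be set up. The following is a minimal sketch assuming a local CSV export of the data; the file name covtype_15120.csv, the column layout, and the initialization scale are assumptions for illustration, not the exact preprocessing used in the report.

import numpy as np
import pandas as pd

# Hypothetical file name and layout: the last column holds the cover type
# (1-7) and the remaining 66 columns hold the cartographic inputs.
data = pd.read_csv('covtype_15120.csv')

X = data.iloc[:, :-1].values.astype('float64')
y = data.iloc[:, -1].values.astype('int64') - 1   # classes 0..6

# Standardize the inputs (guarding against constant columns)
std = X.std(axis=0)
std[std == 0] = 1.0
X = (X - X.mean(axis=0)) / std

# One-hot encode the 7 cover types
Y = np.eye(7)[y]

ann_input, ann_output = X, Y

# Small random starting values for the three weight matrices used in the model
rng = np.random.RandomState(0)
init_1 = rng.randn(X.shape[1], 20) * 0.1
init_2 = rng.randn(20, 20) * 0.1
init_out = rng.randn(20, 7) * 0.1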

For probabilistic programming in Python we use PyMC3. It provides statistical distributions and sampling algorithms, and uses Theano for automatic differentiation in the backend.

We use a shallow neural network with 2 hidden layers of 20 nodes each. It has 66 inputs and outputs one of 7 classes. As it is a shallow network, ReLU is not used; instead tanh is used, as it has stronger gradients over its range of [-1, 1].

Neural Network with 2 hidden layers

We visualize the weights of the output layer.

Output layer weights

We can compute both the point estimates and the Bayesian estimates.

Point Estimates

Bayesian Estimates with uncertainty

Though the point estimates indicate clear classification between the class boundaries, the Bayesian estimates tell a very different story. It appears that the model has difficulty differentiating between classes 1 and 2. Hence we should try to get more data for these two classes to predict with more confidence.

References

  1. Blackard, Jock A., and Dean, Denis J. “Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables”. Computers and Electronics in Agriculture (1999).

  2. Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14):1–45, 2017.

  3. Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Arnaud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.

  4. John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55, 2016.

  5. D. Blei, A. Kucukelbir, and J. McAuliffe. Variational inference: A review for statisticians. arXiv:1601.00670, 2016.

import theano  # theano.shared is used when setting up ADVI below
import theano.tensor as tt  # pymc devs are discussing new backends
import pymc3 as pm

n_hidden = 20

with pm.Model() as nn_model:
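  # ann_input, ann_output, init_1, init_2 and init_out are assumed to be
  # defined beforehand (see the data-preparation sketch earlier)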
  # Input -> Layer 1
  weights_1 = pm.Normal('w_1', mu=0, sd=1,
                        shape=(ann_input.shape[1], n_hidden),
                        testval=init_1)
  acts_1 = pm.Deterministic('activations_1',
                            tt.tanh(tt.dot(ann_input, weights_1)))

  # Layer 1 -> Layer 2
  weights_2 = pm.Normal('w_2', mu=0, sd=1,
                        shape=(n_hidden, n_hidden),
                        testval=init_2)
  acts_2 = pm.Deterministic('activations_2',
                            tt.tanh(tt.dot(acts_1, weights_2)))

  # Layer 2 -> Output Layer
  weights_out = pm.Normal('w_out', mu=0, sd=1,
                          shape=(n_hidden, ann_output.shape[1]),
                          testval=init_out)
  acts_out = pm.Deterministic('activations_out',
                              tt.nnet.softmax(tt.dot(acts_2, weights_out)))  # noqa

  # Define likelihood
  out = pm.Multinomial('likelihood', n=1, p=acts_out,
                       observed=ann_output)

with nn_model:
  s = theano.shared(pm.floatX(1.1))
  inference = pm.ADVI(cost_part_grad_scale=s)  # approximate inference done using ADVI
  approx = pm.fit(100000, method=inference)
  trace = approx.sample(5000)
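The point estimates and Bayesian estimates discussed earlier can be obtained from the fitted trace. Below is a minimal sketch that re-implements the forward pass in NumPy and summarizes the posterior draws; the thinning factor and the names introduced here (forward, probs_point, probs_draws) are illustrative, not part of the original report.

import numpy as np

def forward(X, w1, w2, w_out):
    # Same architecture as the model above: tanh -> tanh -> softmax
    a1 = np.tanh(X @ w1)
    a2 = np.tanh(a1 @ w2)
    logits = a2 @ w_out
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

w1_draws = trace['w_1']
w2_draws = trace['w_2']
wout_draws = trace['w_out']

# Point estimates: a single forward pass with the posterior-mean weights
probs_point = forward(ann_input, w1_draws.mean(axis=0),
                      w2_draws.mean(axis=0), wout_draws.mean(axis=0))

# Bayesian estimates: one forward pass per (thinned) posterior draw,
# giving a distribution over class probabilities for every sample
idx = range(0, len(w1_draws), 10)
probs_draws = np.stack([forward(ann_input, w1_draws[i],
                                w2_draws[i], wout_draws[i]) for i in idx])
probs_mean = probs_draws.mean(axis=0)   # predictive class probabilities
probs_sd = probs_draws.std(axis=0)      # per-class uncertainty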
