Updates from ShareLaTeX
kumar-shridhar committed Dec 28, 2018
1 parent 6233b5a commit c7eda61
Showing 6 changed files with 28 additions and 15 deletions.
4 changes: 2 additions & 2 deletions Chapter2/chapter2.tex
@@ -134,7 +134,7 @@ \subsection{Variational Inference}
Maximizing the variational objective, which rewards the expected log-likelihood while penalizing the KL divergence between the variational posterior and the prior over $w$, results in a variational distribution that learns a good representation from the data and stays close to the prior distribution. In other words, it helps prevent overfitting.
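
For reference, this objective (the evidence lower bound in its standard form; notation as above) can be written as
\begin{align}
\mathcal{L}(\theta) = \mathbb{E}_{q_{\theta}(w|\mathcal{D})}\big[\log p(\mathcal{D}|w)\big] - \text{KL}\big(q_{\theta}(w|\mathcal{D})\,\|\,p(w)\big),
\end{align}
where maximizing the first term fits the data and minimizing the second keeps $q_{\theta}$ close to the prior, acting as a regularizer.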


\subsection{Local reparametrisation trick}
\subsection{Local Reparametrisation Trick}
The ability to rewrite statistical problems in an equivalent but different form, to reparameterise them, is one of the most general-purpose tools used in mathematical statistics. The reparameterization in which the global uncertainty in the weights is translated into a form of local uncertainty that is independent across examples is known as the \emph{local reparameterization trick}.
An alternative estimator is deployed for which $\Cov{}{L_{i},L_{j}} = 0$, so that the variance of the stochastic gradients scales as $1/M$.
The new estimator is made computationally efficient by sampling the intermediate variables: instead of sampling $\beps$ directly, only $f(\beps)$, through which $\beps$ influences $L_\D^\text{SGVB}(\bphi)$, is sampled. Hence, since the source of global noise can be translated into local noise ($\beps \rightarrow f(\beps)$), a local reparameterization can be applied to obtain a statistically efficient gradient estimator.
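
As an illustrative sketch (not the exact implementation used in this work; PyTorch-style code with variable names of our own choosing), the trick for a fully-connected layer with a factorised Gaussian posterior over the weights samples the activations rather than the weights:
\begin{verbatim}
import torch

def dense_local_reparam(x, w_mu, w_logvar):
    """Sample layer activations instead of weights (local reparameterization).

    x:        (batch, in_features) input
    w_mu:     (in_features, out_features) posterior means
    w_logvar: (in_features, out_features) posterior log-variances
    """
    act_mu = x @ w_mu                             # mean of the activations
    act_var = (x ** 2) @ torch.exp(w_logvar)      # variance of the activations
    eps = torch.randn_like(act_mu)                # one noise draw per example
    return act_mu + torch.sqrt(act_var) * eps     # b = mu + sqrt(var) * eps
\end{verbatim}
Because the noise is drawn independently for every example, the contribution of the sampling noise to the minibatch gradient variance scales as $1/M$, as stated above.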
@@ -207,7 +207,7 @@ \subsection{Bayes by Backprop}
where $n$ is the number of draws.
\newline We sample $w^{(i)}$ from $q_{\theta}(w|\mathcal{D})$. The uncertainty afforded by \textit{Bayes by Backprop} trained neural networks has been used successfully for training feedforward neural networks in both supervised and reinforcement learning environments \cite{blundell2015weight,lipton2016efficient,houthooft2016curiosity} and for training recurrent neural networks \cite{fortunato2017bayesian}, but it has not been applied to convolutional neural networks to date.
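
The sampling step itself can be sketched as follows (an illustration only, assuming the common parameterisation $\sigma = \log(1+\exp(\rho))$; the parameterisation actually used in this work is described later):
\begin{verbatim}
import torch
import torch.nn.functional as F

def sample_weight(mu, rho):
    """Draw one weight sample w = mu + sigma * eps (Bayes by Backprop).

    mu, rho: tensors of the same shape holding the variational
    parameters theta = (mu, rho); sigma is kept positive via the
    softplus transform sigma = log(1 + exp(rho)).
    """
    sigma = F.softplus(rho)        # positive standard deviation
    eps = torch.randn_like(mu)     # eps ~ N(0, 1)
    return mu + sigma * eps        # reparameterized sample from q_theta(w|D)
\end{verbatim}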

\section{Model weights pruning}
\section{Model Weights Pruning}

Model pruning increases the sparsity of a deep neural network's
various connection matrices, thereby reducing the number of non-zero-valued parameters in the model. The whole idea of model pruning is to reduce the number of parameters without much loss in the accuracy of the model. This reduces the reliance on large, regularized, over-parameterized models and promotes the use of smaller, densely connected models. Some recent work suggests \cite{DBLP:journals/corr/HanMD15} \cite{DBLP:journals/corr/NarangDSE17} that a network can achieve a sizable reduction in model size while achieving comparable accuracy.
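
A minimal sketch of magnitude-based weight pruning in the spirit of \cite{DBLP:journals/corr/HanMD15} (illustrative only; the sparsity level is a placeholder):
\begin{verbatim}
import torch

def magnitude_prune(weight, sparsity=0.5):
    """Zero out the smallest-magnitude entries of a weight tensor.

    weight:   any weight tensor of the model
    sparsity: fraction of entries to set to zero (illustrative default)
    """
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest
    mask = (weight.abs() > threshold).to(weight.dtype)     # keep larger weights
    return weight * mask
\end{verbatim}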
4 changes: 2 additions & 2 deletions Chapter3/chapter3.tex
@@ -29,7 +29,7 @@ \chapter{Related Work}

\section{Related Work}

\subsection{Bayesian training}
\subsection{Bayesian Training}
Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution $p(w|\mathcal{D})$. Buntine and Weigend \cite{buntine1991bayesian} proposed various \textit{maximum-a-posteriori} (MAP) schemes for neural networks. They were also the first to suggest second-order derivatives in the prior probability distribution $p(w)$ to encourage smoothness of the resulting approximate posterior probability distribution.
In subsequent work, Hinton and Van Camp \cite{hinton1993keeping} proposed the first variational methods, which naturally serve as a regularizer in neural networks. They also noted that the amount of information in a weight can be controlled by adding Gaussian noise: when optimizing the trade-off between the expected error and the information in the weights, the noise level can be adapted during learning.

@@ -46,7 +46,7 @@ \subsection{Bayesian training}
\newline Several authors have claimed that Dropout \cite{srivastava2014dropout} and Gaussian Dropout \cite{wang2013fast} can be viewed as approximate variational inference schemes \cite{gal2015bayesian, kingma2015variational}. We compare our results to those of Gal and Ghahramani \cite{gal2015bayesian} and discuss the methodological differences in detail.


\subsection{Uncertainties estimation}
\subsection{Uncertainty Estimation}

Neural networks can predict uncertainty when Bayesian methods are introduced into them. Attempts to model uncertainty have been studied since the 1990s \cite{neal2012bayesian} but were not applied successfully until 2015. Gal and Ghahramani \cite{Gal2015Dropout} provided a theoretical framework for modelling Bayesian uncertainty in 2015. Gal and Ghahramani \cite{gal2015bayesian} obtained uncertainty estimates by casting dropout training in conventional deep networks as a Bayesian approximation of a Gaussian process. They showed that any network trained with dropout is an approximate Bayesian model, and uncertainty estimates can be obtained by computing the variance of multiple predictions with different dropout masks.
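
The procedure can be sketched as follows (a simplified illustration assuming a PyTorch classifier with dropout layers; it is not the exact code of the cited authors):
\begin{verbatim}
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=50):
    """Monte Carlo dropout: keep dropout active at test time.

    Returns the predictive mean and per-class variance over n_samples
    stochastic forward passes (a different dropout mask each pass).
    """
    model.train()  # keep dropout layers stochastic
    probs = torch.stack([torch.softmax(model(x), dim=-1)
                         for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)
\end{verbatim}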

10 changes: 5 additions & 5 deletions Chapter4/chapter4.tex
@@ -28,7 +28,7 @@ \chapter{Concept}

\pagebreak

\section{Bayesian convolutional neural networks with variational inference}
\section{Bayesian Convolutional Neural Networks with Variational Inference}
In this section, we explain our algorithm for building a \ac{cnn} with probability distributions over its weights in each filter, as seen in Figure \ref{fig:filter_scalar}, and for applying variational inference, i.e. \textit{Bayes by Backprop}, to approximate the intractable true posterior probability distribution, as described in the last chapter. Notably, for most \ac{cnn} architectures, a fully Bayesian perspective on a \ac{cnn} is not accomplished by merely placing probability distributions over weights in convolutional layers; it also requires probability distributions over weights in fully-connected layers (see Figure \ref{fig:CNNwithdist_grey}).
%
\begin{figure}[H]
@@ -54,19 +54,19 @@ \section{Bayesian convolutional neural networks with variational inference}
\end{center}
\end{figure}
%
\subsection{Local reparameterization trick for convolutional layers}
\subsection{Local Reparameterization Trick for Convolutional Layers}
We utilise the local reparameterization trick \cite{kingma2015variational} and apply it to \acp{cnn}. Following \cite{kingma2015variational,neklyudov2018variance}, we do not sample the weights $w$ but instead sample the layer activations $b$, owing to the consequent computational acceleration. The variational posterior probability distribution $q_{\theta}(w_{ijhw}|\mathcal{D})=\mathcal{N}(\mu_{ijhw},\alpha_{ijhw}\mu^2_{ijhw})$ (where $i$ and $j$ index the input and output layers, and $h$ and $w$ the height and width of any given filter, respectively) allows us to implement the local reparameterization trick in convolutional layers. This results in the following equation for the convolutional layer activations $b$:
\begin{equation}
b_j=A_i\ast \mu_i+\epsilon_j\odot \sqrt{A^2_i\ast (\alpha_i\odot \mu^2_i)}
\end{equation}
where $\epsilon_j \sim \mathcal{N}(0,1)$, $A_i$ is the receptive field, $\ast$ denotes the convolutional operation, and $\odot$ the component-wise multiplication.
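
This activation-sampling step can be sketched in code as follows (an illustration with PyTorch-style operations; variable names are ours and stride/padding details are omitted):
\begin{verbatim}
import torch
import torch.nn.functional as F

def bayesian_conv2d(A, mu, alpha, jitter=1e-8):
    """b = A * mu + eps . sqrt(A^2 * (alpha . mu^2))  (local reparam.)

    A:     input activations (receptive field), shape (N, C_in, H, W)
    mu:    posterior means of the filters, shape (C_out, C_in, kH, kW)
    alpha: posterior variance multipliers, same shape as mu
    """
    act_mu = F.conv2d(A, mu)                          # A_i * mu_i
    act_var = F.conv2d(A ** 2, alpha * mu ** 2)       # A_i^2 * (alpha_i . mu_i^2)
    eps = torch.randn_like(act_mu)                    # eps_j ~ N(0, 1)
    return act_mu + eps * torch.sqrt(act_var + jitter)  # activations b_j
\end{verbatim}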

\subsection{Applying two sequential convolutional operations (mean and variance)}
\subsection{Applying two Sequential Convolutional Operations (Mean and Variance)}
The crux of equipping a \ac{cnn} with probability distributions over weights instead of single point-estimates, and of being able to update the variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$ by backpropagation, lies in applying \textit{two} convolutional operations, whereas filters with single point-estimates apply \textit{one}. As explained in the last chapter, we deploy the local reparametrization trick and sample from the output $b$. Since the output $b$ is a function of, among others, the mean $\mu_{ijhw}$ and the variance $\alpha_{ijhw}\mu^2_{ijhw}$, we are able to compute the two variables determining a Gaussian probability distribution, namely the mean $\mu_{ijhw}$ and the variance $\alpha_{ijhw}\mu^2_{ijhw}$, separately.
\newline We do this in two convolutional operations: in the first, we treat the output $b$ as the output of a \ac{cnn} updated by frequentist inference. We optimize with Adam \cite{kingma2014adam} towards a single point-estimate, which increases the validation accuracy of the classifications. We interpret this single point-estimate as the mean $\mu_{ijhw}$ of the variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$. In the second convolutional operation, we learn the variance $\alpha_{ijhw}\mu^2_{ijhw}$. As this formulation of the variance includes the mean $\mu_{ijhw}$, only $\alpha_{ijhw}$ needs to be learned in the second convolutional operation \cite{molchanov2017variational}. In this way, we ensure that only one parameter is updated per convolutional operation, exactly as it would be with a \ac{cnn} updated by frequentist inference.
\newline In other words, while in the first convolutional operation we learn the MAP of the variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$, in the second convolutional operation we observe how much the values of the weights $w$ deviate from this MAP. This procedure is repeated in the fully-connected layers. In addition, to accelerate computation, to ensure a positive non-zero variance $\alpha_{ijhw}\mu^2_{ijhw}$, and to enhance accuracy, we learn $\log \alpha_{ijhw}$ and use the \textit{Softplus} activation function, as further described in the Experiments section.
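
One possible form of this parameterisation in code (a sketch under the assumption that the raw learned parameter is $\log \alpha_{ijhw}$ and that \textit{Softplus} enforces positivity; the exact choice is given in the Experiments section):
\begin{verbatim}
import torch
import torch.nn.functional as F

def conv_variance(mu, log_alpha):
    """Illustrative variance term alpha * mu^2 with alpha kept positive.

    mu:        learned filter means (first convolutional operation)
    log_alpha: raw learned parameter (second convolutional operation)
    """
    alpha = F.softplus(log_alpha)   # positive, non-zero multiplier
    return alpha * mu ** 2          # variance alpha_ijhw * mu_ijhw^2
\end{verbatim}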
%
\section{Uncertainty estimation in CNN}
\section{Uncertainty Estimation in CNN}
In classification tasks, we are interested in the predictive distribution $p_{\mathcal{D}}(y^*|x^*)$, where $x^*$ is an unseen data example and $y^*$ its predicted class. For a Bayesian neural network, this quantity is given by:
\begin{align}
p_{ \mathcal{D}}(y^*|x^*) = \int p_{w}(y^*|x^*) \ p_{\mathcal{D}}(w) \ dw
@@ -114,7 +114,7 @@ \section{Uncertainty estimation in CNN}

\newline It is of paramount importance that uncertainty is split into aleatoric and epistemic quantities, since this allows the modeller to evaluate the room for improvement: while aleatoric uncertainty (also known as statistical uncertainty) is merely a measure of the variation of (``noisy'') data, epistemic uncertainty is caused by the model. Hence, a modeller can see whether the quality of the data is low (i.e. high aleatoric uncertainty) or whether the model itself is the cause of poor performance (i.e. high epistemic uncertainty). The former cannot be reduced by gathering more data, whereas the latter can \cite{Kiureghian, kendall2017uncertainties}.
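
One common way to compute this split from $T$ stochastic forward passes can be sketched as follows (an illustration of a diagonal decomposition of the predictive variance into a mean-of-variances term (aleatoric) and a variance-of-means term (epistemic); the estimator actually used in this work is the one defined earlier in this section):
\begin{verbatim}
import torch

def uncertainty_split(probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    probs: (T, batch, classes) softmax outputs from T stochastic
    forward passes. Returns per-class aleatoric and epistemic
    uncertainty (diagonal terms only).
    """
    p_mean = probs.mean(dim=0)                       # predictive mean
    aleatoric = (probs * (1.0 - probs)).mean(dim=0)  # mean of p_t(1 - p_t)
    epistemic = ((probs - p_mean) ** 2).mean(dim=0)  # variance around the mean
    return aleatoric, epistemic
\end{verbatim}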

\section{Model pruning}
\section{Model Pruning}

Model pruning refers to reducing the number of model weight parameters, thereby reducing the model's overall number of non-zero weights, its inference time, and its computational cost. In our work, a Bayesian convolutional network learns two parameters per weight, i.e. the mean and the variance, compared to a point-estimate network learning a single value per weight. This makes the overall number of parameters of a Bayesian network twice that of a point-estimate network with a similar architecture.

8 changes: 4 additions & 4 deletions Chapter5/chapter5.tex
@@ -241,7 +241,7 @@ \subsection{Results}
\label{tab:tableCIFAR100}
\end{table}

\section{Uncertainity estimation}
\section{Uncertainty Estimation}

\newline Finally, Table \ref{tab:uncertainty} compares the means of aleatoric and epistemic uncertainties for a Bayesian LeNet-5 with variational inference on MNIST and CIFAR-10. The aleatoric uncertainty of CIFAR-10 is about twenty times as large as that of MNIST. Considering that the aleatoric uncertainty measures the irreducible variability and depends on the predicted values, the larger aleatoric uncertainty for CIFAR-10 can be directly deduced from its lower validation accuracy and may further be due to the smaller number of training examples. The epistemic uncertainty of CIFAR-10 is about fifteen times larger than that of MNIST, which we anticipated, since epistemic uncertainty decreases as validation accuracy increases.
\begin{table}[H]
@@ -263,7 +263,7 @@ \section{Uncertainity estimation}
\label{tab:uncertainty}
\end{table}

\section{Model pruning}
\section{Model Pruning}

\subsubsection{Halving the Number of Filters}

@@ -355,11 +355,11 @@ \subsubsection{Applying L1 Norm}

One thing to note here is that the number of parameters of the Bayesian network after applying the L1 norm is not necessarily equal to the number of parameters in the frequentist AlexNet architecture; it depends on the data size and the number of classes. However, the numbers of parameters in the case of MNIST and CIFAR-10 are comparable, and there is not much reduction in accuracy either. Early stopping was applied when the validation accuracy did not change for 5 epochs; the model was saved and later pruned by applying the L1 norm.
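
An L1-norm-based filter ranking of the kind used for pruning can be sketched as follows (an illustration only; the keep ratio and the layers to prune follow the experimental setup, which is not reproduced here):
\begin{verbatim}
import torch

def l1_filter_mask(conv_weight, keep_ratio=0.5):
    """Rank convolutional filters by their L1 norm and keep the strongest.

    conv_weight: (out_channels, in_channels, kH, kW) weight tensor
    keep_ratio:  fraction of filters to keep (illustrative default)
    Returns a boolean mask over the output channels.
    """
    l1 = conv_weight.abs().sum(dim=(1, 2, 3))   # L1 norm per filter
    n_keep = max(1, int(keep_ratio * l1.numel()))
    kept = torch.topk(l1, n_keep).indices       # strongest filters
    mask = torch.zeros_like(l1, dtype=torch.bool)
    mask[kept] = True
    return mask
\end{verbatim}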

\section{Training time}
\section{Training Time}

The training time of a Bayesian \ac{cnn} is twice that of a frequentist network with a similar architecture when the number of samples is equal to one. In general, the training time $T$ of a Bayesian \ac{cnn} is defined as:
\begin{align}
T = 2 * number of samples * t
T = 2 \times \text{number of samples} \times t
\end{align}
where $t$ is the training time of a frequentist network.
The factor of 2 arises from the doubled number of learnable parameters in a Bayesian CNN, i.e. a mean and a variance for every single point-estimate weight in the frequentist network.
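As a purely illustrative example with hypothetical numbers: if a frequentist network of the same architecture needs $t = 30$ minutes per epoch and one sample is drawn, the Bayesian CNN needs $T = 2 \times 1 \times 30 = 60$ minutes per epoch; with 10 samples, $T = 2 \times 10 \times 30 = 600$ minutes.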
2 changes: 1 addition & 1 deletion Chapter6/chapter6.tex
@@ -98,7 +98,7 @@ \section{BayesCNN for Generative Adversarial Networks}

Generative Adversarial Networks (GANs) \cite{goodfellow2014generative} can be used for two major tasks: to learn good feature representations, by using the generator and discriminator networks as feature extractors, and to generate natural images. The learned feature representations or generated images can substantially reduce the number of images needed for a supervised computer vision task. However, GANs were quite unstable to train in the past, which is why we base our work on a stable GAN architecture, namely Deep Convolutional GANs (DCGAN) \cite{DBLP:journals/corr/RadfordMC15}. We use the trained Bayesian discriminators for image classification tasks, showing competitive performance with the normal DCGAN architecture.

\subsection{Our approach}
\subsection{Our Approach}

We base our work on the paper \textit{Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks} by \citet{DBLP:journals/corr/RadfordMC15}. We use the architecture of a deep convolutional generative adversarial network (DCGAN) that learns a hierarchy of representations from object parts to scenes in both the generator and the discriminator.
The generator used in the network is shown in Table \ref{tab:GeneratorArchitecture}. The architecture is kept similar to the architecture used in the DCGAN paper \cite{DBLP:journals/corr/RadfordMC15}. Table \ref{tab:DiscriminatorArchitecture} shows the discriminator network with Bayesian convolutional layers.
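
As a sketch of how such a discriminator can be assembled (illustrative only: the channel sizes are placeholders rather than the exact architecture of Table \ref{tab:DiscriminatorArchitecture}, and the Bayesian convolutional layer class described in Chapter 4 would be passed in as \texttt{conv\_layer}):
\begin{verbatim}
import torch.nn as nn

def make_discriminator(conv_layer=nn.Conv2d, in_channels=3, ndf=64):
    """DCGAN-style discriminator (sketch).

    Pass the Bayesian convolutional layer class as `conv_layer` to
    obtain the Bayesian variant; channel sizes are placeholders.
    """
    return nn.Sequential(
        conv_layer(in_channels, ndf, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(0.2, inplace=True),
        conv_layer(ndf, ndf * 2, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(ndf * 2),
        nn.LeakyReLU(0.2, inplace=True),
        conv_layer(ndf * 2, 1, kernel_size=4, stride=1, padding=0),
        nn.Sigmoid(),
    )
\end{verbatim}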
