
Commit

Merge sharelatex-2019-01-01-0829 into master
kumar-shridhar authored Jan 1, 2019
2 parents b45ee6a + 37e61c7 commit f9cc1da
Showing 9 changed files with 34 additions and 32 deletions.
2 changes: 1 addition & 1 deletion Abstract/abstract.tex
@@ -1,7 +1,7 @@
% ************************** Thesis Abstract *****************************
% Use `abstract' as an option in the document class to print only the titlepage and the abstract.
\begin{abstract}
Artificial Neural Networks are connectionist systems that perform a given task by learning on examples without having a prior knowledge about the task. This is done by finding an optimal point estimate for the weights in every node.
Artificial Neural Networks are connectionist systems that perform a given task by learning on examples without having prior knowledge about the task. This is done by finding an optimal point estimate for the weights in every node.
Generally, networks using point estimates as weights perform well with large datasets, but they fail to express uncertainty in regions with little or no data, leading to overconfident decisions.
\newline

2 changes: 1 addition & 1 deletion Appendix2/appendix2.tex
@@ -3,7 +3,7 @@

\chapter{How to replicate results}

Install PyTorch from the oficial website (\url{https://pytorch.org/})
Install PyTorch from the official website (\url{https://pytorch.org/})

\begin{verbatim}
10 changes: 6 additions & 4 deletions Chapter1/chapter1.tex
@@ -15,7 +15,7 @@ \chapter{Introduction} %Title of the First Chapter

\newacro{cnn}[\textsc{Cnn}]{Convolutional neural network}

Deep Neural Networks (DNNs), are connectionist systems that learn to perform tasks by learning on examples without having a prior knowledge about the tasks.
Deep Neural Networks (DNNs) are connectionist systems that learn to perform tasks by learning on examples without having prior knowledge about the tasks.
They easily scale to millions of data points and yet remain tractable to optimize with stochastic gradient descent.

\acp{cnn}, a variant of DNNs, have already surpassed human accuracy in the realm of image classification (e.g. \cite{he2016deep,simonyan2014very,krizhevsky2012imagenet}). Due to the capacity of \acp{cnn} to fit a wide diversity of non-linear data points, they require a large amount of training data. This often makes \acp{cnn}, and Neural Networks in general, prone to overfitting on small datasets. The model tends to fit well to the training data but is not predictive for new data. This often leaves the Neural Networks incapable of correctly assessing the uncertainty in the training data and hence leads to overly confident decisions about the correct class, prediction or action.
@@ -44,12 +44,12 @@ \section{Current Situation}
\includegraphics[height=.28\textheight]{Chapter1/Figs/weights.png}
\includegraphics[height=.28\textheight]{Chapter1/Figs/distribution.png}
\label{fig:Scalar_Bayesian_Distribution_Gluon}
\caption{Top: Each filter weight has a fixed value, as in the case of frequentist Convolutional Networks. Bottom: Each filter weight has a distribution, as in case of Bayesian Convolutional Networks. \cite{Gluon}}
\caption{Top: Each filter weight has a fixed value, as in the case of frequentist Convolutional Networks. Bottom: Each filter weight has a distribution, as in the case of Bayesian Convolutional Networks. \cite{Gluon}}
\end{center}
\end{figure}

\section{Our Hypothesis}
\newline We build our Bayesian \ac{cnn} upon \textit{Bayes by Backprop} \cite{graves2011practical,blundell2015weight}. The exact Bayesian inference on the weights of a neural network is intractable as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. So, we approximate the intractable true posterior probability distributions $p(w|\mathcal{D})$ with variational probability distributions $q_{\theta}(w|\mathcal{D})$, which comprise the properties of Gaussian distributions $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$, denoted $\mathcal{N}(\theta|\mu, \sigma^2)$, where $d$ is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance $\sigma^2$, expresses an uncertainty estimation of every model parameter. \\ \\
\newline We build our Bayesian \ac{cnn} upon \textit{Bayes by Backprop} \cite{graves2011practical,blundell2015weight}. The exact Bayesian inference on the weights of a neural network is intractable as the number of parameters is very large and the functional form of a neural network does not lend itself to exact integration. So, we approximate the intractable true posterior probability distributions $p(w|\mathcal{D})$ with variational probability distributions $q_{\theta}(w|\mathcal{D})$, which comprise the properties of Gaussian distributions $\mu \in \mathbb{R}^d$ and $\sigma \in \mathbb{R}^d$, denoted by $\mathcal{N}(\theta|\mu, \sigma^2)$, where $d$ is the total number of parameters defining a probability distribution. The shape of these Gaussian variational posterior probability distributions, determined by their variance $\sigma^2$, expresses an uncertainty estimation of every model parameter. \\ \\
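As an illustrative sketch (not the thesis implementation), a single variational weight tensor can be parameterised by a mean $\mu$ and a standard deviation $\sigma = \log(1 + \exp(\rho))$ and sampled with the reparameterisation $w = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$. The PyTorch-style layer below is a minimal, hypothetical example of this idea; the class and variable names are illustrative and the KL term of the variational objective is omitted.
\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    # Illustrative layer with a Gaussian variational posterior
    # q(w) = N(mu, sigma^2) over the weights, sampled on every forward pass.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight_mu = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        # rho parameterises sigma through a softplus so that sigma stays positive
        self.weight_rho = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.weight_rho)    # sigma = log(1 + exp(rho)) > 0
        eps = torch.randn_like(sigma)          # unbiased Monte Carlo noise
        weight = self.weight_mu + sigma * eps  # reparameterised sample of w
        return F.linear(x, weight)
\end{verbatim}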



@@ -61,12 +61,14 @@ \section{Our Contribution}
\item We empirically show that our proposed generic and reliable variational inference method for Bayesian \acp{cnn} can be applied to various \ac{cnn} architectures without any limitations on their performance.
\item We examine how to estimate the aleatoric and epistemic uncertainties and empirically show how the uncertainty can decrease, allowing the decisions made by the network to become more deterministic as the training accuracy increases.
\item We also empirically show how our method typically only doubles the number of parameters yet trains an infinite ensemble using unbiased Monte Carlo estimates of the gradients.
\item Finally, we apply L1 norm to the trained model parameters and prune the number of non zero values and further, fine-tune the model to reduce the number of model parameters without a reduction in the model prediction accuracy.
\item We also apply the L1 norm to the trained model parameters, prune the number of non-zero values, and fine-tune the model further to reduce the number of model parameters without a reduction in the model prediction accuracy.
\item Finally, we will apply the concept of Bayesian CNN to tasks like Image Super-Resolution and Generative Adversarial Networks and compare the results with other prominent architectures in the respective domains.
\end{enumerate}
This work builds on the foundations laid out by Blundell et al. \cite{blundell2015weight}, who introduced \textit{Bayes by Backprop} for feedforward neural networks. Together with the extension to recurrent neural networks, introduced by Fortunato et al. \cite{fortunato2017bayesian}, \textit{Bayes by Backprop} is now applicable to the three most frequently used types of neural networks, i.e., feedforward, recurrent, and convolutional neural networks.


\nomenclature[z-cnn]{$CNN$}{Convolutional Neural Networks}
\nomenclature[z-nn]{$NN$}{Neural Networks}
\nomenclature[z-mnist]{$MNIST$}{Modified National Institute of Standards and Technology}
\nomenclature[z-cifar]{$CIFAR$}{Canadian Institute for Advanced Research}
\nomenclature[z-dnn]{$DNN$}{Deep Neural Networks}
14 changes: 7 additions & 7 deletions Chapter2/chapter2.tex
@@ -52,7 +52,7 @@ \chapter{Background}
\item An overview of the concepts of Variational Inference and the local reparameterization trick in Bayesian Neural Networks.
\item Backpropagation in Bayesian Networks using Bayes by Backprop.
\item Estimation of Uncertainties in a network.
\item Pruning a network to reduce the number of overall parameters without affecting it's performance.
\item Pruning a network to reduce the number of overall parameters without affecting its performance.
\end{itemize}
}
}
@@ -62,7 +62,7 @@ \chapter{Background}
\section{Neural Networks}
\subsection{Brain Analogies}

A perceptron is conceived as a mathematical model of how the neurons function in our brain by a famous psychologist Rosenblatt. According to Rosenblatt, a neuron takes a set of binary inputs (nearby neurons), multiplies each input by a continuous-valued weight (the synapse strength to each nearby neuron), and thresholds the sum of these weighted inputs to output a 1 if the sum is big enough and otherwise a 0 (the same way neurons either fire or does not fire).
The perceptron was conceived by the psychologist Rosenblatt as a mathematical model of how neurons function in our brain. According to Rosenblatt, a neuron takes a set of binary inputs (from nearby neurons), multiplies each input by a continuous-valued weight (the synapse strength to each nearby neuron), and thresholds the sum of these weighted inputs to output a 1 if the sum is big enough and a 0 otherwise (the same way neurons either fire or do not fire).
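As a toy numeric illustration of this thresholding rule (the input values and weights below are hypothetical):
\begin{verbatim}
import numpy as np

def perceptron(inputs, weights, threshold=0.0):
    # Fire (output 1) if the weighted sum of the binary inputs exceeds
    # the threshold, otherwise stay silent (output 0).
    return 1 if np.dot(inputs, weights) > threshold else 0

# Three binary inputs from nearby neurons with continuous-valued weights
print(perceptron(np.array([1, 0, 1]), np.array([0.4, -0.2, 0.3])))  # 0.7 > 0 -> 1
\end{verbatim}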

\begin{figure}[H]
\begin{center}
@@ -94,7 +94,7 @@ \subsection{Convolutional Neural Network}
Also, the network between the lower order hypercomplex cells and the higher order hypercomplex cells is structurally similar to the network between the simple cells and the complex cells. In this hierarchy, a cell in a higher stage generally has a tendency to respond selectively to a more complicated feature of the stimulus pattern, while a cell at a lower stage responds to simpler features. Also, higher stage cells possess a larger receptive field and are more insensitive to the shift in the position of the stimulus pattern.

Similar to a hierarchy model, a neural network's starting layers learn simpler features like edges and corners, and subsequent layers learn complex features like colours, textures and so on. Also, higher neural units possess a larger receptive field which builds over the initial layers. However, unlike in a multilayer perceptron, where all neurons from one layer are connected with all the neurons in the next layer, weight sharing is the main idea behind a convolutional neural network.
Example: instead of each neuron having a different weight for each pixel of the input image (28*28 weights), the neurons only have a small set of weights (5*5) that is applied to a whole bunch of small subsets of the image of the same size. Layers past the first layer work in a similar way by taking in the ‘local’ features found in the previously hidden layer rather than pixel images, and successively see a larger portions of the image since they are combining information about increasingly larger subsets of the image. Finally, the final layer makes the correct prediction for the output class.
Example: instead of each neuron having a different weight for each pixel of the input image (28*28 weights), the neurons only have a small set of weights (5*5) that is applied to a whole bunch of small subsets of the image of the same size. Layers past the first layer work in a similar way by taking in the ‘local’ features found in the previous hidden layer rather than pixel images, and successively see larger portions of the image since they are combining information about increasingly larger subsets of the image. Finally, the last layer makes the correct prediction for the output class.
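The saving from weight sharing can be made concrete with a small PyTorch comparison (a hypothetical sketch; the 28*28 and 5*5 sizes follow the example above):
\begin{verbatim}
import torch.nn as nn

# Fully connected: one output neuron with its own weight for every pixel
fc = nn.Linear(28 * 28, 1, bias=False)             # 784 weights

# Convolutional: one 5*5 filter whose weights are shared across the image
conv = nn.Conv2d(1, 1, kernel_size=5, bias=False)  # 25 weights

print(sum(p.numel() for p in fc.parameters()))     # 784
print(sum(p.numel() for p in conv.parameters()))   # 25
\end{verbatim}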

The reason why this is helpful is intuitive if not mathematically clear: without such constraints, the network would have to learn the same simple things (such as detecting edges, corners, etc.) a whole bunch of times for each portion of the image. But with the constraint there, only one neuron would need to learn each simple feature - and with far fewer weights overall, it could do so much faster! Moreover, since the pixel-exact locations of such features do not matter, the neuron could basically skip neighbouring subsets of the image - subsampling, now known as a type of pooling - when applying the weights, further reducing the training time. The addition of these two types of layers - convolutional and pooling layers - is the primary distinction of Convolutional Neural Nets (CNNs/ConvNets) from plain old neural nets.

@@ -147,11 +147,11 @@ \subsection{Local Reparametrisation Trick}
\section{Uncertainties in Bayesian Learning}


Uncertainties in a network is a measure of how certain the model is with its prediction. In Bayesian modeling, there are two main types of uncertainty one can model \citep{Kiureghian}: \textit{Aleatoric} uncertainty and \textit{Epistemic} uncertainty.
Uncertainty in a network is a measure of how certain the model is about its prediction. In Bayesian modelling, there are two main types of uncertainty one can model \citep{Kiureghian}: \textit{Aleatoric} uncertainty and \textit{Epistemic} uncertainty.

\textit{Aleatoric} uncertainty measures the noise inherent in the observations. This type of uncertainty is introduced by the data collection method, like sensor noise or motion noise that is uniform along the dataset, and it cannot be reduced by collecting more data. \textit{Epistemic} uncertainty, on the other hand, represents the uncertainty caused by the model. This uncertainty can be reduced given more data and is often referred to as \textit{model uncertainty}. Aleatoric uncertainty can further be categorized into \textit{homoscedastic} uncertainty, which stays constant for different inputs, and \textit{heteroscedastic} uncertainty. Heteroscedastic uncertainty depends on the inputs to the model, with some inputs potentially having noisier outputs than others. Heteroscedastic uncertainty is particularly important because it prevents the model from outputting overly confident decisions.

Current work measures uncertainties by placing a probability distributions over either the model parameters or model outputs. Epistemic uncertainty is modelled by placing a prior distribution over a model's weights and then trying to capture how much these weights vary given some data. Aleatoric uncertainty, on the other hand, is modelled by placing a distribution over the output of the model.
Current work measures uncertainties by placing a probability distribution over either the model parameters or the model outputs. Epistemic uncertainty is modelled by placing a prior distribution over a model's weights and then trying to capture how much these weights vary given some data. Aleatoric uncertainty, on the other hand, is modelled by placing a distribution over the output of the model.
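A minimal sketch of how epistemic uncertainty could be read off such a model, assuming a network whose forward pass draws fresh weight samples on every call (the function name and sample count are illustrative):
\begin{verbatim}
import torch

def predictive_mean_and_epistemic(model, x, n_samples=20):
    # Run several stochastic forward passes; the spread of the predictions
    # across weight samples serves as a simple proxy for epistemic uncertainty.
    preds = torch.stack([torch.softmax(model(x), dim=-1)
                         for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
\end{verbatim}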

\subsubsection{Sources of Uncertainties}

@@ -169,7 +169,7 @@ \subsubsection{Sources of Uncertainities}

\section{Backpropagation}

Backpropagation in a Neural Networks was proposed by Rumelhart \cite{Rumelhart} in 1986 and it is the most commonly used method for training neural networks. Backpropagation is a technique to compute the gradient of the loss in terms of the network weights. It operates in two phase: firstly, the input features through the network propagates in the forward direction to compute the function output and thereby the loss associated with the parameters. Secondly, the derivatives of the training loss with respect to the weights
Backpropagation in Neural Networks was proposed by Rumelhart \cite{Rumelhart} in 1986 and it is the most commonly used method for training neural networks. Backpropagation is a technique to compute the gradient of the loss with respect to the network weights. It operates in two phases: firstly, the input features are propagated forward through the network to compute the function output and thereby the loss associated with the parameters. Secondly, the derivatives of the training loss with respect to the weights
are propagated back from the output layer towards the input layers.
These computed derivatives are then used to update the weights of the network. This is a continuous process, with the weight updates occurring at every iteration.
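The two phases can be seen in a tiny, self-contained PyTorch example (the values are arbitrary):
\begin{verbatim}
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # network weights
x = torch.tensor([0.5, 3.0])                       # input features
y = torch.tensor(1.0)                              # target

loss = (w @ x - y) ** 2  # phase 1: forward pass computes output and loss
loss.backward()          # phase 2: derivatives flow back to the weights
print(w.grad)            # dLoss/dw, used to update w at every iteration
\end{verbatim}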

@@ -215,6 +215,6 @@ \section{Model Weights Pruning}
efficient inference using these sparse models requires purpose-built hardware capable of loading sparse matrices and/or performing sparse matrix-vector operations. Thus the overall memory usage is reduced with the new pruned model.


There are several ways of achieving the pruned model, the most popular one is to map the low contributing weights to zero and reducing the number of overall non-zero valued weights. This can be achieved by training a large sparse model and pruning it further which makes it comparable to training a small dense model. Pruning away the less salient features to zero has been used in this thesis and is explained in details in chapter 4.
There are several ways of obtaining a pruned model; the most popular one is to map the low-contributing weights to zero, reducing the overall number of non-zero valued weights. This can be achieved by training a large sparse model and pruning it further, which makes it comparable to training a small dense model. Pruning the less salient weights to zero is used in this thesis and is explained in detail in Chapter 4.
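A minimal sketch of this magnitude-based idea, with an illustrative sparsity level and helper name (the procedure actually used is the one described in Chapter 4):
\begin{verbatim}
import torch

def magnitude_prune(weight, sparsity=0.9):
    # Map the smallest-magnitude entries to zero, keeping only the
    # most salient weights.
    k = max(1, int(weight.numel() * sparsity))
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight,
                       torch.zeros_like(weight))
\end{verbatim}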


4 changes: 2 additions & 2 deletions Chapter3/chapter3.tex
@@ -31,7 +31,7 @@ \section{Related Work}

\subsection{Bayesian Training}
Applying Bayesian methods to neural networks has been studied in the past with various approximation methods for the intractable true posterior probability distribution $p(w|\mathcal{D})$. Buntine and Weigend \cite{buntine1991bayesian} proposed various \textit{maximum-a-posteriori} (MAP) schemes for neural networks. They were also the first to suggest second-order derivatives in the prior probability distribution $p(w)$ to encourage smoothness of the resulting approximate posterior probability distribution.
In subsequent work by Hinton and Van Camp \cite{hinton1993keeping}, the first variational methods were proposed which naturally served as a regularizer in neural networks. He also mentioned that the amount of information in a weight can be controlled by adding Gaussian noise. When optimizing the trade-off between the expected error and the information in the weights, the noise level can be adapted during learning.
In subsequent work, Hinton and Van Camp \cite{hinton1993keeping} proposed the first variational methods, which naturally served as a regularizer in neural networks. They also mentioned that the amount of information in a weight can be controlled by adding Gaussian noise. When optimizing the trade-off between the expected error and the information in the weights, the noise level can be adapted during learning.


Hochreiter and Schmidhuber \cite{hochreiter1995simplifying} suggested taking an information theory perspective into account and utilising a minimum description length (MDL) loss. This penalises non-robust weights by means of an approximate penalty based upon perturbations of the weights on the outputs.
@@ -52,7 +52,7 @@ \subsection{Uncertainty Estimation}

\subsection{Model Pruning}

Some early work in the model pruning domain used a second-order Taylor approximation of the increase in the loss function of the network when a weight is set to zero \cite{lecun1990optimal}. A diagonal Hessian approximation was used to calculate the saliency for each parameter \cite{lecun1990optimal} and the low-saliency parameters were pruned from the network and the network was retrained.
Some early work in the model pruning domain used a second-order Taylor approximation of the increase in the loss function of the network when a weight is set to zero \cite{lecun1990optimal}. A diagonal Hessian approximation was used to calculate the saliency for each parameter \cite{lecun1990optimal}, the low-saliency parameters were pruned from the network, and the network was retrained.
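For reference, the diagonal-Hessian saliency of \cite{lecun1990optimal} is commonly written as
\[
  s_k \approx \tfrac{1}{2} H_{kk} w_k^2 ,
\]
where $H_{kk}$ is the $k$-th diagonal entry of the Hessian of the loss and $w_k$ is the weight considered for removal (standard form, stated here for context rather than quoted from the thesis).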


Narang \cite{DBLP:journals/corr/NarangDSE17} showed that pruned RNN and GRU models performed better for the task of speech recognition than a dense network of the original size. This result is very similar to the results obtained in our case, where a pruned model achieved better results than a normal network. However, no direct comparison can be drawn, as the model architectures used (CNN vs RNN) and the tasks (Computer Vision vs Speech Recognition) are completely different. Narang \cite{DBLP:journals/corr/NarangDSE17} also introduced a gradual pruning scheme based on pruning all the weights in a layer

