-
Notifications
You must be signed in to change notification settings - Fork 83
/
chapter5.tex
376 lines (280 loc) · 24.6 KB
/
chapter5.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
\chapter{Empirical Analysis}
\fbox{
\parbox{\textwidth}
{
Chapter Overview
\begin{itemize}
\item Applying Bayesian \acp{cnn} for the task of Image Recognition on MNIST, CIFAR-10, CIFAR-100 and STL-10 datasets.
\item Comparison of results of Bayesian \acp{cnn} with Normal \ac{cnn} architectures on similar datasets.
\item Regularization effect of Bayesian Network with dropouts.
\item Distribution of mean and variance in Bayesian \ac{cnn} over time.
\item Parameters comparison before and after model pruning.
\end{itemize}
}
}
\pagebreak
\section{Experimentation Methodology} \label{experiments}
\subsubsection{Activation Function}
The originally chosen activation functions in all architectures are \textit{ReLU}, but we must introduce another, called \textit{Softplus}, see \eqref{softplus}, because of our method to apply two convolutional or fully-connected operations. As aforementioned, one of these is determining the mean $\mu$, and the other the variance $\alpha \mu^2$. Specifically, we apply the \textit{Softplus} function because we want to ensure that the variance $\alpha \mu^2$ never becomes zero. This would be equivalent to merely calculating the MAP, which can be interpreted as equivalent to a maximum likelihood estimation (MLE), which is further equivalent to utilising single point-estimates, hence frequentist inference. The \textit{Softplus} activation function is a smooth approximation of \textit{ReLU}. Although it is practically not influential, it has the subtle and analytically important advantage that it never becomes zero for $x \rightarrow -\infty$, whereas \textit{ReLU} becomes zero for $x \rightarrow -\infty$.
\\
\begin{equation}\label{softplus}
\text{Softplus}(x) = \frac{1}{\beta} \cdot \log \big ( 1 + \exp(\beta \cdot x) \big )
\end{equation}
\\
where $\beta$ is by default set to $1$.
\newline All experiments are performed with the same hyper-parameters settings as stated in the Appendix.
\subsubsection{Network Architecture}
For all conducted experiments, we implement the foregoing description of Bayesian \acp{cnn} with variational inference in LeNet-5 \cite{lecun1998gradient} and AlexNet \cite{krizhevsky2012imagenet}. The exact architecture specifications can be found in the Appendix and in our GitHub repository\footnote{\url{https://github.com/kumar-shridhar/PyTorch-BayesianCNN}}.
\subsubsection{Objective Function}
To learn the objective function, we use \textit{Bayes by Backprop} \cite{graves2011practical, blundell2015weight}, which is a variational inference method to learn the posterior distribution on the weights $w \sim q_{\theta}(w|\mathcal{D})$ of a neural network from which weights $w$ can be sampled in backpropagation.
It regularises the weights by minimising a compression cost, known as the variational free energy or the expected lower bound on the marginal likelihood.
We tackled the problem of intractability in Chapter 2 and consequently, we arrive at the tractable cost function \eqref{cost} which is aimed to be optimized, i.e. minimised w.r.t. $\theta$, during training:
\begin{equation} \label{cost}
\mathcal{F}(\mathcal{D}, \theta)\approx \sum_{i=1}^n \log q_{\theta}(w^{(i)}|\mathcal{D})-\log p(w^{(i)})-\log p(\mathcal{D}|w^{(i)})
\end{equation}
%
where $n$ is the number of draws.
Let's break the Objective Function \eqref{cost} and discuss in more details.
\subsubsection{Variational Posterior }
The first term in the equation \eqref{cost} is the variational posterior. The variational posterior is taken as Gaussian distribution centred around mean $\mu$ and variance as $\sigma^2$.
\begin{equation}
q_{\theta}(w^{(i)}|\mathcal{D})= \prod_{i} \mathcal{N}(w_{i} | \mu,\sigma^2)
\end{equation}
We will take the log and the log posterior is defined as :
\begin{equation}
log(q_{\theta}(w^{(i)}|\mathcal{D}))= \sum_{i}log \mathcal{N}(w_{i} | \mu,\sigma^2)
\end{equation}
\subsubsection{Prior}
The second term in the equation \eqref{cost} is the prior over the weights and we define the prior over the weights as a product of individual Gaussians :
\begin{equation}
p(w^{(i)})= \prod_{i} \mathcal{N}(w_{i} | 0,\sigma_{p}^2)
\end{equation}
We will take the log and the log prior is defined as:
\begin{equation}
log (p(w^{(i)}))= \sum_{i} log \mathcal{N}(w_{i} | 0,\sigma_{p}^2)
\end{equation}
\subsubsection{ Likelihood }
The final term of the equation \eqref{cost} $\log p(\mathcal{D}|w^{(i)})$ is the likelihood term and is computed using the softmax function.
\subsubsection{Parameter Initialization}
We use a Gaussian distribution and we store mean and variance values instead of just one weight. The way mean $\mu$ and variance $\sigma$ is computed is defined in the previous chapter. Variance cannot be negative and it is ensured by using \textit{softplus} as the activation function. We express variance $\sigma$ as $\sigma_{i}=softplus(\rho_{i})$ where $\rho$ is an unconstrained parameter. \\
We take the Gaussian distribution and initialize mean $\mu$ as 0 and variance $\sigma$ (and hence $\rho$) randomly. We observed mean centred around 0 and a variance starting with a big number and gradually decreasing over time. A good initialization can also be to put a restriction on variance and initialize it small. However, it might be data dependent and a good method for variance initialization is still to be discovered. We perform gradient descent over $\theta$ = ($\mu$, $\rho$), and individual weight $w_{i} \sim \mathcal{N} (w_{i} | \mu_{i}, \sigma_{i}$).
\subsubsection{Optimizer}
For all our tasks, we take Adam optimizer \cite{kingma2014adam} to optimize the parameters. We also perform the local reparameterization trick as mentioned in the previous section and take the gradient of the combined loss function with respect to the variational parameters ($\mu$, $\rho$).
\subsubsection{Model Pruning}
We take the weights of all the layers of the network, apply an L1 norm over it and for all the weights value as zero or below a defined threshold are removed and the model is pruned. \\ Also, since the Bayesian \acp{cnn} has twice the number of parameters ($\mu$, $\sigma$) compared to a frequentist network (only 1 weight), we reduce the size of our network to half (AlexNet and LeNet- 5) by reducing the number of filters to half. The architecture used is mentioned in the Appendix.
\textit{Please note that it can be argued that reducing the number of filters to be half is a method for pruning or not. It can be seen as a method that reduces the number of overall parameters and hence can be thought of a pruning method in some sense. However, it is a subject to argument.}
\section{Case Study 1: Small Datasets (MNIST, CIFAR-10)}
We train the networks with the MNIST dataset of handwritten digits \cite{lecun1998gradient}, and CIFAR-10 dataset \cite{krizhevsky2009learning} since these datasets serve widely as benchmarks for \acp{cnn}' performances.
\subsection{Datasets}
\newline
\subsubsection{MNIST}
The MNIST database \cite{lecun-mnisthandwrittendigit-2010} of handwritten digits have a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centred in a fixed-size image of 28 by 28 pixels. Each image is grayscaled and is labelled with its corresponding class that ranges from zero to nine.
\newline
\subsubsection{CIFAR-10}
The CIFAR-10 are labelled subsets of the 80 million tiny images dataset \cite{Torralba:2008:MTI:1444381.1444403}. The CIFAR-10 dataset has a training dataset of 50,000 colour images in 10 classes, with 5,000 training images per class, each image 32 by 32 pixels large. There are 10000 images for testing.
\newline
\subsection{Results}
First, we evaluate the performance of our proposed method, Bayesian \acp{cnn} with variational inference. Table \ref{tab:results} shows a comparison of validation accuracies (in percentage) for architectures trained by two disparate Bayesian approaches, namely variational inference, i.e. \textit{Bayes by Backprop} and Dropout as proposed by Gal and Ghahramani \cite{gal2015bayesian}.\\
We compare the results of these two approaches to frequentist inference approach for both the datasets. Bayesian \acp{cnn} trained by variational inference achieve validation accuracies comparable to their counter-architectures trained by frequentist inference. On MNIST, validation accuracies of the two disparate Bayesian approaches are comparable, but a Bayesian LeNet-5 with Dropout achieves a considerable higher validation accuracy on CIFAR-10, although we were not able to reproduce these reported results.
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c c c }
\hline
\empty & MNIST & CIFAR-10 \\ [0.75ex]
\hline
Bayesian AlexNet (with VI) & 99 & 73 \\
Frequentist AlexNet & 99 & 73 \\
\hdashline
Bayesian LeNet-5 (with VI) & 98 & 69 \\
Frequentist LeNet-5 & 98 & 68 \\
\hdashline
Bayesian LeNet-5 (with Dropout) & 99.5 & 83 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani \cite{gal2015bayesian} for MNIST, and CIFAR-10.}
\label{tab:results}
\end{table}
\begin{figure}[t!]
\begin{center}
\includegraphics[width=\linewidth]{Chapter5/Figs/results_mnist_CIFAR10.png}
\caption{Comparison of Validation Accuracies of Bayesian AlexNet and LeNet-5 with frequentist approach on MNIST and CIFAR-10 datasets}
\label{fig:MnistCIFAR10reesults}
\end{center}
\end{figure}
\newline Figure \ref{fig:MnistCIFAR10reesults} shows the validation accuracies of Bayesian vs Non-Bayesian \acp{cnn}. One thing to observe is that in initial epochs, Bayesian \acp{cnn} trained by variational inference start with a low validation accuracy compared to architectures trained by frequentist inference. This must deduce from the initialization of the variational posterior probability distributions $q_{\theta}(w|\mathcal{D})$ as uniform distributions, while initial point-estimates in architectures trained by frequentist inference are randomly drawn from a standard Gaussian distribution. (For uniformity, we changed the initialization of frequentist architectures from Xavier initialization to standard Gaussian). The latter initialization method ensures the initialized weights are neither too small nor too large. In other words, the motivation of the latter initialization is to start with weights such that the activation functions do not let them begin in saturated or dead regions. This is not true in case of uniform distributions and hence, Bayesian \acp{cnn}' starting validation accuracies can be comparably low.
\pagebreak
\newline Figure \ref{fig:std_CNN} displays the convergence of the standard deviation $\sigma$ of the variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$ of a random model parameter over epochs. As aforementioned, all prior probability distributions $p(w)$ are initialized as uniform distributions. The variational posterior probability distributions $q_{\theta}(w|\mathcal{D})$ are approximated as Gaussian distributions which become more confident as more data is processed - observable by the decreasing standard deviation over epochs in Figure \ref{fig:std_CNN}. Although the validation accuracy for MNIST on Bayesian LeNet-5 has already reached 99\%, we can still see a fairly steep decrease in the parameter's standard deviation. In Figure \ref{fig:distribution}, we plot the actual Gaussian variational posterior probability distributions $q_{\theta}(w|\mathcal{D})$ of a random parameter of LeNet-5 trained on CIFAR-10 at some epochs.
%
\begin{figure}[H]
\begin{center}
\includegraphics[width=\linewidth]{Chapter5/Figs/std_CNN.png}
\caption{Convergence of the standard deviation of the Gaussian variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. MNIST is trained on Bayesian LeNet-5.}
\label{fig:std_CNN}
\end{center}
\end{figure}
%
\begin{figure}[H]
\begin{center}
\includegraphics[width=\linewidth]{Chapter5/Figs/distribution.png}
\caption{Convergence of the Gaussian variational posterior probability distribution $q_{\theta}(w|\mathcal{D})$ of a random model parameter at epochs 1, 5, 20, 50, and 100. CIFAR-10 is trained on Bayesian LeNet-5.}
\label{fig:distribution}
\end{center}
\end{figure}
\newline Figure \ref{fig:distribution} displays the convergence of the Gaussian variational probability distribution of a weight taken randomly from the first layer of LeNet-5 architecture. The architecture is trained on CIFAR-10 dataset with uniform initialization.
\section{Case Study 2: Large Dataset (CIFAR-100)}
\subsection{Dataset}
\subsubsection{CIFAR-100}
This dataset is similar to the CIFAR-10 and is a labelled subset of the 80 million tiny images dataset \cite{Torralba:2008:MTI:1444381.1444403}. The dataset has 100 classes containing 600 images each. There are 500 training images and 100 validation images per class. The images are coloured with a resolution of 32 by 32 pixels.
\subsection{Results}
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c }
\hline
\empty & CIFAR-100 \\ [0.75ex]
\hline
Bayesian AlexNet (with VI) & 36 \\
Frequentist AlexNet & 38 \\
Bayesian LeNet-5 (with VI) & 31 \\
Frequentist LeNet-5 & 33 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Comparison of validation accuracies (in percentage) for different architectures with variational inference (VI), frequentist inference and Dropout as a Bayesian approximation as proposed by Gal and Ghahramani \cite{gal2015bayesian} for MNIST, CIFAR-10, and CIFAR-100.}
\label{tab:resultsCIFAR-100}
\end{table}
In Figure \ref{fig:regularization}, we show how Bayesian networks incorporate naturally effects of regularization, exemplified on AlexNet. While an AlexNet trained by frequentist inference without any regularization overfits greatly on CIFAR-100, an AlexNet trained by Bayesian inference on CIFAR-100 does not. This is evident from the high value of training accuracy for frequentist approach with no dropout or 1 layer dropout. Bayesian CNN performs equivalently to an AlexNet trained by frequentist inference with three layers of Dropout after the first, fourth, and sixth layers in the architecture.
Another thing to note here is that the Bayesian CNN with 100 samples overfits slightly lesser compared to Bayesian CNN with 25 samples. However, a higher sampling number on a smaller dataset didn't prove useful and we stuck with 25 as the number of samples for all other experiments.
\begin{figure}[H]
\begin{center}
\includegraphics[width=\linewidth]{Chapter5/Figs/results_train_test_cifar100.png}
\caption{Comparison of Training and Validation Accuracies of Bayesian AlexNet and LeNet-5 with frequentist approach with and without dropouts on CIFAR-100 datasets}
\label{fig:regularization}
\end{center}
\end{figure}
Table \ref{tab:tableCIFAR100} shows a comparison of the training and validation accuracies for AlexNet with Bayesian approach and frequentist approach. The low gap between the training and validation accuracies shows the robustness of Bayesian approach towards overfitting and shows how Bayesian approach without being regularized overfits lesser as compared to frequentist architecture with no or one dropout layer. The results are comparable with AlexNet architecture with 3 dropout layers.
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c c c }
\hline
\empty & Training Accuracy & Validation Accuracy \\ [0.75ex]
\hline
Frequentist AlexNet (No dropout) & 83 & 38 \\
Frequentist AlexNet (1 dropout layer) & 72 & 40 \\
Frequentist AlexNet (3 dropout layer) & 39 & 38 \\
\hdashline
Bayesian AlexNet (25 num of samples) & 54 & 37 \\
Bayesian AlexNet (100 num of samples) & 48 & 37 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Comparison of training and validation accuracies (in percentage) for AlexNet architecture with variational inference (VI) and frequentist inference for CIFAR-100.}
\label{tab:tableCIFAR100}
\end{table}
\section{Uncertainity Estimation}
\newline Finally, Table \ref{tab:uncertainty} compares the means of aleatoric and epistemic uncertainties for a Bayesian LeNet-5 with variational inference on MNIST and CIFAR-10. The aleatoric uncertainty of CIFAR-10 is about twenty times as large as that of MNIST. Considering that the aleatoric uncertainty measures the irreducible variability and depends on the predicted values, a larger aleatoric uncertainty for CIFAR-10 can be directly deduced from its lower validation accuracy and may be further due to the smaller number of training examples. The epistemic uncertainty of CIFAR-10 is about fifteen times larger than that of MNIST, which we anticipated since epistemic uncertainty decreases proportionally to validation accuracy.
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c c }
\hline
\empty & Aleatoric uncertainty & Epistemic uncertainty \\ [0.75ex]
\hline
Bayesian LeNet-5 (MNIST) & 0.0096 & 0.0026 \\
Bayesian LeNet-5 (CIFAR-10) & 0.1920 & 0.0404 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Aleatoric and epistemic uncertainty for Bayesian LeNet-5 calculated for MNIST and CIFAR-10, computed as proposed by Kwon et al. \cite{kwon2018uncertainty}.}
\label{tab:uncertainty}
\end{table}
\section{Model Pruning}
\subsubsection{Halving the Number of Filters}
For every parameter for a frequentist inference network, Bayesian \acp{cnn} has two parameters ($\mu$, $\sigma$). Halving the number of parameters of Bayesian AlexNet ensures the number of parameters of it is comparable with a frequentist inference network. The number of filters of ALexNet is halved and a new architecture called AlexNetHalf is defined in Figure 5.4.
\begin{table}[h!]
\centering
\renewcommand{\arraystretch}{2}
\begin{tabular}{c c c c c c}
\hline
layer type & width & stride & padding & input shape & nonlinearity \\ [0.5ex]
\hline
convolution ($11\times11$) & 32 & 4 & 5 & $M\times3\times32\times32$ & Softplus \\
max-pooling ($2\times2$) & \empty & 2 & 0 & $M\times32\times32\times32$ & \empty \\
convolution ($5\times5$) & 96 & 1 & 2 & $M\times32\times15\times15$ & Softplus \\
max-pooling ($2\times2$) & \empty & 2 & 0 & $M\times96\times15\times15$ & \empty \\
convolution ($3\times3$) & 192 & 1 & 1 & $M\times96\times7\times7$ & Softplus \\
convolution ($3\times3$) & 128 & 1 & 1 & $M\times192\times7\times7$ & Softplus \\
convolution ($3\times3$) & 64 & 1 & 1 & $M\times128\times7\times7$ & Softplus \\
max-pooling ($2\times2$) & \empty & 2 & 0 & $M\times64\times7\times7$ & \empty \\
fully-connected & 64 & \empty & \empty & $M\times64$ & \empty \\ [1ex]
\hline
\end{tabular}
\renewcommand{\arraystretch}{1.5}
\label{tab:AlexNetHalfArchitecture}
\caption{AlexNetHalf with number of filters halved compared to the original architecture.}
\end{table}
The AlexNetHalf architecture was trained and validated on the MNIST, CIFAR10 and CIFAR100 dataset and the results are shown in Table \ref{tab:resultsAlexNetHalf}. The accuracy of pruned AlexNet with only half the number of filters compared to the normal architecture shows an accuracy gain of 6 per cent in case of CIFAR10 and equivalent performance for MNIST and CIFAR100 datasets. A lesser number of filters learn the most important features which proved better at inter-class classification could be one of the explanations for the rise in accuracy. However, upon visualization of the filters, no distinct clarification can be made to prove the previous statement. \\
Another possible explanation could be the model is generalizing better after the reduction in the number of filters ensuring the model is not overfitting and validation accuracy is comparatively higher. CIFAR-100 higher validation accuracy on ALexNetHalf and a lower training accuracy than Bayesian AlexNet proves the theory. Using a lesser number of filters further enhances the regularization effect and makes the model more robust against overfitting. Similar results have been achieved by Narang \cite{DBLP:journals/corr/NarangDSE17} in his work where a pruned model achieved better accuracy compared to the original architecture in a speech recognition task. Suppressing or removing the weights that have lesser or no contribution to the prediction makes the model rely its prediction on the most prominent and unique features and hence improves the prediction accuracy.
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c c c }
\hline
\empty & MNIST & CIFAR-10 & CIFAR-100 \\ [0.75ex]
\hline
Bayesian AlexNet (with VI) & 99 & 73 & 36 \\
Frequentist AlexNet & 99 & 73 & 38 \\
Bayesian AlexNetHalf (with VI) & 99 & 79 & 38 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Comparison of validation accuracies (in percentage) for AlexNet with variational inference (VI), AlexNet with frequentist inference and AlexNet with half number of filters halved for MNIST, CIFAR-10 and CIFAR-100 datasets.}
\label{tab:resultsAlexNetHalf}
\end{table}
\subsubsection{Applying L1 Norm}
L1 norm induces sparsity in the trained model parameters and sets some values to zero. We trained a model to some epochs (number of epochs differs across datasets as we applied early stopping when validation accuracy remains unchanged for 5 epochs). We removed the zero-valued parameters of the learned weights and keep the non-zero parameters for a trained Bayesian AlexNet on MNIST and CIFAR-10 datasets. We pruned the model to make the number of parameters in a Bayesian Network comparable to the number of parameters in the point-estimate architecture. \\ Table \ref{tab:resultsL1Norm} shows the comparison of validation accuracies of the applied L1 Norm AlexNet Bayesian architecture with Bayesian AlexNet architecture and with AlexNet frequentist architecture. We got comparable results on MNIST and CIFAR10 with the experiments and the results are shown in Table \ref{tab:resultsL1Norm}
\begin{table}[H]
\tiny
\centering
\renewcommand{\arraystretch}{1.5}
\resizebox{\linewidth}{!}{
\begin{tabular}{ l c c c }
\hline
\empty & MNIST & CIFAR-10 \\ [0.75ex]
\hline
Bayesian AlexNet (with VI) & 99 & 73 \\
Frequentist AlexNet & 99 & 73 \\
Bayesian AlexNet with L1 Norm (with VI) & 99 & 71 \\
\hline \\
\end{tabular}}
\renewcommand{\arraystretch}{1.5}
\caption{Comparison of validation accuracies (in percentage) for AlexNet with variational inference (VI), AlexNet with frequentist inference and BayesianAlexNet with L1 norm applied for MNIST and CIFAR-10 datasets.}
\label{tab:resultsL1Norm}
\end{table}
One thing to note here is that the numbers of parameters of Bayesian Network after applying L1 norm is not necessarily equal to the number of parameters in the frequentist AlexNet architecture. It depends on the data size and the number of classes. However, the number of parameters in the case of MNIST and CIFAR-10 are pretty comparable and there is not much reduction in the accuracy either. Also, the early stopping was applied when there is no change in the validation accuracy for 5 epochs and the model was saved and later pruned with the application of L1 norm.
\section{Training Time}
Training time of a Bayesian \acp{cnn} is twice of a frequentist network with similar architecture when the number of samples is equal to one. In general, the training time of a Bayesian \acp{cnn}, $T$ is defined as:
\begin{align}
T = 2 * number\_of\_samples * t
\end{align}
where $t$ is the training time of a frequentist network.
The factor of 2 is present due to the double learnable parameters in a Bayesian CNN network i.e. mean and variance for every single point estimate weight in the frequentist network.
However, there is no difference in the inference time for both the networks.
\ifpdf
\graphicspath{{Chapter2/Figs/Raster/}{Chapter2/Figs/PDF/}{Chapter2/Figs/}}
\else
\graphicspath{{Chapter2/Figs/Vector/}{Chapter2/Figs/}}
\fi