\chapter{Conclusion and Outlook}
We propose Bayesian \acp{cnn} utilizing \textit{Bayes by Backprop} as a reliable variational inference method for \acp{cnn}, which had not been studied to date, and estimate the models' aleatoric and epistemic uncertainties for prediction. Furthermore, we apply different approaches to pruning the Bayesian \ac{cnn} and compare the results with frequentist architectures.
\newline There has been previous work by Gal and Ghahramani \cite{gal2015bayesian}, who utilized the outputs of a Dropout function to define a distribution and concluded that one can then speak of a Bayesian \ac{cnn}. This approach has found a large audience, perhaps also due to its ease of use. However, we argue against this approach and point out its deficiencies. Specifically, in Gal's and Ghahramani's \cite{gal2015bayesian} approach, no prior probability distributions $p(w)$ are placed on the \ac{cnn}'s parameters. These, however, are an essential part of a Bayesian interpretation, for the simple reason that Bayes' theorem includes them. We therefore argue that starting from prior probability distributions $p(w)$ is essential in Bayesian methods. In comparison, we place prior probability distributions over all model parameters and update them according to Bayes' theorem with variational inference, specifically \textit{Bayes by Backprop}. We show that these neural networks achieve state-of-the-art results comparable to those achieved by the same network architectures trained by frequentist inference.
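\newline For reference, the objective minimized in \textit{Bayes by Backprop} can be written as the variational free energy
\begin{equation}
\mathcal{F}(\mathcal{D}, \theta) = \mathrm{KL}\big[ q_{\theta}(w|\mathcal{D}) \,\|\, p(w) \big] - \mathbb{E}_{q_{\theta}(w|\mathcal{D})}\big[ \log p(\mathcal{D}|w) \big],
\end{equation}
in which the first term, the divergence from the prior $p(w)$, has no counterpart in the dropout-based interpretation.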
\newline Furthermore, we examine how both aleatoric and epistemic uncertainties can be computed for our proposed method, and we show how epistemic uncertainty is reduced with more training data. We also compare the effect of dropout in a frequentist network to the proposed Bayesian \ac{cnn} and show the natural regularization effect of Bayesian methods. To counter the doubling of parameters (a mean and a variance per weight) in a Bayesian \ac{cnn}, compared to a single point-estimate weight in a frequentist method, we apply network pruning and show that the Bayesian \ac{cnn} performs equally well or better even when the network is pruned and the number of parameters is made comparable to a frequentist method.
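\newline As an illustration of how such uncertainty estimates can be obtained in practice, the following sketch decomposes the predictive uncertainty of a Bayesian classifier into aleatoric and epistemic parts by Monte Carlo sampling of the weights. The \texttt{model} interface (returning logits and drawing a fresh weight sample on every forward pass) is a hypothetical one and is not taken from the thesis code.
\begin{verbatim}
import torch
import torch.nn.functional as F

def predictive_uncertainty(model, x, n_samples=25):
    # Collect class probabilities under n_samples independent weight draws.
    # `model` is assumed to return logits and to sample new weights on
    # every forward pass (hypothetical interface).
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs.append(F.softmax(model(x), dim=-1))
    p = torch.stack(probs)                # shape: (T, batch, classes)
    p_mean = p.mean(dim=0)
    # Aleatoric: expected per-draw variance of the categorical output.
    aleatoric = (p * (1.0 - p)).mean(dim=0)
    # Epistemic: variance of the probabilities across weight draws.
    epistemic = ((p - p_mean) ** 2).mean(dim=0)
    return p_mean, aleatoric, epistemic
\end{verbatim}
The epistemic term shrinks as more training data constrain the posterior, which is the effect described above.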
\newline Finally, we show applications of Bayesian \acp{cnn} in various domains such as image recognition, image super-resolution and Generative Adversarial Networks (GANs), and compare the results with other popular approaches in these fields. Bayesian \acp{cnn} proved to be particularly well suited to GANs, as the prior knowledge in the discriminator network helps it better distinguish real from fake images. \\
As an add-on method to further enhance the stability of the optimization, \textit{posterior sharpening} \cite{fortunato2017bayesian} could be applied to Bayesian \acp{cnn} in future work. There, the variational posterior distribution $q_{\theta}(w|\mathcal{D})$ is conditioned on the training data of a batch $\mathcal{D}^{(i)}$. We can see $q_{\theta}(w|\mathcal{D}^{(i)})$ as a proposal distribution, or \textit{hyper-prior} when we rethink it as a hierarchical model, to improve the gradient estimates of the intractable likelihood function $p(\mathcal{D}|w)$. For the initialization of the mean and variance, a zero mean and a standard deviation of one were used, as the standard normal distribution seems the most intuitive distribution to start with. However, the experiments conducted in this thesis suggest that a zero-centred mean with a very small standard deviation performs equally well while training faster. Xavier initialization \cite{glorot2010understanding} converges faster in a frequentist network than a normal initialization, and a similar space of distributions should be explored for initializing the distributions in Bayesian networks. Other properties such as periodicity or spatial invariance can also be captured by priors in data space, and based on these properties an alternative to Gaussian process priors could be found.
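\newline A minimal sketch of the two initialization schemes discussed above is given below, assuming that the standard deviation in question is the initial posterior scale $\sigma$ and that it is parameterized as $\sigma = \log(1 + \exp(\rho))$; the function and parameter names are illustrative and not those of the thesis code.
\begin{verbatim}
import math
import torch
import torch.nn as nn

def init_variational_params(shape, init_sigma=1.0):
    # Posterior means are zero-centred; the spread is set via rho,
    # with sigma = softplus(rho) = log(1 + exp(rho)).
    mu = nn.Parameter(torch.zeros(shape))
    rho_0 = math.log(math.expm1(init_sigma))   # invert the softplus
    rho = nn.Parameter(torch.full(shape, rho_0))
    return mu, rho

# N(0, 1) start, the intuitive default:
mu_a, rho_a = init_variational_params((64, 3, 3, 3), init_sigma=1.0)
# Zero-centred with a very small spread, found to train faster:
mu_b, rho_b = init_variational_params((64, 3, 3, 3), init_sigma=0.05)
\end{verbatim}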
\newline Using a normal distribution as the prior for uncertainty estimation was also explored by Hafner et al. \cite{hafner2018reliable}, who observed that a standard normal prior causes the function posterior to generalize in unforeseen ways on inputs outside the training distribution. Adding some noise to the normal prior can help the model estimate uncertainty better. We did not encounter such cases in our experiments, but this remains an interesting area to explore in the future.
\newline The network is pruned with simple methods such as the L1 norm; further compression techniques such as vector quantization \cite{DBLP:journals/corr/GongLYB14} and group sparsity regularization \cite{DBLP:conf/nips/AlvarezS16} could be applied. In our work, we show that reducing the number of model parameters leads to better generalization of the Bayesian architecture and even improves the overall model accuracy on the test dataset. Further analysis of the pruned models did not yield concrete insights into this change in behaviour. A more detailed analysis that visualizes the patterns learned by individual neurons, groups neurons with similar behaviour, and removes the redundant ones would be a promising way to prune the model.
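\newline As a concrete example of the simplest of these criteria, the sketch below ranks the convolutional filters of a Bayesian layer by the L1 norm of their posterior mean weights and keeps only the strongest ones; the function and tensor names are illustrative and not taken from the thesis code.
\begin{verbatim}
import torch

def prune_filters_by_l1(weight_mu, keep_ratio=0.5):
    # weight_mu: posterior means, shape (out_channels, in_channels, k, k).
    out_channels = weight_mu.shape[0]
    l1 = weight_mu.abs().reshape(out_channels, -1).sum(dim=1)
    n_keep = max(1, int(keep_ratio * out_channels))
    keep_idx = torch.argsort(l1, descending=True)[:n_keep]
    # Return the retained filters and their indices so the corresponding
    # variance parameters and downstream channels can be pruned as well.
    return weight_mu[keep_idx], keep_idx
\end{verbatim}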
\newline In our work, the concept of a Bayesian \ac{cnn} is applied to the discriminator network of a GAN and shows good initial results. However, Bayesian generator networks in GANs remain to be investigated.