---
format:
  html:
    number-depth: 3
    css: summary-format.css
---

# Generative Models for Classification

Instead of predicting the **posterior probability** $Pr(Y=k|X=x)$ directly, these models estimate the distribution of the predictors $X$ separately within each response class $Y$, that is, $f_{k}(X) = Pr(X|Y=k)$. They then use **Bayes' Theorem** together with the **overall or prior probability** $\pi_{k}$ (the probability that a randomly chosen observation comes from the $k$th class) to flip these around into estimates of $Pr(Y=k|X=x)$, thereby approximating the ***Bayes Classifier***, which has the lowest ***total error rate***.

$$
p_{k}(x) = Pr(Y = k | X = x) =
  \frac{\pi_{k} f_{k}(x)}
       {\sum_{l=1}^{K} \pi_{l} f_{l}(x)}
$$

Estimating the *prior probability* can be as easy as calculating $\hat{\pi}_{k} = n_{k}/n$ for each class of $Y$, assuming the training data is representative of the population. Estimating the density function of $X$ within each class, $f_{k}$, is more challenging, so models need to make additional simplifying assumptions to estimate it.

## Linear Discriminant Analysis (LDA)

This model assumes that:

- The density function of $X$ within each $Y$ class, $f_{k}$, follows a **Normal (Gaussian) distribution**. Even so, the model is often remarkably robust to violations of this assumption, for example with Boolean predictors.
- $X$ has a **different mean** across the $Y$ classes: $\mu_{1} \neq \dots \neq \mu_{K}$.
- $X$ has a **common variance** across all $Y$ classes: $\sigma_{1}^2 = \dots = \sigma_{K}^2$.

To understand how the model calculates its parameters, let's look at the **discriminant function** when the number of predictors is $p=1$ and the number of $Y$ classes is $K=2$. It equals $\log(p_{k}(x))$ up to additive terms that do not depend on $k$:

$$
\begin{split}
\delta_{k}(x)
  & = \log{ \left( p_{k}(x) \right)} \\
  & = \log{(\pi_{k})} - \frac{\mu_{k}^2}{2\sigma^2} + x \cdot \frac{\mu_{k}}{\sigma^2}
\end{split}
$$

From this function it is clear that a class $k$ is more likely to be selected as the mean of $x$ for that particular class increases and the variance decreases. It is also important to take into consideration the effect of $\log{(\pi_{k})}$: the class proportions also influence the results.

If we want to extend the model to work with $p > 1$ predictors, we also need to assume that:

- Each individual predictor follows a one-dimensional normal distribution.
- There may be some correlation between each pair of predictors.

As a result, the *discriminant function* is:

$$
\begin{split}
\delta_{k}(x)
  & = \log{\pi_{k}} - \frac{1}{2} \mu_{k}^T \Sigma^{-1} \mu_{k} \\
  & \quad + x^T \Sigma^{-1} \mu_{k}
\end{split}
$$

- Where:
  - $x$ is the vector with the current value of each of the $p$ predictors.
  - $\mu_{k}$ is the vector with the mean of each predictor in class $k$.
  - $\Sigma$ is the $p \times p$ covariance matrix $\text{Cov}(X)$, common to all classes.

The model can also be extended to handle $K > 2$: after defining the $K$th class as the baseline, the *discriminant function* takes the form:

$$
\begin{split}
\delta_{k}(x)
  & = \log{ \left( \frac{Pr(Y = k|X=x)}{Pr(Y=K|X=x)} \right)} \\
  & = \log{ \left( \frac{\pi_{k}}{\pi_{K}} \right)}
    - \frac{1}{2} (\mu_{k} + \mu_{K})^T \Sigma^{-1} (\mu_{k} - \mu_{K}) \\
  & \quad + x^{T} \Sigma^{-1} (\mu_{k} - \mu_{K})
\end{split}
$$
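To make the $p=1$, $K=2$ discriminant function concrete, here is a minimal sketch, not taken from the book's labs, that estimates $\hat{\pi}_{k}$, $\hat{\mu}_{k}$ and the pooled $\hat{\sigma}^2$ by hand on simulated data and then evaluates $\delta_{k}(x)$ for a new observation. The data and the names `lda_train`, `delta` and `x_new` are purely illustrative.

```{r}
# Minimal illustrative sketch: the p = 1, K = 2 discriminant function "by hand"
set.seed(1)
lda_train <- data.frame(
  y = rep(c("A", "B"), each = 50),
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 2))
)

# Priors pi_k = n_k / n and class means mu_k
pi_hat <- prop.table(table(lda_train$y))
mu_hat <- tapply(lda_train$x, lda_train$y, mean)

# Pooled (common) variance estimate
n <- nrow(lda_train)
K <- length(pi_hat)
sigma2_hat <- sum((lda_train$x - mu_hat[lda_train$y])^2) / (n - K)

# delta_k(x) = log(pi_k) - mu_k^2 / (2 sigma^2) + x * mu_k / sigma^2
delta <- function(x, k) {
  log(pi_hat[k]) - mu_hat[k]^2 / (2 * sigma2_hat) + x * mu_hat[k] / sigma2_hat
}

# Classify a new observation by picking the class with the largest delta_k(x)
x_new <- 1.3
scores <- sapply(names(pi_hat), function(k) delta(x_new, k))
names(which.max(scores))
```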
### Coding example

To perform **LDA** we just need to create the model specification by loading the **discrim** package and using the **MASS** engine.

```{r}
library(tidymodels)
library(ISLR)
library(discrim)

# Split the Smarket data into a training set (before 2005) and a test set (2005)
Smarket_train <- Smarket %>% filter(Year != 2005)
Smarket_test  <- Smarket %>% filter(Year == 2005)

# Model specification: linear discriminant analysis via the MASS engine
lda_spec <- discrim_linear() %>%
  set_mode("classification") %>%
  set_engine("MASS")

# Fit on the training set and attach predictions to the test set
SmarketLdaPredictions <- lda_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketLdaPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketLdaPredictions, truth = Direction, estimate = .pred_class)
```

## Quadratic Discriminant Analysis (QDA)

Like LDA, the QDA classifier plugs estimates of the parameters into Bayes' theorem in order to make predictions, and it assumes that:

- The observations from each class are drawn from a Gaussian distribution.
- Each class has its own **covariance matrix**: $X \sim N(\mu_{k}, \Sigma_{k})$.

Under this assumption, the Bayes classifier assigns an observation $X = x$ to the class for which $\delta_{k}(x)$ is largest.

$$
\begin{split}
\delta_{k}(x) = &
  \quad \log{\pi_{k}} - \frac{1}{2} \log{|\Sigma_{k}|} - \frac{1}{2} \mu_{k}^T \Sigma_{k}^{-1}\mu_{k} \\
  & + x^T \Sigma_{k}^{-1} \mu_{k} \\
  & - \frac{1}{2} x^T \Sigma_{k}^{-1} x
\end{split}
$$

As a consequence, QDA is more flexible than LDA and has the potential to be more accurate in settings where interactions among the predictors are important for discriminating between classes, or when we need non-linear decision boundaries.

The model can also be extended to handle $K > 2$: after defining the $K$th class as the baseline, the *discriminant function* takes the form:

$$
\log{ \left( \frac{Pr(Y = k|X=x)}{Pr(Y=K|X=x)} \right)} =
  a_k + \sum_{j=1}^{p} b_{kj} x_{j} + \sum_{j=1}^{p} \sum_{l=1}^{p} c_{kjl} x_{j} x_{l}
$$

Where $a_k$, $b_{kj}$ and $c_{kjl}$ are functions of $\pi_{k}$, $\pi_{K}$, $\mu_{k}$, $\mu_{K}$, $\Sigma_{k}$ and $\Sigma_{K}$.

### Coding example

To perform **QDA** we just need to create the model specification by loading the **discrim** package and using the **MASS** engine.

```{r}
# Model specification: quadratic discriminant analysis via the MASS engine
qda_spec <- discrim_quad() %>%
  set_mode("classification") %>%
  set_engine("MASS")

# Fit on the training set and attach predictions to the test set
SmarketQdaPredictions <- qda_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketQdaPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketQdaPredictions, truth = Direction, estimate = .pred_class)
```

## Naive Bayes

To estimate $f_{k}(X)$, this model assumes that *within the $k$th class, the $p$ predictors are independent* (correlation = 0) and, as a consequence:

$$
f_{k}(x) = f_{k1}(x_{1}) \times f_{k2}(x_{2}) \times \dots \times f_{kp}(x_{p})
$$

Even though the assumption might not be true, the model often leads to pretty decent results, especially in settings where **_n_ is not large enough relative to _p_** for us to effectively estimate the joint distribution of the predictors within each class. It has been used to classify **text data**, for example, to predict whether an **email is spam or not**.

To estimate the one-dimensional density function $f_{kj}$ using training data we have the following options:

- We can assume that $X_{j}|Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$, as illustrated in the sketch after this list.
- We can estimate the distribution by defining bins and creating a histogram.
- We can estimate the distribution by using a kernel density estimator.
- If $X_{j}$ is **qualitative**, we can count the proportion of training observations of the $j$th predictor corresponding to each class.
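To make the first option concrete, here is a minimal sketch on simulated data that estimates $\hat{\pi}_{k}$ and per-class Gaussian parameters for each predictor, then combines them into the product $\pi_{k} f_{k1}(x_{1}) f_{k2}(x_{2})$ for a new observation. The data and the names `nb_train` and `nb_score` are purely illustrative.

```{r}
# Minimal illustrative sketch: Gaussian naive Bayes "by hand" with p = 2
# predictors, assuming independence of the predictors within each class
set.seed(1)
nb_train <- data.frame(
  y  = rep(c("A", "B"), each = 50),
  x1 = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  x2 = c(rnorm(50, mean = 1), rnorm(50, mean = -1))
)

# Priors pi_k and per-class mean/sd of each predictor
pi_hat <- prop.table(table(nb_train$y))
mu1 <- tapply(nb_train$x1, nb_train$y, mean)
sd1 <- tapply(nb_train$x1, nb_train$y, sd)
mu2 <- tapply(nb_train$x2, nb_train$y, mean)
sd2 <- tapply(nb_train$x2, nb_train$y, sd)

# Unnormalised posterior for class k: pi_k * f_k1(x1) * f_k2(x2)
nb_score <- function(x1, x2, k) {
  pi_hat[k] * dnorm(x1, mu1[k], sd1[k]) * dnorm(x2, mu2[k], sd2[k])
}

# Normalise the scores to get the estimated Pr(Y = k | X = x)
scores <- sapply(names(pi_hat), function(k) nb_score(x1 = 0.5, x2 = 0.2, k))
scores / sum(scores)
```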
The model can also be extended to handle $K > 2$: after defining the $K$th class as the baseline, the log odds take the form:

$$
\log{ \left( \frac{Pr(Y = k|X=x)}{Pr(Y=K|X=x)} \right)} =
  \log{ \left( \frac{\pi_{k}}{\pi_{K}} \right)} +
  \log{ \left( \frac{\prod_{j=1}^{p} f_{kj}(x_{j})}
                    {\prod_{j=1}^{p} f_{Kj}(x_{j})} \right)}
$$

### The infrequent problem

The method has the problem that if you don't have any example of a particular event in your training set, it estimates the probability of that event as 0, and that zero vetoes the whole product.

![](img/33-naive-bayes-infrequent-problem.png)

The solution to this problem involves adding a small number, usually 1, to each event and outcome combination to eliminate this veto power. This is called the **Laplace correction** or *Laplace estimator*. After adding this correction, each Venn diagram now has at least a small bit of overlap, so there is no longer any joint probability of zero.

![](img/34-naive-bayes-Laplace-correction.png)

### Pre-processing

This method works better with categories, so if your data has *numeric data* try to **bin** it into categories by:

- Turning an age variable into the 'child' or 'adult' categories.
- Turning geographic coordinates into geographic regions like 'West' or 'East'.
- Turning test scores into four groups by percentile.
- Turning the hour into 'morning', 'afternoon' and 'evening'.
- Turning temperature into 'cold', 'warm' and 'hot'.

As this method works really well when we have few examples and many predictors, we can transform **text documents** into a *Document Term Matrix (DTM)* using a **bag-of-words model** with a package like `tidytext` or `tm`.

### Coding example

To perform **Naive Bayes** we just need to create the model specification by loading the **discrim** package and using the **klaR** engine. We can apply the *Laplace correction* by setting `Laplace = 1` in the `parsnip::naive_Bayes()` function, as shown after the main example.

```{r}
# Model specification: naive Bayes via the klaR engine,
# using Gaussian densities rather than kernel density estimates
nb_spec <- naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = FALSE)

# Fit on the training set and attach predictions to the test set
SmarketNbPredictions <- nb_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

conf_mat(SmarketNbPredictions, truth = Direction, estimate = .pred_class)
accuracy(SmarketNbPredictions, truth = Direction, estimate = .pred_class)
```
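The chunk below is a small variation of the previous specification that requests the Laplace correction through the `Laplace` argument of `parsnip::naive_Bayes()`, which the klaR engine translates into its smoothing factor. Keep in mind that the correction only affects categorical predictors, so with the numeric `Lag1` and `Lag2` it should not change the results here; the object name and the choice `Laplace = 1` are shown only to illustrate the syntax.

```{r}
# Same workflow as above, but asking for the Laplace correction so that
# unseen category/class combinations never get a probability of exactly 0
nb_laplace_spec <- naive_Bayes(Laplace = 1) %>%
  set_mode("classification") %>%
  set_engine("klaR") %>%
  set_args(usekernel = FALSE)

SmarketNbLaplacePredictions <- nb_laplace_spec %>%
  fit(Direction ~ Lag1 + Lag2, data = Smarket_train) %>%
  augment(new_data = Smarket_test)

accuracy(SmarketNbLaplacePredictions, truth = Direction, estimate = .pred_class)
```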