through likelihood and began regression
1 parent
b709848
commit 098f627
Showing
11 changed files
with
237 additions
and
15 deletions.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,38 +1,93 @@ | ||
\section{Bayesian Learning} | ||
\smallskip \hrule height 2pt \smallskip | ||
|
||
% Erick description of Bayesian: | ||
Inferring the probability of the parameters themselves, not the probability of the data. | ||
Whenever you see $P(\theta | D)$ you know that is some posterior distribution. | ||
That is a tidy way of representing your knowledge about $\theta$ and your uncertainty about that knowledge. (The uncertainty is held in the PDF; narrow = certain and flat = uncertain). \hfill \\ | ||
\hfill \\ | ||
|
||
Rather than estimating a single $\theta$, we obtain a distribution over possible values of $\theta$. | ||
|
||
For small sample size, prior is important! | ||
|
||
Use Bayes' Rule: | ||
$ \displaystyle P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}$ | ||
\begin{itemize} | ||
\item \textbf{Posterior}: $P(\theta | D)$ | ||
\item \textbf{Posterior}: $P(\theta | D)$. Note $P(\theta | D) \propto P(D | \theta)P(\theta)$ | ||
\item \textbf{Data Likelihood}: $P(D | \theta) $ | ||
\item \textbf{Prior}: $P(\theta)$ | ||
\item \textbf{normalization}: $P(D)$ | ||
\item \textbf{normalization}: $P(D)$. Constant with respect to $\theta$, so it does not change where the posterior peaks. Usually hard to calculate anyway. | ||
\end{itemize} | ||
Or equivalently, $P(\theta | D) \propto P(D | \theta) P(\theta)$ | ||
Or equivalently, $P(\theta | D) \propto P(D | \theta) P(\theta)$. \textbf{Always use this form}, not the one with $P(D)$ in the denominator. \hfill \\ | ||
\hfill \\ | ||
Note: you are multiplying two functions of $\theta$ here (the likelihood and the prior PDF). When you plug in particular data and a particular $\theta$, your two terms become numbers. \hfill \\ | ||
\hfill \\ | ||
|
||
As you get more and more data, $P(\theta | D)$ grows more and more narrow. | ||
Like with more cannon ball holes, you are more certain about your angle $\theta$. \hfill \\ | ||
\hfill \\ | ||
|
||
About the $P(D)$: it is the ``marginal probability'' (also called the marginal likelihood), | ||
which is the probability of $D$ once you integrate out $\theta$: $P(D) = \int P(D | \theta) P(\theta) \, d\theta$. \hfill \\ % Erick 1/25/2016 | ||
\hfill \\ | ||
\textbf{For uniform priors, MAP reduces to MLE objective}. $P(\theta) \propto 1$ leads to $P(\theta | D) \propto P(D | \theta) $ \hfill \\ \hfill \\ | ||
|
||
If you have a uniform prior, you just do MLE. \hfill \\ | ||
$P(\theta) \propto 1 \rightarrow P(\theta | D) \propto P(D | \theta)$ | ||
$P(\theta) \propto 1 \rightarrow P(\theta | D) \propto P(D | \theta)$ \hfill \\ | ||
\hfill \\ | ||
|
||
Note: if you have $D$ first it is the Likelihood, and if you have $\theta$ first it is the Posterior ($P(D | \theta)$ vs.\ $P(\theta | D)$). \hfill \\ | ||
|
||
|
||
|
||
\underline{Vocab} | ||
\begin{itemize} | ||
\item \textbf{prior}: | ||
\item \textbf{prior distribution}: | ||
\item \textbf{prior distribution}: (same as "prior") % E confirmed 1/25/2016 | ||
\item \textbf{posterior}: | ||
\item \textbf{posterior distribution}: | ||
\item \textbf{MAP}: | ||
\item \textbf{posterior distribution}: (same as "posterior") % E confirmed 1/25/2016 | ||
\item \textbf{Maximum likelihood}: Find the parameter that makes the probability of the observed data highest. E.g. $\theta$ for coin toss. (A famous ``point estimator'') | ||
\item \textbf{MAP}: Maximum a posteriori (estimation). | ||
Maximize the posterior instead of the likelihood. Take the value that causes the highest point in the posterior distribution. | ||
\end{itemize} | ||
|
||
% Erick description of MAP: | ||
Just take the peak of your posterior. Forget about the uncertainty. | ||
Pretty much like MLE, but you also have some influence of a prior. | ||
|
||
\hfill \\ | ||
\underline{Thumbtack Problem} | ||
\underline{Thumbtack Problem}, Bayesian style (MAP) \hfill \\ % https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf | ||
Start as usual with Bayes' without $P(D)$: $P(\theta | D) \propto P(D | \theta) P(\theta)$. \hfill \\ | ||
Define parameters: $\theta$ is the probability of one side up. | ||
$\alpha_H$ and $\alpha_T$ are the number of heads and tails tossed. | ||
$\beta_H$ and $\beta_T$ are the parameters of the prior. | ||
|
||
\begin{itemize} | ||
\item use Binomial likelihood: $P(D | \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$ | ||
\item To get a simple posterior form, use a conjugate prior. Conjugate prior of Binomial is the Beta Distribution. See \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{slides} for math. | ||
\item use Binomial as the likelihood: $ \displaystyle P(D | \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$ | ||
\item the prior is $ \displaystyle P(\theta) = \frac{\theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \Beta(\beta_H, \beta_T)$. The $B$ in the denominator is the beta function (not the same as the Beta distribution). | ||
\item To get a simple posterior form, use a conjugate prior. Conjugate prior of Binomial is the Beta Distribution. See \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{slides} for math: $P(\theta | D) \sim \Beta(\beta_H + \alpha_H, \beta_T + \alpha_T)$ | ||
\item note that the prior and likelihood have the same form in $\theta$ and $(1-\theta)$. When you multiply them, the exponents simply add. | ||
\item $\displaystyle P(\theta | D) = \frac{\theta^{\beta_H + \alpha_H - 1}\cdot (1-\theta)^{\beta_T + \alpha_T - 1}}{B(\beta_H + \alpha_H, \beta_T + \alpha_T)} \sim \Beta(\beta_H + \alpha_H, \beta_T + \alpha_T )$. | ||
\item The Beta prior is equivalent to extra thumbtack flips. As $N \rightarrow \infty$, the prior is ``forgotten''. But for small sample size, prior is important. | ||
\end{itemize} | ||
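\hfill \\
Worked example (a sketch; the numbers are made up for illustration, and it relies on the standard fact that the mode of $\Beta(a, b)$ is $\frac{a-1}{a+b-2}$ when $a, b > 1$): \hfill \\
Suppose $\alpha_H = 3$, $\alpha_T = 2$, and the prior is $\Beta(2, 2)$, i.e. $\beta_H = \beta_T = 2$. Then
\begin{align*}
P(\theta | D) &\sim \Beta(2 + 3, 2 + 2) = \Beta(5, 4) \\
\widehat{\theta}_{MAP} &= \frac{\beta_H + \alpha_H - 1}{\beta_H + \alpha_H + \beta_T + \alpha_T - 2} = \frac{4}{7} \approx 0.57 \\
\widehat{\theta}_{MLE} &= \frac{\alpha_H}{\alpha_H + \alpha_T} = \frac{3}{5} = 0.6
\end{align*}
The prior pulls the estimate toward $0.5$, as if one extra head and one extra tail had been observed. \hfill \\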
|
||
\underline{MAP (point) estimation}: | ||
\begin{enumerate} | ||
\item Choose a distribution to fit the data to. Your choice determines the form of the likelihood ($P(D | \theta)$). | ||
\item Choose a prior (distribution). Can use a table that shows conjugate priors for various distributions. | ||
Prior is over the parameters you are guessing. | ||
\item Now you have a posterior (multiply prior by likelihood). | ||
\item Plug in your particular data values under many values of $\theta$ to get the likelihood ($P(D| \theta)$). Recall the likelihood need not be a PDF (need not be normalized). | ||
\item Pick the value of $\theta$ at the highest point (the peak) of the posterior. | ||
\end{enumerate} | ||
|
||
\underline{MAP estimation} \hfill \\ | ||
Closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. You get to pick the distribution to represent the prior. | ||
MAP estimation can therefore be seen as a regularization of ML estimation. | ||
(Another famous "point estimator") | ||
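\hfill \\
One way to see the regularization view (a sketch, written in the log domain so the product becomes a sum): \hfill \\
\begin{align*}
\widehat{\theta}_{MLE} &= \argmax_{\theta} \ln P(D | \theta) \\
\widehat{\theta}_{MAP} &= \argmax_{\theta} \left[ \ln P(D | \theta) + \ln P(\theta) \right]
\end{align*}
The extra $\ln P(\theta)$ term penalizes values of $\theta$ that the prior considers unlikely. With a uniform prior it is a constant, and MAP gives the same answer as MLE. \hfill \\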
|
||
\underline{Choosing between MLE and MAP}: \hfill \\ | ||
Choose ML if you don't know enough about the domain to impose a prior. | ||
|
||
If you are measuring a continuous variable, Gaussians are your friend. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
\section{Gaussians} | ||
\smallskip \hrule height 2pt \smallskip | ||
|
||
Properties of Gaussians: | ||
\begin{itemize} | ||
\item Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian. | ||
If X $\sim$ N($\mu$,$\sigma^2$) and Y = aX + b, then Y $\sim$ N($a\mu+b, a^2\sigma^2$) | ||
\item Sum of independent Gaussians is Gaussian. | ||
If X $\sim$ N($\mu_X, \sigma^2_X$), | ||
Y $\sim$ N($\mu_Y, \sigma^2_Y$) (independent of X), | ||
and Z = X+Y, then | ||
Z $\sim$ N($\mu_X+\mu_Y, \sigma_X^2 +\sigma_Y^2$) | ||
\item Easy to differentiate. | ||
\end{itemize} | ||
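\hfill \\
Quick numeric check of the first two properties (numbers chosen only for illustration; the sum rule as used here assumes $X$ and $Y$ are independent): \hfill \\
If $X \sim N(1, 4)$ and $Y = 3X + 2$, then $Y \sim N(3 \cdot 1 + 2, \; 3^2 \cdot 4) = N(5, 36)$. \hfill \\
If instead $X \sim N(1, 4)$, $Y \sim N(2, 9)$, and $Z = X + Y$, then $Z \sim N(1 + 2, \; 4 + 9) = N(3, 13)$. \hfill \\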
|
||
Learn a Gaussian: $P(x | \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}}e^\frac{-(x-\mu)^2}{2\sigma^2}$. \hfill \\ | ||
MLE for Gaussian: Prob of i.i.d. samples D = $\{x_1, \dots, x_N\}$: \hfill \\ | ||
$\displaystyle P(D|\mu, \sigma) = ( \frac{1}{\sigma \sqrt{2 \pi}})^N \prod_{i=1}^N e^\frac{-(x_i-\mu)^2}{2\sigma^2}$. \hfill \\ | ||
Note: it is \underline{not} $P(\mu, \sigma | D)$, like I thought in class. \hfill \\ | ||
Find $\mu_{MLE}$, $\sigma_{MLE} = \argmax_{\mu, \sigma} P(D | \mu, \sigma)$. \hfill \\ | ||
|
||
Log-likelihood: $ \displaystyle \ln P(D | \mu, \sigma) = \ln[\mbox{thing above}] = -N \ln (\sigma \sqrt{2\pi}) - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}$. \hfill \\ | ||
Differentiate w.r.t. $\mu$ and set = 0. End up with $ \displaystyle \widehat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i$. \hfill \\ | ||
Differentiate w.r.t. $\sigma$ and set = 0. End up with $ \displaystyle \widehat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (x_i-\widehat{\mu})^2$. \hfill \\ | ||
But actually, that leads to a biased estimate, so people actually use $ \displaystyle \widehat{\sigma}^2_{unbiased} = \frac{1}{N-1} \sum_{i=1}^N (x_i-\widehat{\mu})^2$ \hfill \\ | ||
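\hfill \\
The two differentiation steps, sketched out (each one holds the other parameter fixed): \hfill \\
\begin{align*}
\frac{\partial}{\partial \mu} \ln P(D | \mu, \sigma) &= \sum_{i=1}^N \frac{x_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; \widehat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i \\
\frac{\partial}{\partial \sigma} \ln P(D | \mu, \sigma) &= -\frac{N}{\sigma} + \sum_{i=1}^N \frac{(x_i - \mu)^2}{\sigma^3} = 0 \;\Rightarrow\; \widehat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \widehat{\mu})^2
\end{align*}
E.g. for the made-up data $D = \{1, 2, 3\}$: $\widehat{\mu} = 2$, $\widehat{\sigma}^2_{MLE} = \frac{1 + 0 + 1}{3} = \frac{2}{3}$, and $\widehat{\sigma}^2_{unbiased} = \frac{1 + 0 + 1}{2} = 1$. \hfill \\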
|
||
The conjugate priors: for the mean, use a Gaussian prior: $ \displaystyle P(\mu | \nu, \lambda) = \frac{1}{\lambda \sqrt{2 \pi}}e^\frac{-(\mu - \nu)^2}{2\lambda^2} $. (Instead of $\sigma$, use $\lambda$ and replace the $(x-\mu)^2$ with $(\mu - \nu)^2$). \hfill \\ | ||
For variance: use Wishart Distribution: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
\section{Linear Regression} | ||
\smallskip \hrule height 2pt \smallskip | ||
|
||
\underline{Ordinary Least Squares} \hfill \\ | ||
|
||
Notation: | ||
\begin{itemize} | ||
\item \textbf{$x_i$}: an input data point. \_\_ rows by \_\_ columns. | ||
\item \textbf{$y_i$}: an observed output (the value you have data for) | ||
\item \textbf{$\widehat{y_i}$}: a predicted output | ||
\item \textbf{$\widehat{y}$}: | ||
\item \textbf{$w_k$}: weight k | ||
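\item \textbf{$\bm{w}*$}: the optimal weights, i.e. the $\bm{w}$ that minimizes the total squared error | ||
\item \textbf{$f_k(x_i)$}: feature (basis function) $k$ evaluated at input $x_i$ | ||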
\item \textbf{$t_j$}: the output variable that you either have data for or are predicting. | ||
\item \textbf{$t(\bm{x})$}: Data. "Mapping from x to t(x)" | ||
\item \textbf{$H$}: $H = \{ h_1, \dots, h_K \}$. Basis functions. In the simplest case, they can just be the value of an input variable/feature or a constant (for bias). | ||
\end{itemize} | ||
|
||
\underline{Vocab}: | ||
\begin{itemize} | ||
\item \textbf{basis function} | ||
\item \textbf{bias} - like the intercept in a linear equation. The part that doesn't depend on the features. | ||
\item \textbf{hyperplane} - the generalization of a plane to higher dimensions: a flat surface with one dimension fewer than the space it lives in. | ||
\item \textbf{input variable} - a.k.a. feature. % https://en.wikipedia.org/wiki/Dependent_and_independent_variables | ||
E.g. a column like CEO salary for rows of data corresponding to different companies. | ||
\item \textbf{response variable} - synonyms: "dependent variable", "regressand", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", and "output variable". E.g. a predicted stock price. | ||
\item \textbf{regularization} - introducing additional information in order to solve an ill-posed problem or to prevent overfitting. | ||
% https://en.wikipedia.org/wiki/Regularization_(mathematics) | ||
E.g. applying a penalty for large parameters in the model. | ||
\item \textbf{ridge regression} - least squares with an added penalty on the squared size of the weights ($L_2$ regularization). | ||
\end{itemize} | ||
|
||
\underline{Ordinary Least Squares}: \hfill \\ | ||
total error = $\displaystyle \sum_i (y_i-\hat{y_i})^2 = \sum_i(y_i - \sum_k w_k f_k(x_i))^2$ \hfill \\ | ||
Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator. \hfill \\ % https://en.wikipedia.org/wiki/Ordinary_least_squares | ||
?? Use words to describe what subset of regression in general this is. What is ordinary? What are we limiting? \hfill \\ | ||
\hfill \\ | ||
|
||
The regression problem: \hfill \\ | ||
Given basis functions $\{ h_1, \dots, h_K \}$ with $h_i(\bm{x}) \in \mathbb{R}$, \hfill \\ | ||
find coefficients $\bm{w} = \{ w_1, \dots, w_K \}$. \hfill \\% | ||
$t(\bm{x}) \approx \widehat{f}(\bm{x}) = \sum_i w_i h_i(\bm{x})$ | ||
|
||
This is called linear regression b/c it is linear in the parameters. | ||
We can still fit to nonlinear functions by using nonlinear basis functions. | ||
Minimize the \textbf{residual squared error}: \hfill \\ | ||
$ \displaystyle \bm{w}* = \argmin_{\bm{w}} \sum_j (t(\bm{x}_j) - \sum_i w_i h_i(\bm{x}_j))^2$ | ||
\hfill \\ \hfill \\ | ||
|
||
For fitting a line in 2D space, your basis functions are $\{ h_1(x) = x, h_2(x) = 1 \}$ \hfill \\ \hfill \\ | ||
|
||
To fit a parabola, your basis functions could be $\{ h_1(x) = x^2, h_2(x)=x, h_3(x)=1 \}$. \hfill \\ | ||
Want a 2D parabola? Use $\{ h_1(x) = x_1^2, h_2(x)=x_2^2, h_3(x)=x_1 x_2, \dots \}$. \hfill \\ | ||
Can define any basis functions $h_i(\bm{x})$ for n-dimensional input $\bm{x} = \langle x_1, \dots, x_n \rangle$ | ||
\hfill \\ \hfill \\ | ||
|
||
\underline{Regression: matrix notation}: \hfill \\ | ||
\begin{align*} | ||
\bm{w}* &= \argmin_w \sum_j(t(\bm{x}_j) - \sum_i w_i h_i(\bm{x}_j))^2 \\ | ||
\bm{w}* &= \argmin_w (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t}) | ||
\end{align*} | ||
$ (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t})$ is the residual error. | ||
\includegraphics[width=3in]{figures/Least_squares_matricies.pdf} | ||
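\hfill \\
Spelled out, under the convention that rows index data points and columns index basis functions (the layout that makes $\bm{Hw}$ agree with $\sum_i w_i h_i(\bm{x}_j)$), with $N$ data points and $K$ basis functions: \hfill \\
\begin{align*}
\bm{H} = \begin{pmatrix} h_1(\bm{x}_1) & \cdots & h_K(\bm{x}_1) \\ \vdots & \ddots & \vdots \\ h_1(\bm{x}_N) & \cdots & h_K(\bm{x}_N) \end{pmatrix}, \quad
\bm{w} = \begin{pmatrix} w_1 \\ \vdots \\ w_K \end{pmatrix}, \quad
\bm{t} = \begin{pmatrix} t(\bm{x}_1) \\ \vdots \\ t(\bm{x}_N) \end{pmatrix}
\end{align*}
So $\bm{H}$ is $N \times K$, and entry $j$ of $\bm{Hw} - \bm{t}$ is the prediction error for data point $j$. \hfill \\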
|
||
\underline{Regression: closed form solution}: % derivation: https://courses.cs.washington.edu/courses/cse446/16wi/Slides/4_LinearRegression.pdf | ||
\begin{align*} | ||
\bm{w}* = \argmin_w (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t}) & \\ | ||
F(\bm{w}) = (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t}) & \\ | ||
\nabla_{\bm{w}} F(\bm{w}) = 0 & \\ | ||
2 \bm{H}^T (\bm{H}\bm{w}-\bm{t}) = 0 & \\ | ||
\bm{H}^T\bm{H}\bm{w} - \bm{H}^T\bm{t} = 0 & \\ | ||
\bm{w}* = (\bm{H}^T\bm{H})^{-1}\bm{H}^T\bm{t} & | ||
\end{align*} | ||
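\hfill \\
Tiny worked example (made-up data, just to exercise the formula): fit a line to the points $(0, 1)$, $(1, 3)$, $(2, 5)$ with basis functions $\{ h_1(x) = x, h_2(x) = 1 \}$. Each row of $\bm{H}$ is one data point and each column is one basis function. \hfill \\
\begin{align*}
\bm{H} &= \begin{pmatrix} 0 & 1 \\ 1 & 1 \\ 2 & 1 \end{pmatrix}, \quad \bm{t} = \begin{pmatrix} 1 \\ 3 \\ 5 \end{pmatrix} \\
\bm{H}^T\bm{H} &= \begin{pmatrix} 5 & 3 \\ 3 & 3 \end{pmatrix}, \quad \bm{H}^T\bm{t} = \begin{pmatrix} 13 \\ 9 \end{pmatrix} \\
\bm{w}* &= \frac{1}{6} \begin{pmatrix} 3 & -3 \\ -3 & 5 \end{pmatrix} \begin{pmatrix} 13 \\ 9 \end{pmatrix} = \begin{pmatrix} 2 \\ 1 \end{pmatrix}
\end{align*}
So the fitted line is $t = 2x + 1$, which passes through all three points (zero residual error). \hfill \\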
|
||
\includegraphics[width=3in]{figures/Regression_matrix_math.pdf} | ||
|
||
Linear regression prediction is a linear function plus Gaussian noise: \hfill \\ | ||
$t(\bm{x}) = \sum_i w_i h_i(\bm{x}) + \epsilon $ \hfill \\ | ||
We can learn $\bm{w}$ using MLE: | ||
$P(t | x, w, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{\frac{-[t - \sum_i w_i h_i(x)]^2}{2 \sigma^2}}$ \hfill \\ | ||
Take the log and maximize with respect to $w$ (maximizing the log-likelihood): \hfill \\ | ||
$\displaystyle \ln P(D | \bm{w}, \sigma) = \ln \left[ \left(\frac{1}{\sigma \sqrt{2 \pi}}\right)^N \prod_{j=1}^N e^{\frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2}} \right]$ \hfill \\ | ||
Now find the $w$ that maximizes this: \hfill \\ | ||
$\displaystyle \argmax_w \ln \left(\frac{1}{\sigma \sqrt{2 \pi}}\right)^N + \sum_{j=1}^N \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2}$ \hfill \\ | ||
the first term isn't impacted by $w$, so \hfill \\ | ||
$\displaystyle = \argmax_w \sum_{j=1}^N \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2}$ \hfill \\ | ||
switch to $\argmin_w$ when we multiply by $-1$. The denominator $2\sigma^2$ is a positive constant, so it can be dropped: \hfill \\ | ||
$\displaystyle = \argmin_w \sum_{j=1}^N [t_j - \sum_i w_i h_i(x_j)]^2 $ \hfill \\ | ||
|
||
\textbf{Least-squares Linear Regression is MLE for Gaussians!!!} \hfill \\ \hfill \\ | ||
|
||
\underline{Regularization in Linear Regression} \hfill \\ | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
\section{General Vocab} | ||
\smallskip \hrule height 2pt \smallskip | ||
|
||
\begin{itemize} | ||
\item \textbf{held-out data}: synonymous with validation data (?) | ||
\item \textbf{hypothesis space}: ? E.g. binomial distribution for coin flip. | ||
\item \textbf{prediction error}: measure of fit (?) | ||
\item \textbf{regularization}: a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting % https://en.wikipedia.org/wiki/Regularization_(mathematics) | ||
%\item \textbf{} | ||
\end{itemize} |