Commit
through likelihood and began regression
JanetMatsen committed Jan 26, 2016
1 parent b709848 commit 098f627
Showing 11 changed files with 237 additions and 15 deletions.
Binary file modified ML_cheatsheet.pdf
Binary file not shown.
13 changes: 12 additions & 1 deletion ML_cheatsheet.tex
@@ -12,6 +12,7 @@
\usepackage{graphicx}
\usepackage{array}
\usepackage{booktabs}
\usepackage{bm} % bold math \bm{}
\usepackage[bottom]{footmisc}
\usepackage{tikz}
\usetikzlibrary{shapes}
@@ -29,6 +30,7 @@
\usepackage{relsize}
\usepackage{rotating}


\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
\def\independenT#1#2{\mathrel{\setbox0\hbox{$#1#2$}%
\copy0\kern-\wd0\mkern4mu\box0}}
@@ -153,7 +155,16 @@

\input{./tex/bayesian.tex}

Let's do this thing.
\input{./tex/gaussians.tex}

\input{./tex/linear_regression.tex}

\input{./tex/vocab.tex}


\vspace{4in}
\bigskip


\end{multicols*}
\end{document}
Binary file added figures/Least_squares_matricies.pdf
Binary file not shown.
Binary file added figures/Regression_matrix_math.pdf
Binary file not shown.
75 changes: 65 additions & 10 deletions tex/bayesian.tex
@@ -1,38 +1,93 @@
\section{Bayesian Learning}
\smallskip \hrule height 2pt \smallskip

% Erick description of Bayesian:
Inferring the probability of the parameters themselves, not the probability of the data.
Whenever you see $P(\theta | D)$ you know that is some posterior distribution.
That is a tidy way of representing your knowledge about $\theta$ and your uncertainty about that knowledge. (The uncertainty is held in the PDF; narrow = certain and flat = uncertain). \hfill \\
\hfill \\

Rather than estimating a single $\theta$, we obtain a distribution over possible values of $\theta$.

For small sample size, prior is important!

Use Bayes' Rule:
$ \displaystyle P(\theta | D) = \frac{P(D | \theta) P(\theta)}{P(D)}$
\begin{itemize}
\item \textbf{Posterior}: $P(\theta | D)$
\item \textbf{Posterior}: $P(\theta | D)$. Note $P(\theta | D) \propto P(D | \theta)P(\theta)$
\item \textbf{Data Likelihood}: $P(D | \theta) $
\item \textbf{Prior}: $P(\theta)$
\item \textbf{normalization}: $P(D)$
\item \textbf{normalization}: $P(D)$. Just a constant so it doesn't matter. Hard to calculate anyway.
\end{itemize}
Or equivalently, $P(\theta | D) \propto P(D | \theta) P(\theta)$
Or equivalently, $P(\theta | D) \propto P(D | \theta) P(\theta)$. \textbf{Always use this form}, not the one with $P(D)$ in the denominator. \hfill \\
\hfill \\
Note: you are multiplying two PDFs here. When you plug in particular data, your two terms become numbers. \hfill \\
\hfill \\
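A minimal numeric sketch of this multiplication (not from the slides; the 8-heads-out-of-10 data and the grid of $\theta$ values are made up for illustration):
\begin{verbatim}
# Python sketch: posterior \propto likelihood * prior on a grid of thetas.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)  # candidate parameter values
prior = np.ones_like(theta)             # uniform prior: P(theta) \propto 1
likelihood = theta**8 * (1 - theta)**2  # P(D | theta) for 8 heads, 2 tails
posterior = likelihood * prior          # unnormalized posterior
posterior /= np.trapz(posterior, theta) # normalize numerically
print(theta[np.argmax(posterior)])      # peak ~0.8 (= MLE, since prior is flat)
\end{verbatim}
With a flat prior the peak matches the MLE; a non-flat prior would shift it. \hfill \\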

As you get more and more data, $P(\theta | D)$ grows more and more narrow.
As with cannonball holes: the more holes, the more certain you are about your angle $\theta$. \hfill \\
\hfill \\

About $P(D)$: it is the ``marginal probability'',
i.e. the probability of $D$ when you integrate out $\theta$: $P(D) = \int P(D | \theta) P(\theta) \, d\theta$. \hfill \\ % Erick 1/25/2016
\hfill \\
\textbf{For uniform priors, MAP reduces to the MLE objective}: $P(\theta) \propto 1$ leads to $P(\theta | D) \propto P(D | \theta)$. \hfill \\ \hfill \\

If you have a uniform prior, you just do MLE. \hfill \\
$P(\theta) \propto 1 \rightarrow P(\theta | D) \propto P(D | \theta)$
$P(\theta) \propto 1 \rightarrow P(\theta | D) \propto P(D | \theta)$ \hfill \\
\hfill \\

Note: if $D$ comes first it is the likelihood, $P(D | \theta)$; if $\theta$ comes first it is the posterior, $P(\theta | D)$. \hfill \\



\underline{Vocab}
\begin{itemize}
\item \textbf{prior}:
\item \textbf{prior distribution}:
\item \textbf{prior distribution}: (same as "prior") % E confirmed 1/25/2016
\item \textbf{posterior}:
\item \textbf{posterior distribution}:
\item \textbf{MAP}:
\item \textbf{posterior distribution}: (same as "posterior") % E confirmed 1/25/2016
\item \textbf{Maximum likelihood}: Find the parameter that makes the probability highest. E.g. $\theta$ for coin toss. (A famous "point estimator")
\item \textbf{MAP}: Maximum a posteriori (estimation).
Maximize the posterior instead of the likelihood. Take the value that causes the highest point in the posterior distribution.
\end{itemize}

% Erick description of MAP:
Just take the peak of your posterior. Forget about the uncertainty.
Pretty much like MLE, but you also have some influence of a prior.

\hfill \\
\underline{Thumbtack Problem}
\underline{Thumbtack Problem}, Bayesian style (MAP) \hfill \\ % https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf
Start as usual with Bayes' without $P(D)$: $P(\theta | D) \propto P(D | \theta) P(\theta)$. \hfill \\
Define parameters: $\theta$ is the probability of one side up.
$\alpha_H$ and $\alpha_T$ are the number of heads and tails tossed.
$\beta_H$ and $\beta_T$ are the parameters of the prior.

\begin{itemize}
\item use Binomial likelihood: $P(D | \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
\item To get a simple posterior form, use a conjugate prior. Conjugate prior of Binomial is the Beta Distribution. See \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{slides} for math.
\item use Binomial as the likelihood: $ \displaystyle P(D | \theta) = \theta^{\alpha_H} (1-\theta)^{\alpha_T}$
\item the prior is $ \displaystyle P(\theta) = \frac{\theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \Beta(\beta_H, \beta_T)$. The $B$ in the denominator is the beta function (not the same as the beta distribution).
\item To get a simple posterior form, use a conjugate prior. The conjugate prior of the Binomial is the Beta distribution. See \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{slides} for math: $P(\theta | D) \sim \Beta(\beta_H + \alpha_H, \beta_T + \alpha_T)$
\item note that the prior and likelihood have matching terms; when you multiply them, the exponents of $\theta$ and $(1-\theta)$ simply add.
\item $\displaystyle P(\theta | D) = \frac{\theta^{\beta_H + \alpha_H - 1}\cdot (1-\theta)^{\beta_T + \alpha_T - 1}}{B(\beta_H + \alpha_H, \beta_T + \alpha_T)} \sim \Beta(\beta_H + \alpha_H, \beta_T + \alpha_T )$.
\item The Beta prior is equivalent to extra thumbtack flips. As $N \rightarrow \infty$, the prior is ``forgotten''. But for small sample size, the prior is important. (See the numeric sketch after this list.)
\end{itemize}
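A small numeric sketch of the conjugate update and the resulting MAP estimate (the counts below are made up; \texttt{theta\_map} uses the mode of a Beta distribution, $(a-1)/(a+b-2)$):
\begin{verbatim}
# Python sketch: Beta prior + Binomial likelihood -> Beta posterior.
alpha_H, alpha_T = 7, 3   # observed heads / tails (made-up data)
beta_H, beta_T = 5, 5     # Beta prior parameters ("pseudo-flips")

post_H = beta_H + alpha_H # posterior: Beta(beta_H + alpha_H, beta_T + alpha_T)
post_T = beta_T + alpha_T

theta_map = (post_H - 1) / (post_H + post_T - 2)  # Beta mode: (a-1)/(a+b-2)
theta_mle = alpha_H / (alpha_H + alpha_T)         # MLE ignores the prior
print(theta_map, theta_mle)  # ~0.611 vs 0.7: the prior pulls toward 0.5
\end{verbatim}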

\underline{MAP (point) estimation}:
\begin{enumerate}
\item Choose a distribution to fit the data to. Your choice determines the form of the likelihood ($P(D | \theta)$).
\item Choose a prior (distribution). You can use a table of conjugate priors for various distributions.
The prior is over the parameters you are guessing.
\item Now you have a posterior (multiply prior by likelihood).
\item Plug in your particular data values under many values of $\theta$ to get the likelihood ($P(D| \theta)$). Recall the likelihood need not be a PDF (need not be normalized).
\item Pick the value of $\theta$ at the highest point (the peak) of the posterior.
\end{enumerate}

\underline{MAP estimation} \hfill \\
Closely related to Fisher's method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. You get to pick the distribution to represent the prior.
MAP estimation can therefore be seen as a regularization of ML estimation.
(Another famous "point estimator")

\underline{Choosing between MLE and MAP}: \hfill \\
Choose MLE if you don't know enough about the domain to impose a prior.

If you are measuring a continuous variable, Gaussians are your friend.
19 changes: 18 additions & 1 deletion tex/decision_trees.tex
@@ -1,4 +1,4 @@
\section{Decison Trees}
\section{Decision Trees}
\smallskip \hrule height 2pt \smallskip

Summary: \hfill \\
@@ -42,6 +42,7 @@ \section{Decison Trees}
\item \textbf{argmax} - the input that leads to the maximum output
\item \textbf{greedy} - at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. % An Introduction to Statistical Learning.pdf pdf pg 320
\item \textbf{threshold splits} - % lec https://courses.cs.washington.edu/courses/cse446/16wi/Slides/2_DecisionTrees_Part2.pdf
\item \textbf{random forest}: an ensemble of decision trees whose individual predictions are combined (e.g. by majority vote). Each tree is built from a random subset of the training data (see the sketch after this list). % https://www.kaggle.com/c/titanic/details/getting-started-with-random-forests
\end{itemize}
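A minimal sketch of the random-forest entry above, using scikit-learn (an assumed third-party dependency; the toy data are made up):
\begin{verbatim}
# Python sketch: random forest = ensemble of randomized decision trees.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [1, 1], [1, 0], [0, 1]]  # toy feature rows (made up)
y = [0, 1, 1, 0]                      # toy labels

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)                 # each tree sees a bootstrap sample of the data
print(clf.predict([[1, 1]]))  # prediction = majority vote across the trees
\end{verbatim}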


@@ -91,6 +92,11 @@ \section{Decison Trees}
Start at the bottom, not the top. The top is most likely to have your best splits.
In this way, you only cut high branches if all the branches below were cut.

Don't use the validation set for pruning. % TA canvas note 1/24/2015.
\textbf{Your code should never use the validation set.}
The validation set is for \textbf{you} to learn from; the code will always learn from the training set.


\underline{Classification vs. Regression Trees} \hfill \\
In class we mostly discussed nodes with categorical attributes.
You can have continuous attributes (see HW1).
@@ -99,3 +105,14 @@ \section{Decison Trees}
If it is continuous, you need to do something more like least squares.
For regression trees, see pg 306 from \href{https://www-bcf.usc.edu/~gareth/ISL/}{ISL} or pg 307 of \href{https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf}{ESLII}.

\underline{For discrete data}: \hfill \\
"For discrete data, you can't split twice on the same feature. Once you've moved down a branch, you know that all data in that branch has the same value for the splitting feature." % TA board 1/24/2015

\underline{For continuous data}: \hfill \\
Continuous attributes are more computationally expensive than discrete ones.
Often you can convert continuous data to categorical (e.g. by binning).
You might lose some smoothness for real numbers, but it can be worth it.

\underline{K-fold validation versus using a held-out data set}: \hfill \\
If you have enough data to pull out a held-out set, that is preferable to K-fold validation. % Farhadi said to me 1/25/2016.
% This says the opposite: https://stats.stackexchange.com/questions/104713/hold-out-validation-vs-k-fold-validation
2 changes: 1 addition & 1 deletion tex/essential_ideas.tex
@@ -2,7 +2,7 @@ \section{Essential ML ideas}
\smallskip \hrule height 2pt \smallskip

\begin{itemize}
\item Never ever touch the test set
\item Never ever \underline{ever} touch the test set
\item You know you are overfitting when there is a big gap between train and test results, e.g. in a metric like percent wrong.
\item Be comfortable taking a hit on training accuracy if it buys you better generalization.
\end{itemize}
28 changes: 28 additions & 0 deletions tex/gaussians.tex
@@ -0,0 +1,28 @@
\section{Gaussians}
\smallskip \hrule height 2pt \smallskip

Properties of Gaussians:
\begin{itemize}
\item Affine transformations (multiplying by a scalar and adding a constant) of Gaussians are Gaussian.
If X $\sim$ N($\mu$,$\sigma^2$) and Y = aX + b, then Y $\sim$ N($a\mu+b, a^2\sigma^2$). (Worked example after this list.)
\item Sum of independent Gaussians is Gaussian.
If X $\sim$ N($\mu_X, \sigma^2_X$),
Y $\sim$ N($\mu_Y, \sigma^2_Y$),
and Z = X+Y, then
Z $\sim$ N($\mu_X+\mu_Y, \sigma_X^2 +\sigma_Y^2$)
\item Easy to differentiate.
\end{itemize}
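A quick worked instance of both rules (illustrative numbers): if $X \sim N(1, 4)$ and $Y = 2X + 3$, the affine rule gives $Y \sim N(2 \cdot 1 + 3,\ 2^2 \cdot 4) = N(5, 16)$; if $Z = X + X'$ for an independent $X' \sim N(2, 9)$, the sum rule gives $Z \sim N(1 + 2,\ 4 + 9) = N(3, 13)$. \hfill \\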

Learn a Gaussian: $P(x | \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}}e^\frac{-(x-\mu)^2}{2\sigma^2}$. \hfill \\
MLE for Gaussian: Prob of i.i.d. samples D = $\{x_1, \dots, x_N\}$: \hfill \\
$\displaystyle P(D|\mu, \sigma) = ( \frac{1}{\sigma \sqrt{2 \pi}})^N \prod_{i=1}^N e^\frac{-(x_i-\mu)^2}{2\sigma^2}$. \hfill \\
Note: it is \underline{not} $P(\mu, \sigma | D)$, like I thought in class. \hfill \\
Find $\mu_{MLE}$, $\sigma_{MLE} = \argmax_{\mu, \sigma} P(D | \mu, \sigma)$. \hfill \\

Log-likelihood: $ \displaystyle \ln P(D | \mu, \sigma) = \ln[\mbox{thing above}] = -N \ln (\sigma \sqrt{2\pi}) - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}$. \hfill \\
Differentiate w.r.t. $\mu$ and set = 0. End up with $ \displaystyle \widehat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i$. \hfill \\
Differentiate w.r.t. $\sigma$ and set = 0. End up with $ \displaystyle \widehat{\sigma}^2_{MLE} = \frac{1}{N} \sum_{i=1}^N (x_i-\widehat{\mu})^2$. \hfill \\
But actually, that leads to a biased estimate, so people actually use $ \displaystyle \widehat{\sigma}^2_{unbiased} = \frac{1}{N-1} \sum_{i=1}^N (x_i-\widehat{\mu})^2$ \hfill \\
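A minimal sketch of these estimators on synthetic data (the true $\mu = 5$, $\sigma = 2$ are made up for illustration):
\begin{verbatim}
# Python sketch: MLE vs. unbiased variance for i.i.d. Gaussian samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # synthetic D = {x_1..x_N}

mu_hat = x.mean()                                 # (1/N) sum x_i
var_mle = ((x - mu_hat)**2).mean()                # divide by N   -> biased
var_unb = ((x - mu_hat)**2).sum() / (len(x) - 1)  # divide by N-1 -> unbiased

print(mu_hat, var_mle, var_unb)  # all near mu = 5, sigma^2 = 4
\end{verbatim}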

The conjugate priors: for the mean, use a Gaussian prior: $ \displaystyle P(\mu | \nu, \lambda) = \frac{1}{\lambda \sqrt{2 \pi}}e^\frac{-(\mu - \nu)^2}{2\lambda^2} $. (Instead of $\sigma$, use $\lambda$, and replace the $(x-\mu)^2$ with $(\mu - \nu)^2$.) \hfill \\
For the variance: use the Wishart distribution.
9 changes: 7 additions & 2 deletions tex/likelihood.tex
@@ -1,9 +1,11 @@
\section{Likelihood}
\section{Maximum Likelihood \& Maximum a Posteriori}
\smallskip \hrule height 2pt \smallskip

\underline{Vocab}
\begin{itemize}
\item\textbf{likelihood}: the probability of the data given a parameter, e.g. $P(D | \theta)$ (for discrete distributions like the Binomial).
Need not be a PDF; need not be normalized. % https://www.robots.ox.ac.uk/~az/lectures/est/lect34.pdf + Erick
\item \textbf{log-likelihood}: lower-case: $l(\theta|x) = \log L(\theta | x)$
\item \textbf{MLE}: Maximum Likelihood Estimation.
\item \textbf{PAC}: Probability Approximately Correct.
\end{itemize}
@@ -35,5 +37,8 @@ \section{Likelihood}
\end{align*}

For Binomial, there is exponential decay in uncertainty with \# of observations. % slide 7 at https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf
You can also find the probability that you are approximately correct (see \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{notes}).
You can also find the probability that you are approximately correct (see \href{https://courses.cs.washington.edu/courses/cse446/16wi/Slides/3_PointEstimation.pdf}{notes}). \hfill \\
$P(|\widehat{\theta} - \theta^*| \geq \epsilon) \leq 2e^{-2N\epsilon^2}$. You can calculate the $N$ (\# of flips) needed to keep the error below $\epsilon$ with probability of being wrong at most $\delta$. How sensitive you need to be depends on your problem; error on stock market data might cost billions.
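Setting $2e^{-2N\epsilon^2} \leq \delta$ and solving for $N$ gives $N \geq \ln(2/\delta)/(2\epsilon^2)$. A tiny sketch (the $\epsilon$ and $\delta$ values are made up):
\begin{verbatim}
# Python sketch: flips needed so P(|theta_hat - theta*| >= eps) <= delta.
import math

def flips_needed(eps, delta):
    # Solve 2 * exp(-2 * N * eps**2) <= delta for N.
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(flips_needed(0.05, 0.05))  # ~738 flips: 5% error, 95% confidence
\end{verbatim}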

What if you had prior beliefs? Use MAP instead of MLE.

96 changes: 96 additions & 0 deletions tex/linear_regression.tex
@@ -0,0 +1,96 @@
\section{Linear Regression}
\smallskip \hrule height 2pt \smallskip

\underline{Ordinary Least Squares} \hfill \\

Notation:
\begin{itemize}
\item \textbf{$x_i$}: an input data point. \_\_ rows by \_\_ columns.
\item \textbf{$y_i$}: an observed output
\item \textbf{$\widehat{y_i}$}: a predicted output
\item \textbf{$\widehat{y}$}:
\item \textbf{$w_k$}: weight k
\item \textbf{$\bm{w}*$}:
\item \textbf{$f_k(x_i)$}
\item \textbf{$t_j$}: the output variable that you either have data for or are predicting.
\item \textbf{$t(\bm{x})$}: Data. "Mapping from x to t(x)"
\item \textbf{$H$}: $H = \{ h_1, \dots, h_K \}$. Basis functions. In the simplest case, they can just be the value of an input variable/feature or a constant (for bias).
\end{itemize}

\underline{Vocab}:
\begin{itemize}
\item \textbf{basis function}
\item \textbf{bias} - like the intercept in a linear equation. The part that doesn't depend on the features.
\item \textbf{hyperplane} - a flat subspace whose dimension is one less than that of the ambient space; a line in 2D, a plane in 3D.
\item \textbf{input variable} - a.k.a. feature. % https://en.wikipedia.org/wiki/Dependent_and_independent_variables
E.g. a column like CEO salary for rows of data corresponding to different companies.
\item \textbf{response variable} - synonyms: "dependent variable", "regressand", "predicted variable", "measured variable", "explained variable", "experimental variable", "responding variable", "outcome variable", and "output variable". E.g. a predicted stock price.
\item \textbf{regularization} - introducing additional information in order to solve an ill-posed problem or to prevent overfitting.
% https://en.wikipedia.org/wiki/Regularization_(mathematics)
E.g. applying a penalty for large parameters in the model.
\item \textbf{ridge regression} -
\end{itemize}

\underline{Ordinary Least Squares}: \hfill \\
total error = $\displaystyle \sum_i (y_i-\hat{y_i})^2 = \sum_i(y_i - \sum_k w_k f_k(x_i))^2$ \hfill \\
Under the additional assumption that the errors be normally distributed, OLS is the maximum likelihood estimator. \hfill \\ % https://en.wikipedia.org/wiki/Ordinary_least_squares
?? Use words to describe what subset of regression in general this is. What is ordinary? What are we limiting? \hfill \\
\hfill \\

The regression problem: \hfill \\
Given basis functions $\{ h_1, \dots, h_K \}$ with $h_i(\bf{x}) \in \mathbb{R}$, \hfill \\
find coefficients $\bm{w} = \{ w_1, \dots, w_K \}$. \hfill \\%
$t(\bm{x}) \approx \widehat{f}(\bm{x}) = \sum_i w_i h_i(\bm{x})$

This is called linear regression b/c it is linear in the parameters.
We can still fit to nonlinear functions by using nonlinear basis functions.
Minimize the \textbf{residual squared error}: \hfill \\
$ \displaystyle \bm{w}* = \argmin_{\bm{w}} \sum_j (t(\bm{x}_j) - \sum_i w_i h_i(\bm{x}_j))^2$
\hfill \\ \hfill \\

For fitting a line in 2D space, your basis functions are $\{ h_1(x) = x, h_2(x) = 1 \}$ \hfill \\ \hfill \\

To fit a parabola, your basis functions could be $\{ h_1(x) = x^2, h_2(x)=x, h_3(x)=1 \}$. \hfill \\
Want a 2D parabola? Use $\{ h_1(x) = x_1^2, h_2(x)=x_2^2, h_3(x)=x_1 x_2, \dots \}$. \hfill \\
Can define any basis functions $h_i(\bm{x})$ for n-dimensional input $\bm{x} = \langle x_1, \dots, x_n \rangle$
\hfill \\ \hfill \\

\underline{Regression: matrix notation}: \hfill \\
\begin{align*}
\bm{w}* &= \argmin_w \sum_j(t(\bm{x}_j) - \sum_i w_i h_i(\bm{x}_j))^2 \\
\bm{w}* &= \argmin_w (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t})
\end{align*}
$ (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t})$ is the residual error.
\includegraphics[width=3in]{figures/Least_squares_matricies.pdf}

\underline{Regression: closed form solution}: % derivation: https://courses.cs.washington.edu/courses/cse446/16wi/Slides/4_LinearRegression.pdf
\begin{align*}
\bm{w}* = \argmin_w (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t}) & \\
\bm{F}(\bm{w}) = (\bm{Hw} -\bm{t})^T (\bm{Hw} -\bm{t}) & \\
\triangledown_{\bm{w}}\bm{F}(\bm{w}) = 0 & \\
2 \bm{H}^T (\bm{H}\bm{w}-\bm{t}) = 0 & \\
(\bm{H}^T\bm{H}\bm{w}) - \bm{H}^T\bm{t} = 0 & \\
\bm{w}* = (\bm{H}^T\bm{H})^{-1}\bm{H}^T\bm{t} &
\end{align*}

\includegraphics[width=3in]{figures/Regression_matrix_math.pdf}
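A minimal sketch of this closed-form solve for a line fit with basis functions $\{ h_1(x) = x, h_2(x) = 1 \}$ (the data are synthetic; solving the normal equations is preferred over forming the inverse explicitly):
\begin{verbatim}
# Python sketch: w* = (H^T H)^{-1} H^T t for a line fit, basis {x, 1}.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
t = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=x.shape)  # synthetic targets

H = np.column_stack([x, np.ones_like(x)])  # one column per basis function
w = np.linalg.solve(H.T @ H, H.T @ t)      # normal equations
print(w)  # close to [3.0, 2.0]
\end{verbatim}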

Linear regression prediction is a linear function plus Gaussian noise: \hfill \\
$t(\bm{x}) = \sum_i w_i h_i(\bm{x}) + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$ \hfill \\
We can learn $\bf{w}$ using MLE:
$P(t | x, w, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^\frac{-[t - \sum_i w_i h_i(x)]^2}{2 \sigma^2}$
Take the log and maximize with respect to w: (maximizing log-likelihood with respect to w) \hfill \\
$\displaystyle \ln P(D | \bm{w}, \sigma) = \ln \Big[ \Big(\frac{1}{\sigma \sqrt{2 \pi}}\Big)^N \prod_{j=1}^N e^\frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2} \Big]$ \hfill \\
Now find the w that maximizes this: \hfill \\
$\argmax_w N \ln \frac{1}{\sigma \sqrt{2 \pi}} + \sum_{j=1}^N \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2}$ \hfill \\
the first term isn't impacted by $w$ so \hfill \\
$= \argmax_w \sum_{j=1}^N \frac{-[t_j - \sum_i w_i h_i(x_j)]^2}{2 \sigma^2}$ \hfill \\
multiply by $-1$ (so $\argmax_w$ becomes $\argmin_w$) and drop the constant denominator $2\sigma^2$: \hfill \\
$= \argmin_w \sum_{j=1}^N [t_j - \sum_i w_i h_i(x_j)]^2 $ \hfill \\

\textbf{Least-squares Linear Regression is MLE for Gaussians!!!} \hfill \\ \hfill \\

\underline{Regularization in Linear Regression} \hfill \\



10 changes: 10 additions & 0 deletions tex/vocab.tex
@@ -0,0 +1,10 @@
\section{General Vocab}
\smallskip \hrule height 2pt \smallskip

\begin{itemize}
\item \textbf{held-out data}: synonymous with validation data (?)
\item \textbf{hypothesis space}: ? E.g. binomial distribution for coin flip.
\item \textbf{prediction error}: measure of fit (?)
\item \textbf{regularization}: a process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting % https://en.wikipedia.org/wiki/Regularization_(mathematics)
%\item \textbf{}
\end{itemize}
