
Commit

through GMM audio/pdf
JanetMatsen committed Mar 16, 2016
1 parent 4a4470b commit cc80c5d
Showing 11 changed files with 275 additions and 17 deletions.
4 changes: 2 additions & 2 deletions ML_cheatsheet.tex
@@ -172,12 +172,12 @@

\input{./tex/boosting.tex}

\input{./tex/vocab.tex}

\input{./tex/clustering.tex}

\input{./tex/expectation_maximization.tex}

\input{./tex/vocab.tex}

% Reference: \includegraphics[width=2.5in]{figures/example_kernel_separation.pdf} \hfill \\


Binary file added figures/GMM_cartoon.pdf
Binary file not shown.
Binary file added figures/agg_clustering.pdf
Binary file not shown.
Binary file not shown.
Binary file added figures/dendogram.pdf
Binary file not shown.
Binary file added figures/gmm_for_means_only.pdf
Binary file not shown.
Binary file not shown.
182 changes: 175 additions & 7 deletions tex/clustering.tex
@@ -3,7 +3,7 @@ \section{Clustering}

\begin{itemize}
\item Unsupervised learning: detect patterns in unlabeled data.
Sometmes labels are too expensive, unclear, etc. to get them.
Sometimes labels are too expensive, unclear, etc. to get them.
Examples:
\begin{itemize}
\item group e-mails or search results
@@ -12,11 +12,34 @@ \end{itemize}
\end{itemize}
\item Useful when you don't know what you are looking for.
\item Requires a definition of "similar". One option: small (squared) Euclidean distance.
\item You can label then use the clusters, or use the clusters for the next level of anlaysis.
\item You can label then use the clusters, or use the clusters for the next level of analysis.
\end{itemize}

\subsection{K-Means}
An iterative clustering algorithm. \hfill \\
\begin{itemize}
\item An iterative clustering algorithm. \hfill \\
\item No step size. Discrete optimization.
\item Hard assignments. Each point is assigned to one and only one cluster.
\item Will converge, but may converge to a local (not global) optimum.
\begin{itemize}
\item Every time you start the algorithm, you could end up in a different place.
\item Can run it a bunch of times. % week 9 audio
\item You are running a non-convex optimization: your final output is dependent on your initialization.
\end{itemize}
\item You have to choose the number of clusters.
\item Objective: minimize the distances between each point and closest center.
\item You want your output to have a large distance between clusters and a small distance between points in a cluster. (intra vs inter cluster distance).
You want it to latch onto clumps of the data that are far apart from each other. % week 9 audio
\begin{itemize}
\item intra: \hfill \\
E.g. measure $|x_i - c_i|_2^2$ for each cluster.
\item inter: \hfill \\
Dist between closest two points in different clusters. \hfill \\
Distance between means. \hfill \\
Standard deviation of cluster distances. \hfill \\
\end{itemize}
\end{itemize}

Pick K random points as cluster means: $c^1, \dots, c^k$. \hfill \\
Alternate:
\begin{itemize}
@@ -26,7 +49,7 @@ \subsection{K-Means}
Stop when no points' assignments change. \hfill \\

Minimizing a loss that is a function of the points, assignments, and means:
$$ L( \{ x*i \}, \{ a*j \}, \{ c*k \}) = \sum_i dist(x^i, c^{a^i}$$
$$ L( \{ x^i \}, \{ a^j \}, \{ c^k \}) = \sum_i dist(x^i, c^{a^i})$$
Coordinate gradient descent on L. \hfill \\

More formally:
@@ -35,7 +58,6 @@ \subsection{K-Means}
\item For $ t = 1 \dots T$: (or stop if assignments don't change): \hfill \\
Fix means ($c$) while you change the assignments ($a$): \hfill \\
\begin{itemize}

\item for $ j = 1 \dots n$: (recompute cluster assignments):
$$ a^j = \argmin_i dist(x^j, c^i) $$
\end{itemize}
@@ -45,8 +67,154 @@ \end{itemize}
\end{itemize}
Note: the point $y$ with minimum squared Euclidean distance to a set of points $\{x\}$ is their mean.
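A minimal NumPy sketch of the algorithm above (not from the course materials; the random initialization and the convergence check on the means are assumptions):

\begin{verbatim}
# Lloyd's algorithm sketch; assumes only numpy.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Pick k random points as the initial cluster means c^1, ..., c^k.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: a^j = argmin_i dist(x^j, c^i).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[assign == i].mean(axis=0)
                                if np.any(assign == i) else centers[i]
                                for i in range(k)])
        if np.allclose(new_centers, centers):
            break  # assignments/means stopped changing
        centers = new_centers
    return centers, assign
\end{verbatim}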

\includegraphics[width=3.3in]{figures/kmeans_algorithm_example.pdf}
\includegraphics[width=3.6in]{figures/kmeans_algorithm_example.pdf}

\subsection{K-Means gets stuck in local optima.}
\subsubsection{K-Means gets stuck in local optima.}

\includegraphics[width=1.8in]{figures/k-means_gets_stuck.pdf}

\subsection{Agglomerative Clustering}
First merge very similar instances. \hfill \\
Then incrementally build larger clusters out of smaller clusters. \hfill \\
By limiting the merge distance, we can control the number of clusters.
\begin{itemize}
\item Distance = 0 means each point is its own cluster.
\item Distance = infinity means all points are in one cluster.
\end{itemize}

\includegraphics[width=1.0in]{figures/agg_clustering.pdf}

Algorithm:
\begin{itemize}
\item Maintain a set of clusters.
\item Initially each instance is its own cluster
\item Repeat:
\begin{itemize}
\item pick the two closest clusters
\item merge them into a new cluster
\item stop when there is only one cluster left.
\end{itemize}
\item Produces not one clustering, but a family of clusterings, represented by a dendrogram.

\includegraphics[width=1.0in]{figures/dendogram.pdf}
\end{itemize}

\includegraphics[width=2.7in]{figures/agglomerative_clustering_distance_options.pdf}
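A short sketch using SciPy's hierarchical clustering (assuming scipy is available; the linkage method corresponds to the distance options in the figure above, and the data here is made up):

\begin{verbatim}
# Agglomerative clustering sketch with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)            # toy 2-D data
Z = linkage(X, method='average')     # build the full merge tree (dendrogram)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the tree into 3 clusters
\end{verbatim}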

A metric of how well the clustering works combines the intra-cluster distance (first term) and the inter-cluster distance (second term):
$$ S_1 = \sum_{j=1}^3 \sum_{i \in \text{cluster } j} |x_i - c_j|_2^2 + \sum_{i,j} |c_j - c_i|_2^2 $$ % wk 9 audio

You have to be extremely lucky to find a data set where the result isn't dependent on the start.

Can run it a bunch of times. % week 9 audio
For each pair of points, we have a vote: \hfill \\
Do they belong to the same cluster? \hfill \\
Look at the score from the clustering algorithm. \hfill \\
We get a full graph where edge scores are "do they belong to the same cluster" (and more audio I missed?) \hfill \\
Then you need to find out which components are connected. \hfill \\

For each pair of points, we have an edge distance. \hfill \\
Within-cluster edges should be strong edges. \hfill \\
This strength should be common across clustering results. \hfill \\
Cut the graph into three pieces. \hfill \\
The score of the cut is the summation of the edges you break. \hfill \\
Cutting edges with small scores is good. \hfill \\


\subsection{Probabilistic Clustering}
\begin{itemize}
\item Can use a probabilistic model that allows cluster overlaps, clusters of different sizes, etc.
\item You can tell a generative story for the data. \hfill \\
$P(X|Y) P(Y)$ is common.
\item The challenge: estimate model parameters without labeled data.
\end{itemize}

\subsection{Gaussian Mixture Models}
\begin{itemize}
\item We have clumps of data. Each clump is described with a gaussian. % wk 10 audio
\item Like softening k-means. You belong to cluster 1 with a score of 0.1, cluster 2 with score of 0.3, cluster 3 with score of 0.6
\item Think of clusters as probabilistic.
\item Assume m-dimensional data points.
\item P(Y) is still multinomial, with k classes.
\item $P(\mathbb{X} | Y=i), i=1 \dots k$ are $k$ multivariate Gaussians.
\begin{itemize}
\item mean $\mu_i$ is an $m$-dimensional vector.
\item variance $\Sigma_i$ is an $m$ by $m$ matrix.
\item $|X|$ is the determinant of the matrix $X$.
\end{itemize}
\end{itemize}

$$ P(X=x | Y=i) = \frac{1}{\sqrt{(2 \pi)^m | \Sigma_i |}} \exp \left( -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x- \mu_i) \right) $$
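A small sketch of evaluating this density with NumPy (a straight transcription of the formula above, not library code):

\begin{verbatim}
import numpy as np

def gaussian_density(x, mu, Sigma):
    m = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm
\end{verbatim}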

\subsubsection{GMM is not Gaussian Naive Bayes}
(We did GNB before logistic regression) \hfill \\
Gaussian Naive Bayes : multinomial over clusters $Y$, Gaussian over each $X_i$ given $Y$:
$$ P(Y_i = y_k) = \theta_k $$
(Again, $\theta$ is the model parameters)
$$ P(X_i = x | Y = y_k) = \frac{1}{\sigma_{ik} \sqrt{2 \pi}} \exp \left( \frac{-(x - \mu_{ik})^2}{2 \sigma_{ik}^2} \right) $$
This would assume the input dimensions $X_i$ do not co-vary. \hfill \\

If the input dimensions $X_i$ do co-vary, we can use Gaussian Mixture Models.

\subsubsection{Gaussian Mixture Model Assumption}
We want to do something like MLE but now we have multiple Gaussians. \hfill \\
You don't know which label should be used for each data point (which is red, blue, or green). \hfill \\
Need to guess $k$ Gaussians without knowing the $\mu$s. \hfill \\

You can marginalize: \hfill \\
Model probability without knowing who belongs to who: marginalize over all possible y values. \hfill \\
You are estimating $P(X|Y)$, but you don't know $Y$. \hfill \\
You can get rid of $Y$ and get $P(X)$ by summing over $y_i$. \hfill \\
Get $Y$ out of the equation by summing over all possible values. \hfill \\
If it was a probability table, we would be losing a column. \hfill \\


\begin{itemize}
\item $P(Y)$: there are $k$ components
\item $P(X | Y)$: each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
\item Assume each of the features are independent of each other. \hfill \\ % week 10 audio
Then can write down $P(X|Y)$ as prod of $P(X_i | Y)$. \hfill \\ % week 10 audio
Each of them will be a Gaussian distribution. \hfill \\ % week 10 audio
\item Can encode the whole $P(X|Y)$ with a multi-dimensional gaussian. \hfill \\ % week 10 audio
For 2D data, this gives a circle. \hfill \\ % week 10 audio
For 3D data, this gives a bump. \hfill \\ % week 10 audio
When we go to 100-dimensional space, we still have a $\mu$. \hfill \\ % week 10 audio
The distribution is 100-dimensional. \hfill \\ % week 10 audio
$\Sigma$ in 100-dimensional space is a 100 by 100 covariance matrix. \hfill \\ % week 10 audio
\end{itemize}

Each data point is sampled from a \textbf{generative process}
\begin{itemize}
\item Pick a component at random: \hfill \\
choose component $i$ with probability $P(y=i)$
\item Datapoint $\sim N(\mu_i, \Sigma_i)$
\end{itemize}

\includegraphics[width=1.0in]{figures/GMM_cartoon.pdf}
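A sketch of this generative process (the mixture weights, means, and covariances below are made-up toy values):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.5, 0.3, 0.2])                       # P(Y = i)
means = [np.zeros(2), np.array([3., 3.]), np.array([-3., 2.])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2., 0.5])]

def sample_gmm(n):
    ys = rng.choice(len(weights), size=n, p=weights)      # pick a component
    xs = np.array([rng.multivariate_normal(means[y], covs[y]) for y in ys])
    return xs, ys
\end{verbatim}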

\subsubsection{Supervised MLE for GMM}
(Detour/review) \hfill \\

How do we estimate parameters for Gaussian Mixtures with fully supervised data? \hfill \\
Define objective and solve optimization: \hfill \\
From above:
$$ P(X=x | Y=i) = \frac{1}{\sqrt{(2 \pi)^m | \Sigma_i |}} \exp \left( -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x- \mu_i) \right) $$
And we know $ \displaystyle \mu_{ML} = \frac{1}{n} \sum_{i=1}^n x^i$ and $ \displaystyle \Sigma_{ML} = \frac{1}{n} \sum_{i=1}^n (x^i - \mu_{ML}) (x^i - \mu_{ML})^T$ \hfill \\
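A sketch of the fully supervised case (labels $y$ known), where each component's parameters are just per-class sample statistics:

\begin{verbatim}
import numpy as np

def supervised_gmm_mle(X, y, k):
    params = []
    for i in range(k):
        Xi = X[y == i]
        mu = Xi.mean(axis=0)                   # class mean
        diff = Xi - mu
        Sigma = diff.T @ diff / len(Xi)        # MLE covariance (divide by n)
        prior = len(Xi) / len(X)               # P(Y = i)
        params.append((prior, mu, Sigma))
    return params
\end{verbatim}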

But we don't know Y, so we can't do that. \hfill \\

Instead, we maximize the marginal likelihood. (marginal means a variable is integrated out).
$$ \argmax_{\theta} \prod_j P(x^j; \theta) = \argmax \prod_j \sum_{i=1}^k P(y^j=i, x^j; \theta) $$

This is always a hard problem. \hfill \\
There is usually no closed form solution. \hfill \\
Even when $P(X, Y; \theta)$ is convex, $P(X; \theta)$ generally isn't. \hfill \\
For all but the simplest $P(X; \theta)$, we will also have to do gradient ascent, in a big messy space with lots of local optima. \hfill \\
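A sketch of the marginal log-likelihood being maximized, summing $y$ out with log-sum-exp for numerical stability (reuses the gaussian_density sketch from above):

\begin{verbatim}
import numpy as np

def gmm_marginal_log_likelihood(X, params):
    # params: list of (prior, mu, Sigma) triples, one per component.
    total = 0.0
    for x in X:
        log_joint = [np.log(prior) + np.log(gaussian_density(x, mu, Sigma))
                     for prior, mu, Sigma in params]
        total += np.logaddexp.reduce(log_joint)   # log sum_i P(y=i, x)
    return total
\end{verbatim}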

\subsubsection{Simple GMM example: learn means only}
\includegraphics[width=2.5in]{figures/gmm_for_means_only.pdf}

We solve this using EM below.

\includegraphics[width=2.5in]{figures/learning_general_mixtures_of_Gaussians.pdf}

11 changes: 10 additions & 1 deletion tex/expectation_maximization.tex
@@ -1,2 +1,11 @@
\section{Expectation Maximization}
\smallskip \hrule height 2pt \smallskip
\smallskip \hrule height 2pt \smallskip

A clever method for maximizing marginal likelihood, where you alternate between computing an expectation and a maximization.

It is not magic: it is still optimizing a non-convex function with lots of local optima. The computations are just easier.

\begin{itemize}
\item as in GMM: $$ \argmax_{\theta} \prod_j P(x^j; \theta) = \argmax \prod_j \sum_{i=1}^k P(y^j=i, x^j; \theta) $$

\end{itemize}
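A toy sketch of EM for the means-only GMM case (1-D data; a known shared variance and uniform mixing weights are simplifying assumptions, not from the notes):

\begin{verbatim}
import numpy as np

def em_means_only(x, k, sigma2=1.0, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mus = rng.choice(x, size=k, replace=False)   # initial guesses for the means
    for _ in range(n_iters):
        # E-step: responsibilities P(y=i | x_j) under the current means.
        logw = -(x[:, None] - mus[None, :]) ** 2 / (2 * sigma2)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        # M-step: each mean is the responsibility-weighted average of the data.
        mus = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)
    return mus
\end{verbatim}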
13 changes: 11 additions & 2 deletions tex/math_stat_review.tex
@@ -196,12 +196,21 @@ \subsection{Entropy and Information Gain}
\end{itemize}




\subsection{Bits}
If you use log base 2 for entropy, the resulting units are called bits (short for binary digits). \hfill \\ % book pg 57
How many things can you encode in 15 bits? $2^{15}$. \hfill \\ % 1/11/2015 Lecture


\subsection{Common notation}
\textbf{semicolon versus $|$ in probabilities}: \hfill \\
E.g. $P(X ; \theta)$ vs $P(X | \theta)$

$|$ is for random variables and $;$ is for parameters.

Andrew Ng verbalizes the semicolon as "parameterized by."
So $f(x ; \theta)$ would be spoken as "f of x parameterized by theta"





82 changes: 77 additions & 5 deletions tex/vocab.tex
@@ -26,6 +26,8 @@ \section{General Vocab}
%\item \textbf{}
\item \textbf{affine}: indicates that the subspace need not pass through the origin. % Intro to statistical learning Ch 9.1
\item \textbf{support vector}: data points that ``support'' the maximal margin hyperplane in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well. % Intro to statistical learning Ch 9.1
\item \textbf{marginal likelihood}: a likelihood in which some variables have been marginalized out, e.g. $P(X ; \theta) = \sum_y P(X, Y=y ; \theta)$.
\item \textbf{marginalized out} = integrated out % https://en.wikipedia.org/wiki/Marginal_likelihood

\end{itemize}

@@ -62,7 +64,48 @@ \subsubsection{Size for modeling P$(Y=y | X)$}

What is the order of the size of the parameters you need to do the full conditional probability?
For Naive Bayes it is d; \textbf{linear}. And the full conditional would be exponential!


\subsubsection{Maximum Likelihood Estimation (MLE)}

\textbf{Take log, take derivative, set equal to zero.} \hfill \\

Memorize. Likelihood is DATA. \hfill \\ % e-mail to self 2/1/2015

The best $\mu$ for a Gaussian is the sample mean:
it maximizes the probability of this data being produced by the distribution.
The data is most likely to be generated if the mean is that $\mu$.
We did the same for $\sigma$ and found the best estimate is the sample variance. \hfill \\
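A worked instance of the recipe for the mean of a one-dimensional Gaussian with known $\sigma$ (standard derivation, not from the notes):
$$ \ln \mathcal{L}(\mu) = \sum_{i=1}^n \ln \left( \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} \right) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 $$
$$ \frac{\partial \ln \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \quad \Rightarrow \quad \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i $$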

\textbf{Wikipedia:}

To use the method of maximum likelihood, one first specifies the joint density function for all observations. For an independent and identically distributed sample, this joint density function is:

$$ f(x_1,x_2,\ldots,x_n\mid\theta) = f(x_1\mid \theta)\times f(x_2|\theta) \times \cdots \times f(x_n\mid \theta). $$

Now we look at this function from a different perspective by considering the observed values $x_1, x_2, \dots, x_n$ to be fixed "parameters" of this function, whereas $\theta$ will be the function's variable and allowed to vary freely; this function will be called the likelihood:


$$ \mathcal{L}(\theta\,;\,x_1,\ldots,x_n) = f(x_1,x_2,\ldots,x_n\mid\theta) = \prod_{i=1}^n f(x_i\mid\theta) $$

Note that " ; " denotes a separation between the two input arguments: $\theta$ and the observations $x_1,\ldots, x_n$.

In practice it is often more convenient to work with the logarithm of the likelihood function, called the log-likelihood:


$$ \ln\mathcal{L}(\theta\,;\,x_1,\ldots,x_n) = \sum_{i=1}^n \ln f(x_i\mid\theta) $$

or the average log-likelihood:

$$ \hat\ell = \frac1n \ln\mathcal{L} $$

The hat over $\ell$ indicates that it is akin to some estimator. Indeed, $\hat{\ell}$ estimates the expected log-likelihood of a single observation in the model.

The method of maximum likelihood estimates $\theta_0$ by finding a value of $\theta$ that maximizes $\hat\ell(\theta;x)$. This method of estimation defines a maximum-likelihood estimator (MLE) of $\theta_0$:


$$ \{ \hat\theta_\mathrm{mle}\} \subseteq \{ \underset{\theta\in\Theta}{\operatorname{arg\,max}}\ \hat\ell(\theta\,;\,x_1,\ldots,x_n) \} $$


\subsubsection{MLE vs MAP}
\begin{itemize}
\item both MLE and MAP are point estimates. No estimate of uncertainty. % https://www.youtube.com/watch?
@@ -112,8 +155,6 @@ \subsubsection{Generate vs. Discriminative}

One can only distinguish whether it is something; the other can say how likely it is. \hfill \\



Two big categories of approaches for ML: \hfill \\
\underline{Generative}:
Those that try to estimate the joint distributions between labels and features/data.
@@ -166,7 +207,38 @@ \subsubsection{Generate vs. Discriminative}

\subsection{NB, LR, Perceptron}
\includegraphics[width=2.5in]{figures/three_views_of_classification.pdf}



\subsection{Gradient Ascent/Descent vs Coordinate Ascent/Descent}
We discussed gradient descent with logistic regression. \hfill \\
We mentioned coordinate descent when getting ready to talk about EM. \hfill \\

\subsubsection{Gradient descent}
For finding a local minimum of a function; a minimal sketch follows the list below.
\begin{itemize}
\item Takes steps proportional to the negative of the gradient
(or of the approximate gradient) of the function at the current point.
\item If instead one takes steps proportional to the positive of the gradient,
one approaches a local maximum of that function;
the procedure is then known as gradient ascent.
\end{itemize}
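A minimal sketch (the step size and iteration count are arbitrary choices; flipping the sign of the update gives gradient ascent):

\begin{verbatim}
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, n_iters=100):
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w -= eta * grad_f(w)        # step against the gradient
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3); w -> 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0])
\end{verbatim}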

\subsubsection{Coordinate descent}
\begin{itemize}
\item A non-derivative optimization algorithm
\item To find a local minimum of a function, one does line search along
one coordinate direction at the current point in each iteration.
One uses different coordinate directions cyclically throughout the procedure.
\item Has problems with non-smooth functions
% https://en.wikipedia.org/wiki/Coordinate_descent
\item Coordinate descent \textbf{does} have a step size parameter. % week 10 audio
To prevent over-shooting, you may need to take smaller steps.
\item Does converge under two big assumptions: \hfill \\
(1) fixing all coordinates but one and minimizing over that one actually works. \hfill \\
(2) each step makes the loss smaller than it was before. \hfill \\
\item Won't always converge to the global optimum.
\item For coordinate descent, you have to be extremely careful that each step reduces the loss function.
If you can't prove that, you should not use coordinate descent (see the toy sketch below).
\end{itemize}

K-means does this: alternate between holding the assignments and the centers fixed.
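A toy coordinate-descent sketch on a convex quadratic, where each line search has a closed form (the particular function is made up for illustration):

\begin{verbatim}
# Minimize f(x, y) = (x - 1)^2 + (y + 2)^2 + x*y by exact line search
# along one coordinate at a time.
def coordinate_descent(n_iters=100):
    x, y = 0.0, 0.0
    for _ in range(n_iters):
        x = 1.0 - y / 2.0     # argmin_x: set df/dx = 2(x - 1) + y = 0
        y = -2.0 - x / 2.0    # argmin_y: set df/dy = 2(y + 2) + x = 0
    return x, y               # converges to (8/3, -10/3)
\end{verbatim}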
