
Commit

through decision trees
JanetMatsen committed Jan 14, 2016
1 parent cf1f7a8 commit 0e16365
Showing 5 changed files with 147 additions and 11 deletions.
Binary file modified ML_cheatsheet.pdf
Binary file not shown.
22 changes: 12 additions & 10 deletions ML_cheatsheet.tex
@@ -1,4 +1,5 @@
\documentclass[10pt,landscape]{article}

\usepackage{multicol}
\usepackage{calc}
\usepackage{ifthen}
@@ -20,10 +21,10 @@
\setlist[description]{leftmargin=0pt}
\usepackage{xfrac}
\usepackage[pdftex,
pdfauthor={William Chen},
pdftitle={Probability Cheatsheet},
pdfsubject={A cheatsheet pdf and reference guide originally made for Stat 110, Harvard's Introduction to Probability course. Formulas and equations for your statistics class.},
pdfkeywords={probability} {statistics} {cheatsheet} {pdf} {cheat} {sheet} {formulas} {equations}
pdfauthor={Janet Matsen},
pdftitle={Machine Learning Cheatsheet},
pdfsubject={Notes from UW CSE 446 Winter 2016},
pdfkeywords={machine learning} {statistics} {cheatsheet} {pdf} {cheat} {sheet} {formulas} {equations}
]{hyperref}
\usepackage{relsize}
\usepackage{rotating}
@@ -32,6 +33,11 @@
\def\independenT#1#2{\mathrel{\setbox0\hbox{$#1#2$}%
\copy0\kern-\wd0\mkern4mu\box0}}

% Janet defined
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}

% Probably from Stat cheatsheet:
\newcommand{\noin}{\noindent}
\newcommand{\logit}{\textrm{logit}}
\newcommand{\var}{\textrm{Var}}
@@ -137,17 +143,13 @@
\hfill \\
\smallskip \hrule height 2pt \smallskip

\input{./tex/essential_ideas.tex}

\input{./tex/math_stat_review.tex}

\input{./tex/decision_trees.tex}


\section{Vocab}
\smallskip \hrule height 2pt \smallskip
\begin{itemize}
\item decision tree
\end{itemize}

Let's do this thing.

\end{multicols*}
98 changes: 98 additions & 0 deletions tex/decision_trees.tex
@@ -1,3 +1,101 @@
\section{Decision Trees}
\smallskip \hrule height 2pt \smallskip

Summary: \hfill \\
\begin{itemize}
\item One of the most popular ML tools. Easy to understand, implement, and use. Computationally cheap (to solve heuristically).
\item Uses information gain to select attributes (ID3, C4.5, \ldots)
\item Presented for classification, but can be used for regression and density estimation too
\item Decision trees will overfit!!!
\item Must use tricks to find ``simple trees'', e.g., (a) fixed depth/early stopping, (b) pruning, (c) hypothesis testing
\item Tree-based methods partition the feature space into a set of rectangles. % Elem of Stat. Learning pg 305
\item Interpretability is a key advantage of the recursive binary tree.
\end{itemize}

Pros:
\begin{itemize}
\item easy to explain to people
\item more closely mirror human decision-making than do the regression and classification approaches
\item can be displayed graphically, and are easily interpreted even by a non-expert
\item can easily handle qualitative predictors without the need to create dummy variables
\end{itemize}

Cons:
\begin{itemize}
\item trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches
\item can be very non-robust. A small change in the data can cause a large change in the final estimated tree
\end{itemize}

Vocab:
\begin{itemize}
\item \textbf{classification tree} - used to predict a qualitative response rather than a quantitative one %ISL pg 311
\item \textbf{regression tree} - predicts a quantitative (continuous) variable.
\item \textbf{depth of tree} - the maximum number of queries that can happen before a leaf is reached and a result obtained %wikipedia
\item \textbf{split} - the test applied at an internal node that partitions the data, e.g.\ on an attribute value or a threshold.
\item \textbf{node} - synonymous with split. A place where you split the data.
\item \textbf{node purity} - a node is pure when all training records that reach it have the same label.
\item \textbf{univariate split} - A split is called univariate if it uses only a single variable, otherwise multivariate.
\item \textbf{multivariate decision tree} - can split on things like A + B or Petal.Width / Petal.Length < 1. If the multivariate split is a conjunction of univariate splits (e.g. A and B), you probably want to put that in the tree structure instead. % http:https://www.ismll.uni-hildesheim.de/lehre/ml-07w/skript/ml-2up-04-decisiontrees.pdf
\item \textbf{univariate decision tree} - a tree with all univariate splits/nodes. E.g. only split on one attribute at a time.
% http:https://www.ismll.uni-hildesheim.de/lehre/ml-07w/skript/ml-2up-04-decisiontrees.pdf
\item \textbf{binary decision tree} - a tree in which every internal node has exactly two children (every split is a yes/no question).
\item \textbf{argmax} - the input that leads to the maximum output
\item \textbf{greedy} - at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step. % An Introduction to Statistical Learning.pdf pdf pg 320
\item \textbf{threshold splits} - splits on a continuous attribute of the form $X_i < t$ vs.\ $X_i \geq t$ for some threshold $t$. % lec http:https://courses.cs.washington.edu/courses/cse446/16wi/Slides/2_DecisionTrees_Part2.pdf
\end{itemize}



Protocol (a code sketch follows the list):
\begin{enumerate}
\item Start from empty decision tree
\item Split on next best attribute (feature).
\begin{itemize}
\item Use something like information gain to select the next attribute: $\displaystyle \argmax_i IG(X_i) = \argmax_i \left[ H(Y) - H(Y \mid X_i) \right]$ %\( \displaystyle \argmin_x \)
\end{itemize}
\item Recurse
\end{enumerate}
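A minimal sketch of this greedy recursion in Python (not the course's code; rows are assumed to be dicts of attribute values, and the helper names are made up): \hfill \\
\begin{verbatim}
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_y p(y) log2 p(y), using empirical label frequencies
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # IG(X_attr) = H(Y) - H(Y | X_attr)
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - cond

def build_tree(rows, labels, attrs):
    # stop: all labels agree, or no attributes left -> majority-label leaf
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    children = {}
    for value in set(row[best] for row in rows):
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     [a for a in attrs if a != best])
    return (best, children)
\end{verbatim}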

When do we stop decision trees?
\begin{itemize}
\item Don't split a node if all matching records have the same output value
\item Only split if your bins will have data in them. His words: ``none of the attributes can create multiple nonempty children.'' He also said ``no attributes can distinguish,'' and showed that for the remaining training data each category had data for only one label. And third: ``if all records have exactly the same set of input attributes then don't recurse.''
\end{itemize}
He noted that you might not want to stop splitting just because every remaining attribute has zero information gain:
you would miss patterns like XOR. %http:https://courses.cs.washington.edu/courses/cse446/16wi/Slides/2_DecisionTrees_Part2.pdf
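A tiny worked example of the XOR caveat, assuming $Y = X_1 \oplus X_2$ with the four input combinations equally likely: \hfill \\
\begin{align*}
H(Y) &= 1 \text{ bit}, \qquad H(Y \mid X_1) = H(Y \mid X_2) = 1 \text{ bit} \\
IG(X_1) &= IG(X_2) = 0, \text{ yet splitting on } X_1 \text{ and then } X_2 \text{ classifies perfectly.}
\end{align*}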

Decision trees will overfit. If your labels have no noise, the training set error is always zero.
To prevent overfitting, we must introduce some bias towards simpler trees.
Many strategies are available for picking simpler trees:
\begin{itemize}
\item Fixed depth
\item Fixed number of leaves
\item Or something smarter\ldots
\end{itemize}

One definition of \underline{overfitting}: If your data is generated from a distribution $D(X,Y)$ and you have a hypothesis space $H$: \hfill \\
Define errors for hypothesis $h \in H$: training error = $error_{train}(h)$, Data (true) error = $error_D(h)$.
The hypothesis $h$ overfits the training data if there exists an $h'$ such that $error_{train}(h) < error_{train}(h')$ and $error_{D}(h) > error_{D}(h')$.
In plain English: if there is an alternative hypothesis that gives you more error on the training data but less error on the true distribution (e.g.\ test data), then you have overfit.
\hfill \\ \hfill \\

\underline{How to Build Small Trees} \hfill \\
Two reasonable approaches:
\begin{itemize}
\item Optimize on the held-out (development) set. If growing the tree larger hurts performance, then stop growing. But this requires a larger amount of data
\item Use statistical significance testing. Test whether the improvement from a split is likely due to noise; if so, don't do the split. Chi-square test with MaxPchance $\approx$ 0.05 (sketch below).
\end{itemize}
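A hedged sketch of that test (the contingency table is child node $\times$ label counts; the helper name and the example numbers are made up): \hfill \\
\begin{verbatim}
from scipy.stats import chi2_contingency

def split_is_significant(counts, max_p_chance=0.05):
    # counts[i][j] = number of training records reaching child i with label j.
    # Keep the split only if the label distribution differs across children
    # by more than chance would explain (p-value below MaxPchance).
    chi2, p_value, dof, expected = chi2_contingency(counts)
    return p_value < max_p_chance

# two children, two labels: child 0 -> [30, 5], child 1 -> [6, 29]
print(split_is_significant([[30, 5], [6, 29]]))   # True (clearly significant)
\end{verbatim}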

\underline{Pruning Trees} \hfill \\
Prune starting at the bottom, not the top: the top is most likely to have your best splits.
That way, a high branch is only cut after all the branches below it have been cut.

\underline{Classification vs. Regression Trees} \hfill \\
In class we mostly discussed nodes with categorical attributes.
You can have continuous attributes (see HW1).
You can also have either discrete or continuous output.
When output is discrete, you can choose your splits based on entropy.
If it is continuous, you need to do something more like least squares.
For regression trees, see pg 306 from \href{http:https://www-bcf.usc.edu/~gareth/ISL/}{ISL} or pg 307 of \href{http:https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf}{ESLII}.
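A minimal sketch of a least-squares threshold split on one continuous attribute (candidate thresholds at midpoints of adjacent sorted values; not the course's implementation): \hfill \\
\begin{verbatim}
def best_threshold_split(x, y):
    # choose t minimizing the SSE of predicting each side's mean
    # (the regression-tree split criterion)
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best_t, best_err = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [yy for xx, yy in pairs if xx < t]
        right = [yy for xx, yy in pairs if xx >= t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
\end{verbatim}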

8 changes: 8 additions & 0 deletions tex/essential_ideas.tex
@@ -0,0 +1,8 @@
\section{Essential ML ideas}
\smallskip \hrule height 2pt \smallskip

\begin{itemize}
\item Never ever touch the test set
\item You know you are overfitting when there is a big gap between train and test results, e.g.\ on a metric like percent wrong.
\item Be comfortable taking a hit on training accuracy if it buys you better performance on held-out data.
\end{itemize}
30 changes: 29 additions & 1 deletion tex/math_stat_review.tex
@@ -10,7 +10,14 @@ \section{Math/Stat Review}
\item[Normalization]: $\sum\limits_x P(X=x) = 1$
\item[Product Rule]: $P(A,B) = P(A \mid B) \cdot P(B)$ % TA lecture 1/7/2015
\item[Sum Rule]: $P(A) = \sum\limits_{b \in \Omega} P(A, B=b)$ % TA lecture 1/7/2015
\end{description}

Vocab:
\begin{itemize}
\item \textbf{likelihood function} $L(\theta \mid O)$: $\theta$ are the unknown parameters and $O$ the observed outcomes. The likelihood is conditioned on the observed $O$ and is a function of the unknown parameters $\theta$; it is not a probability density function.
\item \textbf{``likelihood'' vs.\ ``probability''}: if discrete, $L(\theta \mid O) = P(O \mid \theta)$. If continuous, $P(O \mid \theta) = 0$, so instead we estimate $\theta$ given $O$ by maximizing $L(\theta \mid O) = f(O \mid \theta)$, where $f$ is the pdf associated with the outcomes $O$ (worked example below).
% http:https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability
\end{itemize}
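A small worked example, assuming $n$ i.i.d.\ Bernoulli($\theta$) observations with $k$ successes:
\begin{align*}
L(\theta \mid O) &= \prod_{i=1}^{n} \theta^{o_i}(1-\theta)^{1-o_i} = \theta^{k}(1-\theta)^{n-k} \\
\hat{\theta}_{MLE} &= \argmax_\theta \left[ k \log \theta + (n-k)\log(1-\theta) \right] = \frac{k}{n}
\end{align*}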

\subsection{Law of Total Probability (LOTP)} % from cheat sheet: https://github.com/wzchen/probability_cheatsheet/blob/master/probability_cheatsheet.tex
Let ${ B}_1, { B}_2, { B}_3, ... { B}_n$ be a \emph{partition} of the sample space (i.e., they are disjoint and their union is the entire sample space).
@@ -143,9 +150,30 @@ \subsection{Conditional Entropy}

Note: $H[Y \mid X] \leq H[Y]$: knowing something can't make you know less.

\subsection{Entropy and Information Gain}
\textbf{Information Gain} - $IG(X) = H(Y) - H(Y \mid X)$ \hfill \\
$Y$ is the variable at the parent node (on top); $X$ is the attribute you split on, giving the child nodes below. He might have used lower case. \hfill \\
\textbf{Conditional entropy, given $X = x$}: $H(Y \mid X = x) = -\sum\limits_{y} P(y \mid x) \log_2 P(y \mid x)$ \hfill \\
\hfill \\
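Worked example (made-up counts): a parent node has 3 positive and 3 negative records; splitting on $X$ sends 3 positives and 1 negative to $X=a$, and 0 positives and 2 negatives to $X=b$: \hfill \\
\begin{align*}
H(Y) &= 1 \text{ bit}, \qquad
H(Y \mid X = a) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.81, \qquad
H(Y \mid X = b) = 0 \\
H(Y \mid X) &= \tfrac{4}{6}(0.81) + \tfrac{2}{6}(0) \approx 0.54, \qquad
IG(X) = 1 - 0.54 \approx 0.46 \text{ bits}
\end{align*}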

\begin{itemize}
\item Low uncertainty $\leftrightarrow$ Low entropy.
\item Lowering entropy $\leftrightarrow$ More information gain.
\end{itemize}

The discrete distribution with maximum entropy is the uniform distribution. For K values of X, $H(X) = \log_2 K$ \hfill \\ % book pg 57
Conversely, the distribution with minimum entropy (which is zero) is any delta-function that puts all its mass on one state. Such a distribution has no uncertainty.
\hfill \\
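For example, a uniform distribution over $K = 8$ values has $H(X) = \log_2 8 = 3$ bits, while a delta function has $H(X) = 0$. \hfill \\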

\underline{Binary Entropy Function}: $p(X = 1) = \theta$ and $p(X = 0) = 1 - \theta$
\begin{align*}
H(X) &= - [p(X=1) \log_2 p(X=1)+p(X=0) \log_2 p(X=0)] \\
& = - [\theta \log_2 \theta+(1 - \theta) \log_2(1 - \theta)]
\end{align*}
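For example, plugging in two values of $\theta$:
\begin{align*}
\theta = 0.5&: \quad H(X) = -[0.5\log_2 0.5 + 0.5\log_2 0.5] = 1 \text{ bit (maximum uncertainty)} \\
\theta = 0.9&: \quad H(X) = -[0.9\log_2 0.9 + 0.1\log_2 0.1] \approx 0.47 \text{ bits}
\end{align*}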

\subsection{Bits}
If you use log base 2 for entropy, the resulting units are called bits (short for binary digits). \hfill \\ % book pg 57
How many things can you encode in 15 bits? $2^{15}$. \hfill \\ % 1/11/2015 Lecture



