STA 380: Intro to Machine Learning

Welcome to part 2 of STA 380, a course on machine learning in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Instructors:

  • Dr. James Scott. Office hours Monday to Wednesday, 12:30 to 1:15 PM, in WEL 5.228G.
  • Dr. David Puelz. Office hours Monday to Wednesday, 4:00 to 4:45 PM, in CBA 6.444.

Exercises

The exercises are available here. They are due Sunday, August 18th, at 11:59 PM U.S. Central time. Pace yourself over the next few weeks, and start early on the first couple of problems!

Outline of topics

(1) The data scientist's toolbox

Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

Your assignment after the first class day is to get yourself up and running on GitHub, if you're not already.

(2) Probability: a refresher

Slides: Some fun topics in probability

Optional reference: Chapter 1 of these course notes. The notes contain plenty of more technical material, but Chapter 1 covers the basics of what every data scientist should know about probability.
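
A quick way to warm up is to check probability intuition by simulation. Here is a minimal base-R sketch; the birthday problem is just an illustrative example and is not taken from the course slides:

    # Monte Carlo estimate for the birthday problem:
    # P(at least two of 23 people share a birthday)
    set.seed(1)
    n_sims <- 100000
    shared <- replicate(n_sims, {
      bdays <- sample(1:365, size = 23, replace = TRUE)
      any(duplicated(bdays))
    })
    mean(shared)               # simulation estimate, about 0.507
    1 - prod((365:343) / 365)  # exact answer, for comparison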

(3) Data visualization

Topics: plotting pitfalls; the grammar of graphics; data visualization with R.

Slides:

R materials:
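
To get a feel for the grammar-of-graphics idea before class, here is a minimal ggplot2 sketch; it uses the built-in mtcars data purely for illustration and is not one of the class scripts:

    library(ggplot2)

    # One "grammar of graphics" sentence: data + aesthetic mappings + geoms + facets
    ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
      geom_point(size = 2) +
      geom_smooth(method = "lm", se = FALSE) +
      facet_wrap(~ am, labeller = label_both) +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")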

(4) Neural networks: the basics

Slides introducing the basics of neural networks are here. The accompanying Jupyter notebooks are here.
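
The class materials for this unit are Jupyter notebooks, but if you want a quick feel for a single-hidden-layer network without leaving R, here is a minimal sketch using the nnet package (a package choice made for illustration only, not necessarily the software used in the notebooks):

    library(nnet)

    # A tiny single-hidden-layer classifier on the built-in iris data
    set.seed(1)
    train <- sample(nrow(iris), 100)
    fit <- nnet(Species ~ ., data = iris[train, ], size = 4, decay = 0.1, maxit = 500)

    # Out-of-sample confusion matrix
    preds <- predict(fit, iris[-train, ], type = "class")
    table(predicted = preds, actual = iris$Species[-train])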

(5) Clustering

Basics of clustering; K-means clustering; hierarchical clustering; spectral clustering. A short K-means example in R appears after the readings below.

Slides: Introduction to clustering.

Scripts and data:

Readings:

  • ISL Sections 10.1 and 10.3, or Elements Section 14.3 (more advanced)
  • K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
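
As a quick preview of the K-means and hierarchical pieces of this unit, here is a minimal base-R sketch; it uses the built-in iris data for illustration rather than the course datasets in the scripts above:

    # K-means on the numeric columns of iris (scaled, since K-means is distance-based)
    set.seed(1)
    X <- scale(iris[, 1:4])

    # nstart = 25 reruns the random initialization 25 times and keeps the best fit;
    # kmeans++ (see the reading above) is a smarter alternative to purely random starts
    km <- kmeans(X, centers = 3, nstart = 25)
    table(cluster = km$cluster, species = iris$Species)

    # Hierarchical clustering on the same data, for comparison
    hc <- hclust(dist(X), method = "ward.D2")
    plot(hc, labels = FALSE, hang = -1)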

(6) Dimensionality reduction: PCA and tSNE

Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (tSNE). A short example of both appears after the readings below.

Slides: Introduction to PCA and tSNE

Scripts and data for class:

Readings:

  • ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
  • Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.
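
As a small preview of both methods, here is a minimal sketch; the PCA part is base R, and the tSNE part assumes the Rtsne package is installed (a package choice for illustration, not necessarily what we use in class):

    # PCA with base R: center and scale, then look at the variance explained
    X <- scale(iris[, 1:4])
    pc <- prcomp(X)
    summary(pc)                                      # proportion of variance by component
    plot(pc$x[, 1:2], col = iris$Species, pch = 19)

    # tSNE with the Rtsne package; duplicate rows must be dropped first
    library(Rtsne)
    set.seed(1)
    ts <- Rtsne(unique(X), dims = 2, perplexity = 30)
    plot(ts$Y, col = iris$Species[!duplicated(X)], pch = 19)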

(7) Networks and association rules

Networks and association rule mining.

Slides: Intro to networks. Note: these slides refer to "lastfm.R", which is the same script as "playlists.R" below.

Software you'll need:

Scripts and data:
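
For the association-rules half of this unit, here is a minimal sketch using the arules package and its bundled Groceries data (the class scripts use the playlists data instead):

    library(arules)

    # Groceries ships with arules as a "transactions" object: one basket per row
    data("Groceries")
    summary(Groceries)

    # Mine rules that meet minimum support and confidence thresholds
    rules <- apriori(Groceries,
                     parameter = list(support = 0.01, confidence = 0.2, minlen = 2))

    # Look at the highest-lift rules
    inspect(head(sort(rules, by = "lift"), 5))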

(8) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).

Slides:

Scripts and data:
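
As a small preview of the TF-IDF piece, here is a minimal sketch with the tm package on a toy three-document corpus (purely for illustration; the class scripts use real text data):

    library(tm)

    # A toy corpus of three "documents"
    docs <- c("the cat sat on the mat",
              "the dog chased the cat",
              "dogs and cats make good pets")
    corpus <- VCorpus(VectorSource(docs))

    # Raw term counts, then TF-IDF weighting: terms common to every document get down-weighted
    dtm_tf    <- DocumentTermMatrix(corpus)
    dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

    inspect(dtm_tf)
    inspect(dtm_tfidf)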

(9) Treatment effects

Treatment effects; multi-armed bandits and Thompson sampling; high-dimensional treatment effects with the lasso.

Slides:

Scripts and data:
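
For the Thompson-sampling piece, here is a minimal base-R sketch of a two-armed Bernoulli bandit (the arm success rates are made up for illustration):

    # Thompson sampling for a two-armed Bernoulli bandit with Beta(1, 1) priors
    set.seed(1)
    true_p <- c(0.05, 0.10)   # success rates, unknown to the algorithm
    wins <- losses <- c(0, 0)

    for (t in 1:5000) {
      # Draw from each arm's Beta posterior and play the arm with the larger draw
      draws <- rbeta(2, 1 + wins, 1 + losses)
      arm <- which.max(draws)
      reward <- rbinom(1, 1, true_p[arm])
      wins[arm]   <- wins[arm] + reward
      losses[arm] <- losses[arm] + (1 - reward)
    }

    wins + losses                     # plays per arm: most go to the better arm
    (1 + wins) / (2 + wins + losses)  # posterior mean success rate for each arm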
