STA 380: Intro to Machine Learning

Welcome to part 2 of STA 380, a course on machine learning in the MS program in Business Analytics at UT-Austin. All course materials can be found through this GitHub page. Please see the course syllabus for links and descriptions of the readings mentioned below.

Instructors:

  • Dr. James Scott. Office hours Monday to Wednesday, 12:30 to 1:15 PM, in WEL 5.228G.
  • Dr. David Puelz. Office hours Monday to Wednesday, 4:00 to 4:45 PM, in CBA 6.444.

Exercises

The exercises are available here. They are due Sunday, August 18th, at 11:59 PM U.S. Central time. Pace yourself over the next few weeks, and start early on the first couple of problems!

Outline of topics

(1) The data scientist's toolbox

Slides: The data scientist's toolbox
Good data-curation and data-analysis practices; R; Markdown and RMarkdown; the importance of replicable analyses; version control with Git and GitHub.

Readings:

Your assignment after the first class day is to get yourself up and running on GitHub, if you're not already.

(2) Probability: a refresher

Slides: Some fun topics in probability

Optional reference: Chapter 1 of these course notes. The notes contain plenty of more technical material, but Chapter 1 covers the basics of what every data scientist should know about probability.
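
A quick way to warm up is to check probability intuition by simulation. Here is a minimal base-R sketch; the birthday problem is just an illustrative example and is not taken from the course slides:

    # Monte Carlo estimate for the birthday problem:
    # P(at least two of 23 people share a birthday)
    set.seed(1)
    n_sims <- 100000
    shared <- replicate(n_sims, {
      bdays <- sample(1:365, size = 23, replace = TRUE)
      any(duplicated(bdays))
    })
    mean(shared)               # simulation estimate, about 0.507
    1 - prod((365:343) / 365)  # exact answer, for comparison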

(3) Data visualization

Topics: plotting pitfalls; the grammar of graphics; data visualization with R.

Slides:

R materials:
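
To get a feel for the grammar-of-graphics idea before class, here is a minimal ggplot2 sketch; it uses the built-in mtcars data purely for illustration and is not one of the class scripts:

    library(ggplot2)

    # One "grammar of graphics" sentence: data + aesthetic mappings + geoms + facets
    ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
      geom_point(size = 2) +
      geom_smooth(method = "lm", se = FALSE) +
      facet_wrap(~ am, labeller = label_both) +
      labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")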

(4) Neural networks: the basics

Slides introducing the basics of neural networks are here. The accompanying Jupyter notebooks are here.
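
The class materials for this unit are Jupyter notebooks, but if you want a quick feel for a single-hidden-layer network without leaving R, here is a minimal sketch using the nnet package (a package choice made for illustration only, not necessarily the software used in the notebooks):

    library(nnet)

    # A tiny single-hidden-layer classifier on the built-in iris data
    set.seed(1)
    train <- sample(nrow(iris), 100)
    fit <- nnet(Species ~ ., data = iris[train, ], size = 4, decay = 0.1, maxit = 500)

    # Out-of-sample confusion matrix
    preds <- predict(fit, iris[-train, ], type = "class")
    table(predicted = preds, actual = iris$Species[-train])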

(5) Clustering

Basics of clustering; K-means clustering; hierarchical clustering; spectral clustering. A short K-means example in R appears after the readings below.

Slides: Introduction to clustering.

Scripts and data:

Readings:

  • ISL Sections 10.1 and 10.3, or Elements Section 14.3 (more advanced)
  • K-means++ original paper or simple explanation on Wikipedia. This is a better recipe for initializing cluster centers in k-means than the more typical random initialization.
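
As a quick preview of the K-means and hierarchical pieces of this unit, here is a minimal base-R sketch; it uses the built-in iris data for illustration rather than the course datasets in the scripts above:

    # K-means on the numeric columns of iris (scaled, since K-means is distance-based)
    set.seed(1)
    X <- scale(iris[, 1:4])

    # nstart = 25 reruns the random initialization 25 times and keeps the best fit;
    # kmeans++ (see the reading above) is a smarter alternative to purely random starts
    km <- kmeans(X, centers = 3, nstart = 25)
    table(cluster = km$cluster, species = iris$Species)

    # Hierarchical clustering on the same data, for comparison
    hc <- hclust(dist(X), method = "ward.D2")
    plot(hc, labels = FALSE, hang = -1)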

(6) Dimensionality reduction: PCA and tSNE

Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (tSNE). A short example of both appears after the readings below.

Slides: Introduction to PCA and tSNE

Scripts and data for class:

Readings:

  • ISL Section 10.2 for the basics or Elements Chapter 14.5 (more advanced)
  • Shalizi Chapters 18 and 19 (more advanced). In particular, Chapter 19 has a lot more advanced material on factor models, beyond what we covered in class.
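
As a small preview of both methods, here is a minimal sketch; the PCA part is base R, and the tSNE part assumes the Rtsne package is installed (a package choice for illustration, not necessarily what we use in class):

    # PCA with base R: center and scale, then look at the variance explained
    X <- scale(iris[, 1:4])
    pc <- prcomp(X)
    summary(pc)                                      # proportion of variance by component
    plot(pc$x[, 1:2], col = iris$Species, pch = 19)

    # tSNE with the Rtsne package; duplicate rows must be dropped first
    library(Rtsne)
    set.seed(1)
    ts <- Rtsne(unique(X), dims = 2, perplexity = 30)
    plot(ts$Y, col = iris$Species[!duplicated(X)], pch = 19)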

(7) Networks and association rules

Networks and association rule mining.

Slides: Intro to networks. Note: these slides refer to "lastfm.R", which is the same script as "playlists.R" below.

Software you'll need:

Scripts and data:
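
For the association-rules half of this unit, here is a minimal sketch using the arules package and its bundled Groceries data (the class scripts use the playlists data instead):

    library(arules)

    # Groceries ships with arules as a "transactions" object: one basket per row
    data("Groceries")
    summary(Groceries)

    # Mine rules that meet minimum support and confidence thresholds
    rules <- apriori(Groceries,
                     parameter = list(support = 0.01, confidence = 0.2, minlen = 2))

    # Look at the highest-lift rules
    inspect(head(sort(rules, by = "lift"), 5))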

(8) Text data

Co-occurrence statistics; naive Bayes; TF-IDF; topic models; vector-space models of text (if time allows).

Slides:

Scripts and data:
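
As a small preview of the TF-IDF piece, here is a minimal sketch with the tm package on a toy three-document corpus (purely for illustration; the class scripts use real text data):

    library(tm)

    # A toy corpus of three "documents"
    docs <- c("the cat sat on the mat",
              "the dog chased the cat",
              "dogs and cats make good pets")
    corpus <- VCorpus(VectorSource(docs))

    # Raw term counts, then TF-IDF weighting: terms common to every document get down-weighted
    dtm_tf    <- DocumentTermMatrix(corpus)
    dtm_tfidf <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

    inspect(dtm_tf)
    inspect(dtm_tfidf)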

(9) Treatment effects

Treatment effects; multi-armed bandits and Thompson sampling; high-dimensional treatment effects with the lasso.

Slides:

Scripts and data:
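
For the Thompson-sampling piece, here is a minimal base-R sketch of a two-armed Bernoulli bandit (the arm success rates are made up for illustration):

    # Thompson sampling for a two-armed Bernoulli bandit with Beta(1, 1) priors
    set.seed(1)
    true_p <- c(0.05, 0.10)   # success rates, unknown to the algorithm
    wins <- losses <- c(0, 0)

    for (t in 1:5000) {
      # Draw from each arm's Beta posterior and play the arm with the larger draw
      draws <- rbeta(2, 1 + wins, 1 + losses)
      arm <- which.max(draws)
      reward <- rbinom(1, 1, true_p[arm])
      wins[arm]   <- wins[arm] + reward
      losses[arm] <- losses[arm] + (1 - reward)
    }

    wins + losses                     # plays per arm: most go to the better arm
    (1 + wins) / (2 + wins + losses)  # posterior mean success rate for each arm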
