Data Science Fundamentals for Programmers

Join us for a one-week immersive course that takes you deep into the foundations of Data Science, where you will learn to create algorithms that learn from data and can be put into production today. You will gain statistical intuition and learn to think probabilistically through code, the medium you already know and love. All-day schedules consist of interactive exercises, competitions, and a final personal project.

Requirements

This course is for programmers who have little or no background in machine learning or statistics, but who bring a love of hacking and a basic working knowledge of at least one language with statistical libraries (Python, R, Java/Clojure/Scala, Matlab, etc.).

Course structure

The mornings, from 9am - 12pm every day, consist of lectures focused on building theoretical knowledge. The afternoons, from 1pm - 6pm, consist of open-ended exercises, completed in any language or framework the student prefers, designed to demystify the theory and show the underlying hackability of these ideas.

Everyone is encouraged to bring their own data, ideas, products, and problems to work on. On the last day, each student starts a personal project that can be continued after the course, professionally or personally.

Pre-course

  • Intro to Probability / Statistics
  • Statistical Programming in Python / R

Day 0

On the Sunday before class starts, we hold a short afternoon session to go over probability basics, followed by beers.

Day 1: Kaggle Competitions

The first day is about getting your hands dirty: implementing real prediction algorithms with the latest libraries and tools, leaving theory in the dust and gaining familiarity through usage. We introduce a series of modern tools, and everyone competes in a data hackathon that lasts into the evening. A sketch of the basic workflow follows the exercise list below.

Topics:

  • A Practical Introduction to Errors
  • Languages and Libraries for Machine Learning
  • Modern Winning Algorithms

Exercises:

  • Data Hackathon: Kaggle Competitions.
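
To make this concrete, here is a minimal sketch of the day's workflow in Python. It assumes scikit-learn, uses a synthetic dataset in place of real competition data, and picks gradient boosting as a stand-in for the "modern winning algorithms" covered in lecture:

```python
# Minimal sketch: fit a strong tabular model and measure held-out error.
# Assumes scikit-learn; the synthetic data stands in for a Kaggle dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a validation set so we measure generalization, not memorization.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier()  # a classic Kaggle workhorse
model.fit(X_train, y_train)

# The gap between these two numbers is our first practical look at error.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("val accuracy:  ", accuracy_score(y_val, model.predict(X_val)))
```

The gap between training and validation accuracy printed at the end previews Day 2's discussion of overfitting.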

Day 2: Statistical Learning Theory

Here we go all the way back to the beginning of the field and formulate the problem formally: what does it mean to model the world probabilistically? We then look at two of the most fundamental ideas that will be woven throughout the week: what it means for a model to overfit, and how to trade off a model's bias and variance. Exercises consist of exploring and proving these ideas in code; a sketch of the first one follows the exercise list.

Topics:

  • History of Statistical Learning
  • Formulating the Problem: Approximation vs Estimation Error
  • Controlling the Bayes Risk
  • Overfitting
  • Bias / Variance

Exercises:

  • Perceptron
  • K-nearest Neighbors
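
Here is a minimal sketch of the perceptron exercise, assuming NumPy and labels in {-1, +1}; the update rule is the classic one, and the toy data is illustrative rather than a prescribed solution:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Classic perceptron: nudge (w, b) toward each misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

# Linearly separable toy data: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)

w, b = perceptron(X, y)
print("training accuracy:", (np.sign(X @ w + b) == y).mean())
```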

Day 3: Regression and Continuous Distributions

We push probabilistic thinking further by considering how different assumptions about the randomness in the world can lead to very different conclusions. We introduce Bayesian statistics to show how formalizing prior knowledge allows us to learn more from the data than we otherwise could.

Topics:

  • What does the mean really tell you?
  • Probability distributions and noise
  • Bayes' Rule and Bayesian Regression
  • Latent variables and underfitting!
  • Regularization

Exercises:

  • Condition Numbers and Inversions
  • BYO Regularization
  • Robust Regression
  • Gaussian Processes
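
To show how a prior acts as a regularizer, here is a minimal sketch of Bayesian linear regression in plain NumPy. The prior precision alpha and noise precision beta are illustrative choices; with a Gaussian prior on the weights, the posterior mean coincides with the ridge-regression solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

alpha = 1.0        # prior precision: w ~ N(0, (1/alpha) * I)
beta = 1.0 / 0.25  # noise precision: 1 / sigma^2

# The posterior over weights is Gaussian, N(m, S), with
#   S = (alpha*I + beta*X'X)^{-1}   and   m = beta * S @ X' @ y.
S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
m = beta * S @ X.T @ y

print("posterior mean:", m)                    # shrunk toward 0 by the prior
print("posterior stds:", np.sqrt(np.diag(S)))  # uncertainty, for free
```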

Day 4: Small Shapes in Big Data

Data often consists of many variables, but in order to generalize we believe the same outcomes can be described by far fewer, even if we have to construct those variables ourselves. Here we learn what it means to work with high-dimensional data, and solidify how we think about dimensionality.

Topics:

  • A Brief History of Data Visualization
  • Manifolds, Factors, and Lower-Dimensional Structure
  • Sparsity
  • The Curse of Dimensionality
  • Eigendecomposition & PCA

Exercises:

  • Probabilistic PCA
  • Multidimensional Scaling
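
Here is a minimal sketch of PCA via eigendecomposition, assuming NumPy only: build 3-D data that secretly lives near a 2-D plane, eigendecompose the sample covariance, and project onto the top components:

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-D observations generated from a 2-D latent space, plus a little noise:
# exactly the kind of lower-dimensional structure the day is about.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

top2 = eigvecs[:, ::-1][:, :2]          # top-2 principal directions
Z = Xc @ top2                           # the 2-D representation

print("variance explained:", eigvals[::-1][:2].sum() / eigvals.sum())
```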

Day 5: Simple NLP & Discrete Distributions

We explore several simple techniques in Natural Language Processing (NLP) that are extremely effective in practice, and that let us extend our knowledge of Bayesian formulations, seeing again how reasonable assumptions allow us to learn vast amounts from our data. Students spend the rest of the day working on their final personal project.

Topics:

  • Preprocessing, Tokenization, Bag-of-Words Models, TF-IDF
  • Discrete Distributions
  • Word and Document Embedding
  • Intro to Network/Graph Analysis
  • Optimization
  • Probabilistic Programming
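
For a taste of the first topic, here is a minimal sketch of the bag-of-words / TF-IDF pipeline, assuming scikit-learn and a toy three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()        # tokenizes, counts, and reweights
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # rare, discriminative words score higher
```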

Exercises:

  • Final Personal Project.

Day 5+: Mentoring

Saturday is an optional day to come in and continue working on personal projects, ask questions, etc.
