Data Science Fundamentals for Programmers

Join us for a one-week immersive course that takes you deep into the foundations of Data Science, where you will learn to create algorithms that learn from data and can be put into production today. You will gain statistical intuition and learn to think probabilistically through code, the medium you already know and love. All-day schedules consist of interactive exercises, competitions, and a final personal project.

Requirements

This course is for programmers who have little or no background in machine learning or statistics, but who bring a love of hacking and a basic working knowledge of at least one language with statistical libraries (Python, R, Java/Clojure/Scala, Matlab, etc.).

Course structure

The mornings, from 9am - 12pm every day, consist of lectures focused on building theoretical knowledge. The afternoons, from 1pm - 6pm, consist of open-ended exercises, completed in any language or framework the student prefers, designed to demystify the theory and show the underlying hackability of these ideas.

Everyone is encouraged to bring their own data, ideas, products, and problems to work on. On the last day, each student starts a personal project that can be continued after the course, professionally or personally.

Pre-course

  • Intro to Probability / Statistics
  • Statistical Programming in Python / R

Day 0

On the Sunday before class starts, we hold a short afternoon session to go over probability basics, followed by beers.

Day 1: Kaggle Competitions

The first day is about getting your hands dirty: implementing real prediction algorithms with the latest libraries and tools, leaving theory in the dust and gaining familiarity through usage. We introduce a series of modern tools, and everyone competes in a data hackathon that lasts into the evening. A sketch of the basic workflow follows the exercise list below.

Topics:

  • A Practical Introduction to Errors
  • Languages and Libraries for Machine Learning
  • Modern Winning Algorithms

Exercises:

  • Data Hackathon: Kaggle Competitions.
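
To make this concrete, here is a minimal sketch of the day's workflow in Python. It assumes scikit-learn, uses a synthetic dataset in place of real competition data, and picks gradient boosting as a stand-in for the "modern winning algorithms" covered in lecture:

```python
# Minimal sketch: fit a strong tabular model and measure held-out error.
# Assumes scikit-learn; the synthetic data stands in for a Kaggle dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a competition dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a validation set so we measure generalization, not memorization.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier()  # a classic Kaggle workhorse
model.fit(X_train, y_train)

# The gap between these two numbers is our first practical look at error.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("val accuracy:  ", accuracy_score(y_val, model.predict(X_val)))
```

The gap between training and validation accuracy printed at the end previews Day 2's discussion of overfitting.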

Day 2: Statistical Learning Theory

Here we go all the way back to the beginning of the field and formulate the problem formally: what does it mean to model the world probabilistically? We then look at two of the most fundamental ideas that will be woven throughout the week: what it means for a model to overfit, and how to trade off a model's bias and variance. Exercises consist of exploring and proving these ideas in code; a sketch of the first one follows the exercise list.

Topics:

  • History of Statistical Learning
  • Formulating the Problem: Approximation vs Estimation Error
  • Controlling the Bayes Risk
  • Overfitting
  • Bias / Variance

Exercises:

  • Perceptron
  • K-nearest Neighbors
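
Here is a minimal sketch of the perceptron exercise, assuming NumPy and labels in {-1, +1}; the update rule is the classic one, and the toy data is illustrative rather than a prescribed solution:

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Classic perceptron: nudge (w, b) toward each misclassified point."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

# Linearly separable toy data: the label is the sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)

w, b = perceptron(X, y)
print("training accuracy:", (np.sign(X @ w + b) == y).mean())
```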

Day 3: Regression and Continuous Distributions

We push probabilistic thinking further by considering how different assumptions about the randomness in the world can lead to very different conclusions. We introduce Bayesian statistics to show how formalizing prior knowledge allows us to learn more from the data than we otherwise could.

Topics:

  • What does the mean really tell you?
  • Probability distributions and noise
  • Bayes' Rule and Bayesian Regression
  • Latent variables and underfitting!
  • Regularization

Exercises:

  • Condition Numbers and Inversions
  • BYO Regularization
  • Robust Regression
  • Gaussian Processes
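
To show how a prior acts as a regularizer, here is a minimal sketch of Bayesian linear regression in plain NumPy. The prior precision alpha and noise precision beta are illustrative choices; with a Gaussian prior on the weights, the posterior mean coincides with the ridge-regression solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.5, size=n)

alpha = 1.0        # prior precision: w ~ N(0, (1/alpha) * I)
beta = 1.0 / 0.25  # noise precision: 1 / sigma^2

# The posterior over weights is Gaussian, N(m, S), with
#   S = (alpha*I + beta*X'X)^{-1}   and   m = beta * S @ X' @ y.
S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
m = beta * S @ X.T @ y

print("posterior mean:", m)                    # shrunk toward 0 by the prior
print("posterior stds:", np.sqrt(np.diag(S)))  # uncertainty, for free
```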

Day 4: Small Shapes in Big Data

Data often consists of many variables, but in order to generalize we believe the same outcomes can be described by far fewer, even if we have to construct those variables ourselves. Here we learn what it means to work with high-dimensional data, and solidify how we think about dimensionality.

Topics:

  • A Brief History of Data Visualization
  • Manifolds, Factors, and Lower-Dimensional Structure
  • Sparsity
  • The Curse of Dimensionality
  • Eigendecomposition & PCA

Exercises:

  • Probabilistic PCA
  • Multidimensional Scaling
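
Here is a minimal sketch of PCA via eigendecomposition, assuming NumPy only: build 3-D data that secretly lives near a 2-D plane, eigendecompose the sample covariance, and project onto the top components:

```python
import numpy as np

rng = np.random.default_rng(0)
# 3-D observations generated from a 2-D latent space, plus a little noise:
# exactly the kind of lower-dimensional structure the day is about.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(200, 3))

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(Xc) - 1)         # sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

top2 = eigvecs[:, ::-1][:, :2]          # top-2 principal directions
Z = Xc @ top2                           # the 2-D representation

print("variance explained:", eigvals[::-1][:2].sum() / eigvals.sum())
```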

Day 5: Simple NLP & Discrete Distributions

We explore several simple techniques in Natural Language Processing (NLP) that are extremely effective in practice, and that let us extend our knowledge of Bayesian formulations, seeing again how reasonable assumptions allow us to learn vast amounts from our data. Students spend the rest of the day working on their final personal project.

Topics:

  • Preprocessing, Tokenization, Bag-of-Words Models, TF-IDF
  • Discrete Distributions
  • Word and Document Embedding
  • Intro to Network/Graph Analysis
  • Optimization
  • Probabilistic Programming
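
For a taste of the first topic, here is a minimal sketch of the bag-of-words / TF-IDF pipeline, assuming scikit-learn and a toy three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()        # tokenizes, counts, and reweights
X = vectorizer.fit_transform(corpus)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # rare, discriminative words score higher
```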

Exercises:

  • Final Personal Project.

Day 5+: Mentoring

Saturday is an optional day to come in and continue working on personal projects, ask questions, etc.
