MSDS621 Introduction to Machine Learning

“In God we trust; all others bring data.” — Attributed to W. Edwards Deming and George Box

This course introduces students to the key processes and concepts of machine learning (for table-like structured data), such as:

data cleaning
dealing with missing data
basic feature engineering
feature selection
model implementation
model training
model assessment
model interpretation

We study a few key models deeply, rather than providing a broad but superficial survey of models. As part of this course, students implement linear and logistic regression with regularization through gradient descent, a Naive Bayes model for text sentiment analysis, decision trees, and random forest models.

Implementing these models yourself is critical to truly understanding them. As Richard Feynman wrote, "What I cannot create, I do not understand." (From his blackboard at the time of his death.) With an intuition behind how the models work, you'll be able to understand and predict their behavior much more easily.

Administrivia

INSTRUCTOR. Terence Parr. I’m a professor in the computer science and data science programs and was founding director of the MS in Analytics program at USF (which became the MS data science program). Please call me Terence or Professor (the use of “Terry” is a capital offense).

SPATIAL COORDINATES:

Class is held at 101 Howard in 1st floor classroom 155-156.
Exams are held in 154-156. Both sections meet together.
My office is room 607 @ 101 Howard, up on the mezzanine above the open area on the 5th floor.

TEMPORAL COORDINATES. Thu Oct 17 to Tue Dec 3.

Section 01: 10 - 11:50 AM Room 155-156
Section 02: 1:15 - 3:05 PM Room 155-156
Exams: Fri 5-6PM Nov 8; Fri 10-11:30AM Dec 6; Room 154-156

INSTRUCTION FORMAT. Class runs for 1 hour 50 minutes, 2 days/week. Instructor-student interaction during lecture is encouraged and we'll mix in mini-exercises / labs during class. All programming will be done in the Python 3 programming language, unless otherwise specified.

COURSE BOOK. The Mechanics of Machine Learning (in progress)

TARDINESS. Please be on time for class. It is a big distraction if you come in late.

ACADEMIC HONESTY. You must abide by the copyright laws of the United States and academic honesty policies of USF. You may not copy code from other current or previous students. All suspicious activity will be investigated and, if warranted, passed to the Dean of Sciences for action. Copying answers or code from other students or sources during a quiz, exam, or for a project is a violation of the university’s honor code and will be treated as such. Plagiarism consists of copying material from any source and passing off that material as your own original work. Plagiarism is plagiarism: it does not matter if the source being copied is on the Internet, from a book or textbook, or from quizzes or problem sets written up by other students. Giving code or showing code to another student is also considered a violation.

The golden rule: You must never represent another person’s work as your own.

If you ever have questions about what constitutes plagiarism, cheating, or academic dishonesty in my course, please feel free to ask me.

Note: Leaving your laptop unattended is a common means for another student to take your work. It is your responsibility to guard your work. Do not leave your printouts lying around or in the trash. All persons with common code are likely to be considered at fault.

USF policies and legal declarations

Students with Disabilities

If you are a student with a disability or disabling condition, or if you think you may have a disability, please contact USF Student Disability Services (SDS) for information about accommodations.

Behavioral Expectations

All students are expected to behave in accordance with the Student Conduct Code and other University policies.

Academic Integrity

USF upholds the standards of honesty and integrity from all members of the academic community. All students are expected to know and adhere to the University's Honor Code.

Counseling and Psychological Services (CAPS)

CAPS provides confidential, free counseling to student members of our community.

Confidentiality, Mandatory Reporting, and Sexual Assault

For information and resources regarding sexual misconduct or assault, visit the Title IX coordinator or USF's Callisto website.

Student evaluation

Artifact	Grade weight	Due date
Linear models	10%	Thu Oct 31, 11:59PM
Naive Bayes	8%	Mon Nov 11, 11:59PM
Decision trees	15%	Mon Nov 25, 11:59PM
Random Forest	12%	Sun Dec 8, 11:59PM
Exam 1	25%	Fri Nov 8, 5PM-6PM
Exam 2	30%	Fri Dec 6, 10AM-11:30AM

All projects will be graded with the specific input or tests given in the project description, so you understand precisely what is expected of your program. Consequently, projects will be graded in binary fashion: They either work or they do not. Each failed unit test costs a fixed amount, with no partial credit. The only exception is when your program does not run on the grader's or my machine because of some cross-platform issue. This is typically because a student has hardcoded some file name or directory into their program. In that case, we will take off a minimum of 10% instead of giving you a 0, depending on the severity of the mistake. Please go to GitHub and verify that the website has the proper files for your solution. That is what I will download for testing.
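The hardcoded-path failures described above usually come from an absolute path, such as a home directory, baked into the code. Here is a minimal sketch of one way to avoid that, assuming your program is allowed to take its data file as a command-line argument. The function name `resolve_data_path` and the file `data/train.csv` are hypothetical and are not taken from any project description.

```python
from pathlib import Path


def resolve_data_path(argv, default="data/train.csv"):
    """Return the data file path from the command line, else a relative default.

    `default` is a hypothetical file name used only for illustration.
    """
    # Use the first command-line argument if one was given
    if len(argv) > 1:
        return Path(argv[1])
    # Otherwise fall back to a path relative to the current directory,
    # never an absolute path tied to one person's machine
    return Path(default)


# Simulated argument lists; in a real script you would pass sys.argv
print(resolve_data_path(["myprog.py", "other.csv"]))
print(resolve_data_path(["myprog.py"]))
```

Because `pathlib` builds paths with the separator of whatever operating system runs the code, the same program should behave the same on the grader's machine as on yours.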

Each project has a hard deadline and only those projects working correctly before the deadline get credit. My grading script pulls from GitHub at the deadline. All projects are due at the start of class on the day indicated, unless otherwise specified.

I reserve the right to change projects until the day they are assigned.

Grading standards. I consider an A grade to be above and beyond what most students have achieved. A B grade is an average grade for a student or what you could call "competence" in a business setting. A C grade means that you either did not or could not put forth the effort to achieve competence. Below C implies you did very little work or had great difficulty with the class compared to other students.

Syllabus

Notebooks

There are a number of notebooks associated with the lecture slides.

The following notebook takes you through a number of important processes, which you are free to do at your leisure. Even if we haven't covered the topics in lecture, you can still get something out of the notebook.

Getting a sense of the training and testing procedure notebook

Getting started

The first lecture is an overview of the entire machine learning process:

Overview (Day 1)

Regularization for linear models

Review of linear models (slides) (Day 1)
- Lab: Plotting decision surfaces for linear models (Day 1)
Regularization of linear models L1, L2 (slides) (Day 2)
- Lab: Exploring regularization for linear regression (Day 2)
- Lab: Regularization for logistic regression (Day 2)
Gradient Descent optimization (slides) (Day 3)
- Lab: Gradient descent in action (Day 3)
(Regularization project)
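
To give a flavor of how the regularization and gradient descent topics above fit together, here is a minimal sketch of L2-regularized linear regression trained by plain gradient descent. It is not the project's reference solution: the function name `ridge_gradient_descent`, the parameter names (`lmbda`, `eta`, `n_iter`), the loss scaling, and the synthetic data are all assumptions made for illustration. It also assumes the features are standardized, so the unpenalized intercept can be set to the mean of `y`.

```python
import numpy as np


def ridge_gradient_descent(X, y, lmbda=0.0, eta=0.1, n_iter=1000):
    """Minimize mean squared error + lmbda * ||beta||^2 by gradient descent.

    Assumes the columns of X are standardized (zero mean, unit variance).
    The intercept is fixed at mean(y) and is not penalized.
    """
    n, p = X.shape
    beta0 = y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        residual = y - beta0 - X @ beta
        # Gradient of the MSE term plus gradient of the L2 penalty
        grad = (-2.0 * X.T @ residual) / n + 2.0 * lmbda * beta
        beta = beta - eta * grad
    return beta0, beta


# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

b0, b = ridge_gradient_descent(X, y, lmbda=0.0)
_, b_reg = ridge_gradient_descent(X, y, lmbda=1.0)
print("unregularized coefficients:", b)
print("regularized coefficients:  ", b_reg)
```

With `lmbda=0.0` the result should match ordinary least squares, and increasing `lmbda` shrinks the coefficient vector toward zero.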

Models

We will learn 3 models in depth for this course (Naive Bayes, decision trees, and random forests) and will examine k-nearest-neighbor (kNN) briefly.

Naive Bayes (slides) (Day 4)
- Lab: Naive Bayes by hand (Day 4)
- (Naive Bayes project)
Intro to non-parametric machine learning models (slides) (Day 5)
Decision trees (slides) (Day 5)
- Lab: Partitioning feature space (Day 6)
- Binary tree crash course (slides) (Day 6)
- Lab: Binary trees (Day 6)
- Training decision trees (slides) (Day 7)
- (Decision trees project)
Random Forests (slides) (Day 7)
- (Random Forest project)
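
As a small taste of the feature space partitioning idea behind decision trees, the sketch below searches one numeric feature for the threshold that best separates two classes. It uses weighted Gini impurity, which is one common criterion; the course project may specify a different one. The function names `gini` and `best_split` and the toy data are assumptions for illustration.

```python
import numpy as np


def gini(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)


def best_split(x, y):
    """Find the threshold on one feature x that minimizes weighted Gini impurity.

    Returns (threshold, impurity). The threshold is None when no split
    improves on the impurity of the unsplit labels.
    """
    best_t, best_loss = None, gini(y)
    values = np.unique(x)  # sorted unique feature values
    # Candidate thresholds: midpoints between consecutive unique values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        loss = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss


# Toy data: small x values are class 0, large x values are class 1
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
threshold, impurity = best_split(x, y)
print(threshold, impurity)  # 6.5 0.0
```

A full decision tree applies this kind of search to every feature, picks the best one, and then recurses on the left and right partitions.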

Mechanics

Preparing data for modeling (slides) (Day 8)
Basic feature engineering (slides) (Day 9)

Model assessment

Bias-variance trade-off (slides) (Day 10)
Model assessment (slides) (Day 11)

Model interpretation

Feature importance (slides) (Day 12)
Partial dependence

Unsupervised learning

Clustering (slides) (Day 13)
- k-means clustering
- Hierarchical clustering
- Breiman's trick for clustering with RFs

