
Common ML Algorithms


ML Classification Algorithms

A common way to get started with machine learning is to use off-the-shelf classification algorithms. Many of the popular ones are implemented in the scikit-learn library, which can be installed through the pip package manager.

Using an algorithm (a sketch of the full workflow follows these steps):

  1. Create a classifier and optimize its hyperparameters.
  2. Train the classifier on the training data.
  3. Test the classifier on the held-out test data.
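
A minimal sketch of those three steps, assuming scikit-learn and its bundled iris dataset (any scikit-learn classifier follows the same create/fit/score pattern):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset and hold out part of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 1. Create the classifier and set its hyperparameters.
clf = KNeighborsClassifier(n_neighbors=5)

# 2. Train the classifier on the training data.
clf.fit(X_train, y_train)

# 3. Test the classifier on the held-out test data.
print("Test accuracy:", clf.score(X_test, y_test))
```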

K Nearest Neighbors

Finds the k nearest pre-classified points to a new point of data, measured by Euclidean distance, and classifies the new point as the majority class among those neighbors.

  • Pros: Simple algorithm. Makes no assumptions about the underlying data distribution. Evolves as new data is added. Only one main hyperparameter (k) to adjust, as tuned in the sketch below.
  • Cons: Very sensitive to outliers. Sensitive to an imbalance in the number of data points per class, since majority voting favors the more frequent class.
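
Since k is the one main hyperparameter, a common approach is to search over candidate values. A sketch assuming scikit-learn's GridSearchCV and the iris dataset from above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Each point is classified by a majority vote among its k nearest
# neighbors (Euclidean distance by default); try several values of k.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)
```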

Support Vector Classifier (SVC)

Finds a hyperplane that separates the classes with the maximum possible perpendicular distance between the nearest points from each class. This distance is called the margin and the nearest points are called support vectors.

  • Pros: Has a regularization hyperparameter (C) that helps with avoiding overfitting.
  • Cons: The right hyperparameters, such as the kernel type, have to be found.
  • Kernel Types: Kernels define the shape of the decision boundary (compared in the sketch after this list).
    • Linear: The decision boundary is a straight hyperplane.
    • Poly: The decision boundary is a polynomial surface.
    • RBF: Maps the data into a space of Gaussian (radial basis) functions, giving highly flexible boundaries.
  • Comparison:
    • Learning time: Linear < Poly < RBF
    • Fitting Ability: Linear < Poly < RBF
    • Overfitting Risk: Linear < Poly < RBF
    • Underfitting Risk: RBF < Poly < Linear
    • Number of Kernel Hyperparameters: Linear (0) < RBF (1: gamma) < Poly (3: degree, gamma, coef0)
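
A sketch comparing the three kernels on the same data, assuming the iris dataset; the kernel name and its hyperparameters are passed directly to scikit-learn's SVC:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# C is the regularization hyperparameter mentioned above; gamma and
# degree apply only to the RBF and poly kernels.
for clf in (SVC(kernel="linear", C=1.0),
            SVC(kernel="poly", degree=3, gamma="scale", C=1.0),
            SVC(kernel="rbf", gamma="scale", C=1.0)):
    scores = cross_val_score(clf, X, y, cv=5)
    print(clf.kernel, "mean accuracy:", scores.mean())
```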

Decision Trees

A flowchart-like tree structure that branches to illustrate the outcomes of different decisions.

  • Pros: Easy to understand and interpret. Prediction is fast: O(d), where d is the depth of the tree.
  • Cons: Unstable: small variations in the data can produce a very different tree.
  • Metrics for Splitting the Tree (both shown in the sketch below):
    • Gini - Measures the probability of a random sample being classified incorrectly.
    • Entropy - Splits the tree based on information gain.
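
A sketch of both splitting criteria, assuming scikit-learn's DecisionTreeClassifier and the iris dataset; max_depth is an illustrative cap on tree depth to reduce overfitting:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "gini" measures the chance a random sample is misclassified;
# "entropy" chooses the split with the largest information gain.
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion,
                                 max_depth=3, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(criterion, "mean accuracy:", scores.mean())
```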

Logistic Regression

Uses the logistic function, also called the sigmoid function, to turn a weighted sum of the features into a probability, and makes predictions from that probability.

  • Pros: Easy and fast prediction of classes, with a probability attached to each prediction.
  • Cons: The basic form handles only two classes; more classes require the multinomial variant or a one-vs-rest scheme.
  • Types:
    • Binary Logistic Regression - Two classes, hence binary.
    • Multinomial Logistic Regression - Used for classifying more than two classes (sketched below).
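
A sketch of the multiclass case, assuming the iris dataset (three classes, so scikit-learn fits the multinomial form); predict_proba exposes the probabilities the predictions are based on:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# max_iter is raised so the solver converges on this data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Per-class probabilities for the first test sample, then the prediction.
print(clf.predict_proba(X_test[:1]))
print(clf.predict(X_test[:1]))
```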

Naive Bayes

A classifier based on Bayes' theorem that treats each feature as independent of the others (the "naive" assumption). It uses these per-feature probabilities to predict a class.

  • Pros: Easy and fast prediction of classes.
  • Cons: Unable to make a prediction involving a feature value that was not observed in the training set (the zero-frequency problem) unless smoothing is applied. Assumes that features are independent of one another, which rarely holds in practice.
  • Types:
    • Gaussian Naive Bayes - Assumes continuous features follow a normal (Gaussian) distribution; see the sketch after this list.
    • Multinomial Naive Bayes - Assumes features are discrete counts, such as word frequencies.
    • Bernoulli Naive Bayes - Assumes that all features have binary values.
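
A sketch of the Gaussian variant, assuming the iris dataset, whose continuous features fit the normal-distribution assumption; MultinomialNB and BernoulliNB are drop-in replacements for count-valued and binary features:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each feature is modeled as an independent Gaussian per class.
clf = GaussianNB()
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```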

Up Next

Check out the Google Doc for better formatting, colors, images, diagrams, and more.