Spam detection is one of the major applications of Machine Learning on the web today. Nearly all of the major email service providers have spam detection systems built in that automatically classify spam as 'Junk Mail'.
Here we will be using the Naive Bayes algorithm to create a model that can classify SMS messages as spam or not spam, based on the training we give to the model.
Identifying spam messages is a binary classification problem, as each message is classified as either 'Spam' or 'Not Spam' and nothing else. It is also a supervised learning problem, as we will be feeding a labelled dataset into the model, which it can learn from in order to make future predictions.
This project has been broken down into the following steps:
Step 1.1: Understanding our dataset
Step 1.2: Data Preprocessing
Step 2.1: Bag of Words (BoW)
Step 2.2: Implementing BoW from scratch
Step 2.3: Implementing Bag of Words in scikit-learn
Step 3.1: Training and testing sets
Step 3.2: Applying Bag of Words processing to our dataset
Step 4.1: Bayes Theorem implementation from scratch
Step 4.2: Naive Bayes implementation from scratch
Step 5: Naive Bayes implementation using scikit-learn
Step 6: Evaluating our model
Step 7: Conclusion
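Before going through the steps one by one, it may help to see the whole flow in miniature. The sketch below is only a rough illustration using the Python standard library: the four training messages and two held-out messages are invented for this example (they are not from the real dataset), the function names `tokenize`, `train` and `predict` are hypothetical, and the crude whitespace tokenizer and add-one (Laplace) smoothing are assumptions made to keep it short. It counts words per label (a Bag of Words), scores a new message with Naive Bayes, and measures accuracy on the held-out messages.

```python
import math
from collections import Counter

# Toy labelled messages, invented for illustration (not the real dataset).
TRAIN = [
    ("spam", "win a free prize now"),
    ("spam", "free entry claim your prize"),
    ("ham", "are we still meeting for lunch"),
    ("ham", "call me when you are free"),
]
HELD_OUT = [
    ("spam", "claim your free prize"),
    ("ham", "lunch meeting now"),
]


def tokenize(text):
    """Lowercase and split on whitespace: a deliberately crude Bag of Words."""
    return text.lower().split()


def train(examples):
    """Count messages per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = {}
    for label, text in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior under Naive Bayes."""
    label_counts, word_counts, vocab = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: the fraction of training messages carrying this label.
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen word/label pairs above zero.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
predictions = [predict(text, model) for _, text in HELD_OUT]
accuracy = sum(
    predicted == actual for predicted, (actual, _) in zip(predictions, HELD_OUT)
) / len(HELD_OUT)
print(predictions, accuracy)
```

The steps that follow work with the real labelled SMS dataset and build each of these pieces properly, first from scratch and then with scikit-learn.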