Spam Text Classification Project => Full Code
The project was conducted on the Kaggle platform.
This project uses the SMS Spam Collection Dataset.
- A set of tagged SMS messages collected for SMS spam research.
- A total of 5574 English messages are provided, each labeled spam (1) or ham (0).
- Remove non-text characters such as emojis, numbers, and dots.
- Make words lowercase: otherwise the machine treats the same word in different cases as different words.
- Stopword removal: stopwords are words that carry little meaning for text classification (ex: the, we, a, will).
- Stemming: the Bag of Words model I use in this project is driven by how often words occur, so several forms of a word with the same meaning (ex: runnable, running, is run) are reduced to the same stem.
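The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `STOPWORDS` set is a tiny hand-picked list, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer (it would not, for instance, reduce "runnable" to "run").

```python
import re

# Tiny illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"the", "we", "a", "will", "you", "is", "be"}


def naive_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Keep ASCII letters only: drops emojis, numbers, and punctuation.
    text = re.sub(r"[^a-zA-Z]+", " ", text)
    # Lowercase so that "FREE" and "free" count as the same word.
    tokens = text.lower().split()
    # Drop stopwords, then stem what is left.
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("We WILL be calling you 2 times... FREE prizes!!! \U0001F389"))
# → ['call', 'time', 'free', 'prize']
```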
- Get all the words in all texts, count the number of occurrences of each word, and select the words that occur most frequently (cluster words).
- Assuming that a total of 1000 cluster words are selected, the number of occurrences of each of these 1000 words in a message becomes a feature of the classification problem.
- Classification proceeds by training the classifier on the extracted features.
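To make the Bag of Words idea above concrete, here is a small hand-rolled sketch using only the standard library. The function names `build_vocabulary` and `to_feature_vector` and the toy token lists are made up for illustration; the vocabulary size is shrunk from 1000 to 3 so the result is easy to inspect.

```python
from collections import Counter


def build_vocabulary(token_lists, vocab_size=1000):
    """Pick the vocab_size most frequent words across all messages."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [word for word, _ in counts.most_common(vocab_size)]


def to_feature_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one message."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Toy, already-preprocessed messages (not taken from the dataset).
docs = [
    ["free", "prize", "call", "free"],
    ["call", "me", "later"],
    ["free", "call"],
]

vocab = build_vocabulary(docs, vocab_size=3)
features = [to_feature_vector(d, vocab) for d in docs]
print(vocab)
print(features)
```

Each message becomes a fixed-length list of counts, one entry per cluster word, which is the form a classifier expects.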
- Use CountVectorizer provided by scikit-learn.
- For this project, I used an SVM, which is said to work best with BoW (Bag of Words).
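A minimal sketch of how CountVectorizer and an SVM might be combined in scikit-learn is shown below. The choice of `LinearSVC` is an assumption, since the text does not say which SVM variant or settings were used, and the four messages are invented toy data rather than rows from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up toy messages for illustration only; not taken from the dataset.
texts = [
    "free prize call now",
    "win free cash prize",
    "are we still meeting for lunch",
    "call me when you get home",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# max_features=1000 keeps at most the 1000 most frequent words as features.
# LinearSVC is an assumed stand-in for the SVM used in the project.
model = make_pipeline(CountVectorizer(max_features=1000), LinearSVC())
model.fit(texts, labels)

predictions = model.predict(["free cash now", "see you at lunch"])
print(predictions)
```

With the real data, the messages would normally be split into training and test sets before fitting, so that accuracy is measured on messages the classifier has not seen.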