In this repository we perform text classification and clustering experiments on news articles, and generate a word cloud for each article category.
The input consists of 2225 documents from a news site, corresponding to stories in five topical areas from 2004-2005.
Document Categories
- Business
- Entertainment
- Politics
- Sport
- Tech
The first line of each document is the title; the rest is the content of the article.
The whole procedure consists of:
- Create a data set of all documents
- Text pre-processing
  - Remove special characters, lower-case
  - Remove stopwords
  - Lemmatization
  - Stemming
  - Tokenization
- Generate word clouds
- Vectorization
- Classification and clustering
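The pre-processing steps above can be sketched as follows. This is a minimal illustration with a tiny hand-picked stopword set; the repository presumably uses a full stopword list and NLTK's lemmatizer/stemmer (an assumption), which are omitted here to keep the sketch self-contained:

```python
import re

# Tiny illustrative stopword set; a real run would use NLTK's or
# scikit-learn's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}

def preprocess(text):
    """Strip special characters, lower-case, tokenize, drop stopwords.
    Lemmatization and stemming (e.g. WordNetLemmatizer, PorterStemmer)
    would follow as extra per-token steps."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()  # special chars + lower case
    tokens = text.split()                             # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal
```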
I also implemented a KNN classifier using a max heap, but it was too slow for this data set.
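That implementation is not reproduced here; a sketch of the idea, using `heapq` with negated distances so the heap acts as a bounded max-heap and the farthest of the kept neighbours is evicted first:

```python
import heapq
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Classify x by majority vote among its k nearest training points,
    tracked with a size-k max-heap of (negated) squared distances."""
    heap = []  # entries: (-distance, label); heap[0] is the farthest kept
    for xi, yi in zip(X_train, y_train):
        d = sum((a - b) ** 2 for a, b in zip(xi, x))  # squared Euclidean
        if len(heap) < k:
            heapq.heappush(heap, (-d, yi))
        elif -d > heap[0][0]:                 # closer than current farthest
            heapq.heapreplace(heap, (-d, yi))
    labels = [lbl for _, lbl in heap]
    return Counter(labels).most_common(1)[0][0]
```

The heap keeps per-query work at O(n log k) instead of sorting all n distances, but a pure-Python loop over every training vector is still slow for thousands of high-dimensional documents, which matches the observation above.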
Classification
- Classifiers: MultinomialNB, SVM, Random Forest, KNN
- Vectorization: Bag of Words, TF-IDF
- Dimensionality reduction: PCA, SVD, and ICA
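A hedged sketch of one such classification pipeline using scikit-learn (the exact parameters and classifier choices in the repository are not shown, so these are illustrative): TF-IDF vectorization, SVD for dimensionality reduction, and a linear SVM.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

# TF-IDF -> TruncatedSVD -> linear SVM. TruncatedSVD works directly on
# the sparse TF-IDF matrix, unlike PCA, which needs a dense input.
# MultinomialNB would instead be paired with raw counts or TF-IDF,
# since it rejects the negative values SVD can produce.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # illustrative; use ~100+ in practice
    LinearSVC(),
)
```

With the full data set, `n_components` would typically be in the hundreds; 2 is only to keep the toy example runnable.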
Clustering
- Clusterer: K-means
- Vectorization: Bag of Words, TF-IDF, Word2vec
- Dimensionality reduction: PCA, SVD, and ICA
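The clustering side can be sketched the same way (again an illustrative scikit-learn pipeline, not the repository's exact code): TF-IDF, SVD, then K-means with one cluster per category.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Five clusters to mirror the five document categories.
km = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),  # illustrative dimensionality
    KMeans(n_clusters=5, n_init=10, random_state=0),
)
```

Cluster quality against the true categories could then be measured with label-agnostic scores such as adjusted Rand index or homogeneity.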