Tf-idf is a well-known NLP technique for vectorizing documents. I decided to look into it more deeply to see how it works, what it can and cannot do, and which kinds of tasks it suits.
In this notebook, I do two main things:
1 - compare a bag-of-words vectorization with a tf-idf vectorization on a text classification task using the 20newsgroups dataset
2 - build code to get the most and least important words in a document according to the tf-idf weights
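The comparison in item 1 can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function name `compare_vectorizers`, the default vectorizer settings, and the fixed `random_state` are all assumptions, and the SGD classifier is chosen only because the repository description mentions tf-idf and SGD on scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Train the same linear classifier on two vectorizations and
    return the test accuracy of each (hypothetical helper)."""
    vectorizers = {
        "bag_of_words": CountVectorizer(),  # raw term counts
        "tfidf": TfidfVectorizer(),         # counts reweighted by idf
    }
    scores = {}
    for name, vectorizer in vectorizers.items():
        # Same classifier and seed for both, so only the features differ.
        model = make_pipeline(vectorizer, SGDClassifier(random_state=0))
        model.fit(train_texts, train_labels)
        predictions = model.predict(test_texts)
        scores[name] = accuracy_score(test_labels, predictions)
    return scores
```

On the real data, the inputs could come from `sklearn.datasets.fetch_20newsgroups`, using the `data` and `target` attributes of the `subset="train"` and `subset="test"` splits.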
This project was very instructive, particularly the second part, during which I learned a lot about numpy arrays and how to index into them (with the nonzero() and argsort() functions). I also learned that tf-idf is especially well suited to text classification and information retrieval.
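As a rough illustration of the indexing involved, here is one way `nonzero()` and `argsort()` can be combined to rank the words of a single document. It is a sketch under assumptions, not the notebook's code: the helper name `top_and_bottom_terms` is hypothetical, and it expects a dense 1-D array of tf-idf weights for one document (for example, one row of a `TfidfVectorizer` output converted with `.toarray().ravel()`) together with the matching feature names.

```python
import numpy as np


def top_and_bottom_terms(weights, feature_names, k=3):
    """Return (most_important, least_important) terms of one document.

    Only terms that actually occur in the document (non-zero weight)
    are ranked; absent terms are ignored. Hypothetical helper.
    """
    weights = np.asarray(weights)
    feature_names = np.asarray(feature_names)

    # Column indices of the terms present in this document.
    present = np.nonzero(weights)[0]

    # argsort gives positions that sort the present weights ascending;
    # indexing `present` with them maps back to vocabulary indices.
    ranked = present[np.argsort(weights[present])]

    least = feature_names[ranked[:k]].tolist()
    most = feature_names[ranked[::-1][:k]].tolist()
    return most, least


# Toy weights for a six-word vocabulary (made-up numbers).
example_weights = [0.0, 0.5, 0.1, 0.0, 0.8, 0.3]
example_names = ["a", "b", "c", "d", "e", "f"]
print(top_and_bottom_terms(example_weights, example_names, k=2))
# prints (['e', 'b'], ['c', 'f'])
```

Filtering with `nonzero()` first matters because most entries of a tf-idf row are zero; without it, the "least important" words would simply be words that do not appear in the document at all.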
vrivier/document-classification
Document classification using tf-idf and SGD on scikit-learn