sklearn-tfidf-sort-words

Tf-idf is a well-known NLP technique for vectorizing documents. I decided to look into it more closely to see how it works, what it can and cannot do, and which kinds of tasks it suits.

In this notebook, I mainly do two things:
1 - compare a bag-of-words vectorization with a tf-idf vectorization on a text classification task using the 20newsgroups dataset
2 - build code to get the most and least important words in a document according to their tf-idf weights
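
The comparison in the first point can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `compare_vectorizers`, the choice of `LogisticRegression` as classifier, and the tiny inline corpus are all assumptions made so the example runs offline. The notebook itself uses 20newsgroups, which scikit-learn exposes through `sklearn.datasets.fetch_20newsgroups`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def compare_vectorizers(train_texts, train_labels, test_texts, test_labels):
    """Fit the same classifier on bag-of-words and tf-idf features.

    Hypothetical helper: returns a dict mapping vectorizer name to test accuracy.
    """
    scores = {}
    for name, vectorizer in [("bag_of_words", CountVectorizer()),
                             ("tfidf", TfidfVectorizer())]:
        # Same classifier for both runs, so only the vectorization differs.
        model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        model.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    return scores


# Toy stand-in corpus (assumption); swap in fetch_20newsgroups(subset="train")
# and fetch_20newsgroups(subset="test") to reproduce the real setting.
train_texts = [
    "the rocket launch reached orbit",
    "nasa plans a new orbit mission",
    "the team scored a late goal",
    "the goalie saved the hockey game",
]
train_labels = ["space", "space", "hockey", "hockey"]
test_texts = ["a rocket mission to orbit", "a hockey goal in the game"]
test_labels = ["space", "hockey"]

scores = compare_vectorizers(train_texts, train_labels, test_texts, test_labels)
print(scores)
```

Keeping the classifier fixed and changing only the vectorizer means any difference in accuracy can be attributed to the feature weighting. On a corpus this small the two scores say nothing meaningful; the point is only the shape of the experiment.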

This project was very instructive, particularly the second part, in which I learned a lot about NumPy arrays and how to index them (with the nonzero() and argsort() functions). I also learned that tf-idf is especially well suited to text classification and information retrieval.
