Mihir Singh edited this page Mar 28, 2020 · 10 revisions

Data

  • IDE: Visual Studio Code 1.43.2.0 and PowerShell 7.0.0-preview.2
  • Language: Python 3.7
  • Method: Web scraping
  • Tools: Selenium and PhantomJS Web Driver
The dataset consists of articles from popular blog sites, specifically their title, body, and tags. All of the data was scraped using the web scraper, which was used to generate a corpus of nearly 19,000 articles along with their titles and associated tags.

Language Processing

  • IDE: Visual Studio Code 1.43.2.0 and PowerShell 7.0.0-preview.2
  • Language: Python 3.7
  • Tools: NLTK, Sklearn's TFIDF Feature Extraction and Pickling
A language processing model was trained on the text features generated from our corpus. Once trained, it can transform any new text and predict (or, more accurately, suggest) tags that the text can be associated with. The set of tags it can suggest is limited, and the model's vocabulary depends entirely on the dataset, which is also limited. To overcome this, a second, more basic model assists it.
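The feature extraction and pickling steps listed under Tools could look roughly like the sketch below. It is illustrative only: the three-document `corpus`, the `stop_words="english"` setting, and the variable names are assumptions, and the tag-prediction model that sits on top of these features is not shown because the page does not say which one was used.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (assumption); the real one has nearly 19,000 articles.
corpus = [
    "Scraping blog articles with Selenium and Python",
    "Training a text model with TF-IDF features",
    "Suggesting tags for new blog articles",
]

# Learn the vocabulary and IDF weights, and turn the corpus into a sparse matrix.
vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(corpus)

# Pickle the fitted vectorizer so it can be reloaded later without refitting.
blob = pickle.dumps(vectorizer)
restored = pickle.loads(blob)

# New text is transformed with the same vocabulary; unseen words are ignored.
new_features = restored.transform(["A new article about Python scraping"])
```

Because the restored vectorizer only knows words seen during fitting, any term outside the corpus vocabulary contributes nothing to `new_features`, which is the limitation described above.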
A model based on the frequency distribution of words in the new text finds the most frequently occurring terms and cross-references them with the tags available in our dataset. If the recurring terms or topics are relevant as tags, they are suggested as well.
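A minimal sketch of this frequency-based fallback, using only the standard library, is shown below. The function name `frequent_tag_candidates`, the small `STOPWORDS` set, the `top_n` cutoff, and the rule that a term is "relevant" when it exactly matches a known tag are all assumptions made for illustration; the project itself lists NLTK among its tools and may tokenize and filter differently.

```python
import re
from collections import Counter

# Small illustrative stop-word list (assumption); a real list would be larger.
STOPWORDS = {"a", "an", "and", "the", "with", "you", "from", "to", "of", "in", "is"}


def frequent_tag_candidates(text, known_tags, top_n=10):
    """Return the most frequent terms in `text` that are also known tags."""
    # Lowercase the text and split it into alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Count every token that is not a stop word.
    counts = Counter(word for word in words if word not in STOPWORDS)
    # Keep the top_n most common terms, then filter by the dataset's tags.
    top_terms = [term for term, _ in counts.most_common(top_n)]
    return [term for term in top_terms if term in known_tags]


article = (
    "Python makes web scraping easy. Scraping with Python and Selenium "
    "lets you collect data. Data from blogs feeds the model."
)
dataset_tags = {"python", "scraping", "data", "javascript"}
suggested = frequent_tag_candidates(article, dataset_tags, top_n=5)
```

Here `suggested` contains `python`, `scraping`, and `data`, since each occurs twice and appears in `dataset_tags`; `javascript` is a known tag but never occurs in the text.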
Lastly, any relevant tags or topics present in the title of the article are added to the suggestions as well.
A collection of these tags is suggested to the user through the Browser Extension.
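The title check can be sketched as a whole-word lookup of known tags in the normalized title. The helper name `tags_in_title` is hypothetical, and the sketch assumes tags are made of letters, digits, and spaces; tags containing punctuation would need different handling.

```python
import re


def tags_in_title(title, known_tags):
    """Return known tags that appear as whole words or phrases in the title."""
    # Normalize the title to lowercase alphanumeric tokens joined by spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", title.lower()))
    # Pad with spaces so a tag only matches on word boundaries.
    padded = f" {normalized} "
    return sorted(tag for tag in known_tags if f" {tag.lower()} " in padded)


title_tags = tags_in_title(
    "A Beginner's Guide to Machine Learning in Python",
    {"machine learning", "python", "java", "learn"},
)
```

Padding with spaces keeps `learn` from matching inside `learning`, while still allowing multi-word tags such as `machine learning` to match.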