Text Analysis

Data

IDE: Visual Studio Code 1.43.2.0 and PowerShell 7.0.0-preview.2
Language: Python 3.7
Method: Webscraping
Tools: Selenium and PhantomJS Web Driver

The dataset consists of articles from popular blog sites, specifically their title, body and tags. All the data was scraped using the web scraper, which was used to generate a corpus of nearly 19,000 articles with their titles and associated tags.

Language Processing

IDE: Visual Studio Code 1.43.2.0 and PowerShell 7.0.0-preview.2
Language: Python 3.7
Tools: NLTK, scikit-learn's TF-IDF feature extraction, and pickling

A language processing model was trained on the text features generated from our corpus. Given any new text, it transforms the text into the same features and predicts, or more accurately suggests, tags that the text can be associated with. The set of tags it can suggest is limited, and the model's vocabulary depends entirely on the dataset, which is also limited. To overcome this, a more basic model assists it.
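The TF-IDF step can be sketched roughly as follows. This is a minimal illustration, not the project's actual model: the toy `articles` list, the `train` and `suggest_tags` helpers, and the choice to borrow tags from the most similar training article are all assumptions made for the example. Only `TfidfVectorizer`, `linear_kernel` and `pickle` are real library calls.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical toy corpus standing in for the scraped articles: (body, tags).
articles = [
    ("Training neural networks with Python and gradient descent",
     ["machine-learning", "python"]),
    ("Scraping web pages with Selenium browser automation",
     ["web-scraping", "python"]),
    ("Baking sourdough bread at home with a starter",
     ["cooking"]),
]


def train(articles):
    """Fit a TF-IDF vectorizer on the article bodies."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([body for body, _ in articles])
    return vectorizer, matrix


def suggest_tags(text, vectorizer, matrix, articles, top_n=1):
    """Suggest the tags of the top_n most similar training articles.

    Assumption: similarity-based lookup is used here only for illustration.
    """
    query = vectorizer.transform([text])
    # TF-IDF rows are L2-normalised, so the dot product is cosine similarity.
    scores = linear_kernel(query, matrix).ravel()
    tags = []
    for i in scores.argsort()[::-1][:top_n]:
        if scores[i] > 0:  # skip articles with no shared vocabulary
            for tag in articles[i][1]:
                if tag not in tags:
                    tags.append(tag)
    return tags


vectorizer, matrix = train(articles)

# Pickling lets the fitted vectorizer be saved and reloaded later.
restored = pickle.loads(pickle.dumps(vectorizer))

suggested = suggest_tags("Automation of a browser with Selenium",
                         restored, matrix, articles)
print(suggested)
```

A text that shares no vocabulary with the corpus yields no suggestions here, which illustrates the limitation described above.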
The more basic model is based on the frequency distribution of words in the new text: it calculates the most frequently occurring terms and cross-references them with the tags available in our dataset. If the recurring terms or topics are relevant as tags, they are suggested as well.
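A frequency-based fallback of this kind might look like the sketch below. It uses `collections.Counter` from the standard library as a stand-in for a frequency distribution; the `STOPWORDS` set, the `KNOWN_TAGS` set and the `frequent_tags` helper are hypothetical names chosen for the example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list and tag vocabulary.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}
KNOWN_TAGS = {"python", "selenium", "cooking"}


def frequent_tags(text, known_tags, top_n=5):
    """Return the most frequent words of text that are also known tags."""
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS]
    # most_common orders by count; ties keep first-seen order.
    most_common = [word for word, _ in Counter(words).most_common(top_n)]
    return [word for word in most_common if word in known_tags]


sample = ("Python is great. Python and Selenium make scraping easy, "
          "and Python is popular.")
print(frequent_tags(sample, KNOWN_TAGS))
```

Frequent terms that are not in the tag set, such as "great" in the sample, are dropped by the cross-reference step.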
Lastly, any relevant tags or topics present in the title of the article are added to the mix.
A collection of these tags is suggested to the user through the browser extension.
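Combining the three sources could be sketched as below. The helper names `title_tags` and `merge_suggestions`, the `KNOWN_TAGS` set and the sample tag lists are assumptions for illustration; the project may merge or rank suggestions differently.

```python
import re

# Hypothetical tag vocabulary taken from the dataset.
KNOWN_TAGS = {"python", "selenium", "nlp"}


def title_tags(title, known_tags):
    """Return known tags that appear as words in the title, in order."""
    found = []
    for word in re.findall(r"[a-z]+", title.lower()):
        if word in known_tags and word not in found:
            found.append(word)
    return found


def merge_suggestions(*tag_lists):
    """Merge tag lists, dropping duplicates and keeping first-seen order."""
    merged = []
    for tags in tag_lists:
        for tag in tags:
            if tag not in merged:
                merged.append(tag)
    return merged


# Example inputs standing in for the outputs of the two models.
model_tags = ["web-scraping", "python"]
frequency_tags = ["python", "selenium"]
from_title = title_tags("Getting Started with Selenium and NLP", KNOWN_TAGS)

suggestions = merge_suggestions(model_tags, frequency_tags, from_title)
print(suggestions)
```

Keeping first-seen order means tags from the trained model come first, followed by the frequency-based and title-based ones.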
