A complete Python text analytics package that allows users to search for a Wikipedia article, scrape it, conduct basic text analytics and integrate it to a data pipeline without writing excessive code.

kohjiaxuan/Wikipedia-Article-Scraper

Wikipedia-Article-Scraper

A Python package that lets users search for a Wikipedia article, scrape it, and run basic text analytics on it.
Feel free to download and use it (wikiscrape.py).
Linkedin Example

wikiscrape Package by KJX

This Python code can be used to search for a Wikipedia article and run text analytics on it.


To create a wikiscrape object, import the module and pass a search term:
import wikiscrape
var = wikiscrape.wiki('Search Article')

e.g. paris = wikiscrape.wiki('PArIs','Yes','french','Yes') searches for the article Paris. The 2nd argument ('Yes', default Yes) auto-formats the search term to proper case, the 3rd argument ('french', default No) searches the French Wikipedia, and the 4th argument ('Yes', default No) applies the NLTK stoplist for French.
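
As a rough illustration of what the first three arguments control, the sketch below turns a search term into an article URL. It is not taken from wikiscrape.py: the function name `build_article_url`, the `LANGUAGE_CODES` table, and the URL pattern are assumptions made for this example, and the real package may resolve terms and languages differently.

```python
from urllib.parse import quote

# Hypothetical mapping from language names to Wikipedia subdomains.
# The actual package may support more languages or use another lookup.
LANGUAGE_CODES = {"english": "en", "french": "fr", "german": "de", "spanish": "es"}


def build_article_url(term, proper_case="Yes", language="No"):
    """Return a Wikipedia article URL for a search term (illustrative only)."""
    title = term.strip()
    if proper_case == "Yes":
        title = title.title()  # 'PArIs' -> 'Paris'
    # 'No' (or any unknown language) falls back to the English Wikipedia.
    code = LANGUAGE_CODES.get(str(language).lower(), "en")
    # Wikipedia article paths use underscores in place of spaces.
    return f"https://{code}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


print(build_article_url("PArIs", "Yes", "french"))
```

Keeping the defaults at 'Yes' and 'No' mirrors the defaults described above, so a call with only a search term targets the English Wikipedia with proper-case formatting.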

Full capabilities of wikiscrape package include:

  1. Searches in multiple languages
  2. Suggests search terms if the search is ambiguous
  3. Gives a short summary (2 paragraphs) of the article if it is retrieved successfully
  4. Retrieves the full text or an exact number of paragraphs as string output for a data pipeline
  5. var.HELP() lists all available functions
  6. Basic error handling, including checking the data type of arguments and reverting to defaults if erroneous arguments are given
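
The error handling in the last item could look something like the sketch below, which checks an argument's type and silently reverts to a default when the value is unusable. The helper names `normalise_flag` and `normalise_count` and the default paragraph count of 2 are hypothetical and chosen for illustration; they are not the package's own functions.

```python
def normalise_flag(value, default="Yes"):
    """Return 'Yes' or 'No'; fall back to `default` for anything else."""
    if isinstance(value, str) and value.strip().lower() in ("yes", "no"):
        return value.strip().capitalize()
    return default


def normalise_count(value, default=2):
    """Return a positive paragraph count, or `default` if the value is invalid."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, int) and not isinstance(value, bool) and value > 0:
        return value
    return default


print(normalise_flag("yes"), normalise_flag(5), normalise_count("3"))
```

Reverting to a default rather than raising keeps a data pipeline running when a bad argument slips through, at the cost of hiding the mistake from the caller.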

Text Analytics capabilities include:

  1. A frequency counter for the most common words in the Wikipedia article (after omitting common English words, or words in the NLTK stoplist for foreign languages). It can also find the Nth % of most common words, where 0 <= N <= 100.
  2. A graph plot of the most common words in the Wikipedia article
  3. A graph plot of the years most frequently mentioned in the article, to show which years the article focuses on
  4. A summary of the total number of words and the total number of unique words after applying the stoplist of common words.
  5. Analytics functions available on a wikiscrape object are commonwords, commonwordspct, plotwords, plotyear, totalwords, summary, gettext.

    Refer to the images in the repository for examples. The earliest image, 'bar.png', made 4 months ago, was the initial design of the bar chart for word frequency.
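
The word counting described in the list above can be sketched with `collections.Counter`. This is a minimal illustration rather than the package's implementation: the function names `word_counts` and `top_percent`, the tiny `STOPLIST`, and the tokenising regex are all assumptions. In particular, reading "Nth %" as "the top N percent of distinct words, ranked by frequency" is only one plausible interpretation.

```python
import math
import re
from collections import Counter

# Tiny illustrative stoplist; the package ships a much longer one.
STOPLIST = {"the", "a", "of", "and", "is", "in", "to"}


def word_counts(text, stoplist=STOPLIST):
    """Count lowercase words in `text`, skipping words in the stoplist."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stoplist)


def top_percent(counts, pct):
    """Return the top `pct` percent of distinct words by frequency."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    keep = math.ceil(len(counts) * pct / 100)
    return counts.most_common(keep)


sample = "The city of Paris is the capital of France. Paris is large and Paris is old."
counts = word_counts(sample)
print(sum(counts.values()), len(counts))  # total words and unique words after the stoplist
print(top_percent(counts, 50))
```

The two numbers printed first correspond to the kind of totals the summary item describes: words remaining after the stoplist, and how many of them are distinct.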

Libraries used: requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist)
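
To show how two of these libraries, `re` and `collections`, might combine for the year analysis, here is a hedged sketch of counting years mentioned in a text. The pattern (four-digit numbers from 1000 to 2099) and the function name `year_counts` are assumptions for this example; the package's plotyear() may detect years differently, and the plotting step with matplotlib is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Assumed definition of a "year": a standalone four-digit number, 1000-2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def year_counts(text):
    """Count how often each year-like number appears in `text`."""
    return Counter(int(year) for year in YEAR_PATTERN.findall(text))


sample = "Founded in 1889, rebuilt in 1944 and again in 1889. Population 2148000 in 2019."
print(year_counts(sample).most_common(3))
```

The word boundaries in the pattern keep longer numbers such as 2148000 from being mistaken for years.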

The package already includes a comprehensive built-in stoplist to remove common words before text analytics.

Updates:

  1. 26 May 2019 - Added plotyear() function to plot the most frequent years mentioned
  2. 9 June 2019 - Added markdown for explanation and added comments in the code for understanding
  3. For any questions or suggestions, please contact me at my GitHub account - https://github.com/kohjiaxuan/
