A Python package that lets users search for a Wikipedia article, scrape it, and run text analytics on its contents.
Feel free to download the script file (wikiscrape.py) for personal use!
To create a wikiscrape object, import the module and call wikiscrape.wiki with the search term:
import wikiscrape
var = wikiscrape.wiki('Search Article')
For example, paris = wikiscrape.wiki('PArIs','Yes','french','Yes') searches for the article Paris. The 2nd argument ('Yes', which is the default) auto-formats the search term to proper case, the 3rd argument ('french', default 'No') searches French Wikipedia, and the 4th argument ('Yes', default 'No') applies the NLTK stoplist for French.
- Searches in multiple languages
- Gives suggestions on search terms if the search is ambiguous
- Gives a short summary (2 paragraphs) of the article if it is retrieved successfully
- Retrieves the full text or an exact number of paragraphs as a string, for use in a data pipeline
- var.HELP() for the full list of functions available
- Basic error handling, including checking the data type of arguments and reverting to defaults if erroneous arguments are given
- A frequency counter for the most common words in the Wikipedia article (after omitting common English words, or the NLTK stoplist for foreign languages). It can also return the top N% of most common words, where 0 <= N <= 100.
- A graph plot of the most common words in the Wikipedia article
- A graph plot of the years mentioned most frequently in the article, to show which years the article focuses on
- A summary of the total number of words and the total number of unique words after applying the stoplist of common words
- Analytics functions available in wikiscrape object are commonwords, commonwordspct, plotwords, plotyear, totalwords, summary, gettext.
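The word and year counting described above can be sketched with the standard library alone. This is a minimal illustration of the idea, not the package's actual implementation: the names `count_words`, `top_percent` and `count_years`, the tiny `STOPLIST`, the regular expressions, and the choice to treat only 1000 to 2099 as years are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist for illustration only.
STOPLIST = {"the", "a", "an", "of", "in", "and", "is", "to", "was", "it"}

# Runs of Unicode letters; digits never match, so years stay out of word counts.
WORD_PATTERN = re.compile(r"[^\W\d_]+")
# Assumption: a "year" is a standalone 4-digit number from 1000 to 2099.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def count_words(text, stoplist=STOPLIST):
    """Count lowercased words, skipping anything in the stoplist."""
    words = WORD_PATTERN.findall(text.lower())
    return Counter(w for w in words if w not in stoplist)


def top_percent(counts, pct):
    """Return the most common pct% of distinct words, where 0 <= pct <= 100."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    keep = int(len(counts) * pct / 100)  # rounds down
    return counts.most_common(keep)


def count_years(text):
    """Count how often each year is mentioned."""
    return Counter(YEAR_PATTERN.findall(text))


sample = (
    "Paris is the capital of France. In 1889 the Eiffel Tower opened in Paris. "
    "Paris hosted the Olympics in 1900 and 1924, and again in 2024. "
    "The 1900 games were in Paris."
)
word_counts = count_words(sample)
print(word_counts.most_common(3))
print(top_percent(word_counts, 10))
print(count_years(sample).most_common(1))
```

Keeping digits out of the word pattern mirrors the behaviour noted in the changelog, where years are excluded from the word frequency count and handled by a separate year plot.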
Refer to the images in the repository for examples. The earliest image, 'bar.png', made 4 months ago, was the initial design of the bar chart for word frequency.
Libraries used: requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist)
The package already has a comprehensive built-in stoplist that removes common words before text analytics.
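Since nltk is optional, one way to combine it with a built-in list is to try NLTK first and fall back when it is unavailable. The sketch below is an assumption about how such a fallback could look, not the package's code: `load_stoplist` and `FALLBACK_STOPLIST` are hypothetical names, and the fallback list is deliberately tiny and English-only.

```python
# Hypothetical fallback list for illustration; not the package's built-in stoplist.
FALLBACK_STOPLIST = {"the", "a", "an", "of", "in", "and", "is", "to"}


def load_stoplist(language="english"):
    """Return a set of stop words, preferring NLTK when it is usable."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words(language))
    except (ImportError, LookupError, OSError):
        # ImportError: nltk is not installed.
        # LookupError: the stopwords corpus has not been downloaded.
        # OSError: NLTK has no stop word file for this language.
        # Note that the fallback is English regardless of the requested language.
        return set(FALLBACK_STOPLIST)


stoplist = load_stoplist("english")
print(len(stoplist))
```

If NLTK is installed but the corpus is missing, running `nltk.download('stopwords')` once makes the NLTK branch available.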
- 26 May 2019 - Added the plotyear() function to plot the most frequently mentioned years, and removed years from the word frequency count (commonwords and commonwordspct functions).
- 9 June 2019 - Added markdown explanations and code comments to aid understanding.
- 13 June 2019 - Updated the documentation for the plotyear, plotwords, summary and gettext functions in .HELP().
- 25 November 2019 - Update coming very soon, stay tuned!
- For any questions or suggestions, please contact me through my LinkedIn account - https://www.linkedin.com/in/kohjiaxuan/