
Wikipedia-Article-Scraper

A Python package that lets users search for a Wikipedia article, scrape it and run text analytics on the result.
Please feel free to download and use it (wikiscrape.py)!

wikiscrape Package by KJX

This Python code can be used to search for a Wikipedia article and run text analytics on it.

To create a wikiscrape object and store it in a variable, simply type:
import wikiscrape
var = wikiscrape.wiki('Search Article')

e.g. paris = wikiscrape.wiki('PArIs','Yes','french','Yes') searches for the article Paris with these settings: the 2nd argument 'Yes' auto-formats the search term to proper case (default Yes), the 3rd argument 'french' searches French Wikipedia (default No), and the 4th argument 'Yes' applies the NLTK stoplist for French (default No).
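As a rough illustration of what the 2nd and 3rd arguments do, the sketch below title-cases a search term and builds a language-specific article URL. The helper name build_wiki_url and the language-to-subdomain table are assumptions made for this example; they are not taken from wikiscrape.py, which may handle both steps differently.

```python
from urllib.parse import quote

# Hypothetical mapping from language names to Wikipedia subdomains (illustrative only).
LANG_CODES = {"english": "en", "french": "fr", "german": "de", "spanish": "es"}


def build_wiki_url(term, autoformat="Yes", language="No"):
    """Return an article URL for `term` (a sketch, not the package's own code)."""
    if autoformat == "Yes":
        term = term.strip().title()  # 'PArIs' -> 'Paris'
    # Fall back to English for 'No' or any language name not in the table.
    code = LANG_CODES.get(str(language).lower(), "en")
    # Wikipedia article paths use underscores in place of spaces.
    path = quote(term.replace(" ", "_"))
    return "https://{}.wikipedia.org/wiki/{}".format(code, path)


print(build_wiki_url("PArIs", "Yes", "french"))
```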

Full capabilities of wikiscrape package include:

Search in multiple languages
Give suggestions on search terms if the search is ambiguous
Give a short summary (2 paragraphs) of the article if it is retrieved successfully
Retrieve the full text or an exact number of paragraphs as string output for a data pipeline
Call var.HELP() for the full list of functions available
Basic error handling, including checking the data type of arguments and reverting to defaults if erroneous args are given
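The "check the type, then revert to defaults" behaviour could look roughly like the sketch below. The function name clean_args and its exact rules are hypothetical; only the argument order and the default values (Yes, No, No) come from the description above.

```python
# Defaults as described in this README: autoformat Yes, language No, stoplist No.
DEFAULTS = {"autoformat": "Yes", "language": "No", "stoplist": "No"}


def clean_args(term, autoformat="Yes", language="No", stoplist="No"):
    """Validate arguments and revert bad ones to defaults (hypothetical helper)."""
    if not isinstance(term, str):
        raise TypeError("search term must be a string")

    given = {"autoformat": autoformat, "language": language, "stoplist": stoplist}
    cleaned = {}
    for name, value in given.items():
        # Anything that is not a non-empty string falls back to the default.
        if isinstance(value, str) and value.strip():
            cleaned[name] = value.strip()
        else:
            cleaned[name] = DEFAULTS[name]

    # The two Yes/No flags accept any casing; other values revert to defaults.
    for flag in ("autoformat", "stoplist"):
        value = cleaned[flag].capitalize()
        cleaned[flag] = value if value in ("Yes", "No") else DEFAULTS[flag]

    return term.strip(), cleaned


print(clean_args("Paris", 1, "french", "maybe"))
```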

Text Analytics capabilities include:

A frequency counter for the most common words in the Wikipedia article (after omitting common English words, or the NLTK stoplist for foreign languages). Can also return the top N% of the most common words, where 0 <= N <= 100.
A graph plot of the most common words in the Wikipedia article
A graph plot of the years mentioned most frequently in the article, to show which years the article focuses on
A summary of the total number of words and the total number of unique words after applying the stoplist of common words.
Analytics functions available in wikiscrape object are commonwords, commonwordspct, plotwords, plotyear, totalwords, summary, gettext.
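The counting behind these features can be sketched with collections.Counter and re, both of which are listed under the libraries used. The function names, the tiny stoplist, the year regex and the reading of "top N%" as a share of unique words are all assumptions made for this example; the real commonwords, commonwordspct and plotyear functions may work differently.

```python
import re
from collections import Counter

# Tiny stand-in stoplist; the package ships a much larger one.
STOPLIST = {"the", "a", "of", "in", "and", "is", "to", "was"}


def word_counts(text, stoplist=STOPLIST):
    """Count words in `text`, ignoring case and anything in the stoplist."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stoplist)


def top_percent(counts, pct):
    """Return the most frequent pct% of unique words (one possible reading of N%)."""
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    n = int(len(counts) * pct / 100)
    return counts.most_common(n)


def year_counts(text):
    """Count four-digit years from 1000 to 2099 mentioned in `text`."""
    return Counter(re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text))


sample = ("Coldplay is a band. Coldplay formed in London in 1996 and "
          "released Parachutes in 2000. The band toured in 2000.")
counts = word_counts(sample)
print(counts.most_common(2))
print(sum(counts.values()), len(counts))  # total words, unique words
print(year_counts(sample).most_common(1))
```

The resulting Counter objects can be passed to a matplotlib bar chart to reproduce plots like the ones in the repository images.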

Refer to the images in the repository for examples. The earliest image, 'bar.png', made 4 months ago, was the initial design for the word frequency bar chart.

Libraries used: requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist)

The package already has a comprehensive built-in stoplist to remove common words before text analytics.
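One way to combine a built-in English stoplist with the optional NLTK dependency is sketched below. The helper load_stoplist and the short stand-in word list are hypothetical; the sketch assumes that a missing NLTK install or a missing stopwords corpus should simply mean no stop words are filtered.

```python
# Small stand-in for the package's built-in English stoplist.
BUILTIN_ENGLISH = frozenset({"the", "a", "an", "of", "in", "and", "is", "to"})


def load_stoplist(language="english"):
    """Return a set of stop words for `language` (hypothetical helper)."""
    if language.lower() == "english":
        return set(BUILTIN_ENGLISH)
    try:
        # NLTK is optional, so import it only when a foreign stoplist is requested.
        from nltk.corpus import stopwords
        return set(stopwords.words(language.lower()))
    except (ImportError, LookupError, OSError):
        # NLTK is not installed, its stopwords corpus has not been downloaded,
        # or the corpus has no word list for this language.
        return set()


print(sorted(load_stoplist("english")))
```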

Updates:

26 May 2019 - Added plotyear() function to plot the most frequent years mentioned
9 June 2019 - Added markdown for explanation and added comments in the code for understanding
For any questions or suggestions, please contact me at my GitHub account - https://github.com/kohjiaxuan/
