plotwords accept arg to ignore latest N years, created test python files
kohjiaxuan committed May 8, 2020
1 parent 84a7097 commit db02129
Showing 17 changed files with 261 additions and 958 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
.idea/
.ipynb_checkpoints/
__pycache__/
823 changes: 0 additions & 823 deletions FINAL. wikiscrape package (updated 26 Nov).ipynb

This file was deleted.

Large diffs are not rendered by default.

20 changes: 11 additions & 9 deletions README.md
@@ -37,20 +37,22 @@ e.g. paris = wikiscrape.wiki('PArIs','Yes','french','Yes','Yes') means to search
Refer to images in the repository for examples. The earliest image 'bar.png' made 4 months ago was the initial design for the bar chart for word frequency. Examples of the newest images (last edit: 27 Nov 2019) are 'ColdplayWordCount2.png' and 'Donald_Trump_40words.png'.
<br><br>

#### Libraries used: requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist or lemmatization)
#### Libraries used: requests, bs4, collections, matplotlib, re, os, math, datetime, nltk (optional, only if using stoplist or lemmatization)
Refer to requirements.txt <br>
Package itself already has a comprehensive stoplist built inside to remove common words before text analytics <br>

#### Updates: <br>
1. 26 May 2019 - Added plotyear() function to plot the most frequent years mentioned, and removed years in the frequency count of word counter (commonwords & commonwordspct functions).
2. 9 June 2019 - Added markdown for explanation and added comments in the code for understanding <br>
3. 13 June 2019 - Updated documentation for plotyear, plotwords, summary and gettext function in .HELP(). <br>
4. 25 November 2019 - Update coming very soon to patch issues and improve on Wikipedia package, stay tuned! <br>
5. 27 November 2019 - Major update to the Python package, including: <br>
#### Updates:
1. <b>26 May 2019</b> - Added plotyear() function to plot the most frequent years mentioned, and removed years in the frequency count of word counter (commonwords & commonwordspct functions).
2. <b>9 June 2019</b> - Added markdown for explanation and added comments in the code for understanding <br>
3. <b>13 June 2019</b> - Updated documentation for plotyear, plotwords, summary and gettext function in .HELP(). <br>
4. <b>25 November 2019</b> - Update coming very soon to patch issues and improve on Wikipedia package, stay tuned! <br>
5. <b>27 November 2019</b> - Major update to the Python package, including: <br>
a. Adding of lemmatization feature (using NLTK) before using text analytics functions <br>
b. Better documentation via docstrings and updating HELP function <br>
c. Improving of graph plotting design and font size for plotwords and plotyear <br>
d. Fixed some bugs for graph plotting including values not showing or showing up erroneously <br>
e. Refactored the code, provided better names for key variables for user understanding <br>
f. Performance improvement of article search by removing unused variables and functions <br>
g. Tested all functions and also error handling in case user puts in wrong parameters <br><br>
6. For any questions or suggestions, please contact me at my Linkedin account - https://www.linkedin.com/in/kohjiaxuan/ <br>
g. Tested all functions and also error handling in case user puts in wrong parameters <br>
6. <b>09 May 2020</b> - New feature to exclude N number of latest years in plotwords (e.g. from current year 2020, 2019, ...) and made graph titles larger
7. For any questions or suggestions, please contact me at my Linkedin account - https://www.linkedin.com/in/kohjiaxuan/ <br>
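The README lists nltk as optional, and this commit wraps the nltk imports in a try/except. A minimal sketch of that optional-dependency pattern is shown below. The helper name `optional_import` is hypothetical and not part of wikiscrape. It catches only `ImportError`, so it detects a missing package rather than missing corpus data.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed.

    Hypothetical helper for illustration; not part of wikiscrape.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Only a missing package is handled here, not missing corpora.
        return None


nltk = optional_import("nltk")
use_nltk = nltk is not None
if not use_nltk:
    print("nltk is not installed; falling back to the built-in stoplist.")
```

Catching `ImportError` specifically, instead of a bare `except:`, keeps unrelated errors visible while still letting the package run without nltk.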
1 change: 1 addition & 0 deletions Tests/__main__.py
@@ -0,0 +1 @@
import test
7 changes: 7 additions & 0 deletions Tests/commonwords.py
@@ -0,0 +1,7 @@
from test import testclass

if __name__ == "__main__":
print(testclass.commonwords(50))

def commonwordstest(num):
return testclass.commonwords(num)
8 changes: 8 additions & 0 deletions Tests/help1.py
@@ -0,0 +1,8 @@
from test import testclass

if __name__ == "__main__":
testclass.HELP()

def help1test():
testclass.HELP()
return True
7 changes: 7 additions & 0 deletions Tests/help2.py
@@ -0,0 +1,7 @@
from test import testclass

if __name__ == "__main__":
help(testclass)

def help2test():
return help(testclass)
4 changes: 4 additions & 0 deletions Tests/plotwords.py
@@ -0,0 +1,4 @@
from test import testclass

if __name__ == "__main__":
testclass.plotwords('wordcount', 30, 1, 20)
4 changes: 4 additions & 0 deletions Tests/plotyear.py
@@ -0,0 +1,4 @@
from test import testclass

if __name__ == "__main__":
testclass.plotyear('yearcount', 30)
7 changes: 7 additions & 0 deletions Tests/summary.py
@@ -0,0 +1,7 @@
from test import testclass

if __name__ == "__main__":
testclass.summary(4)

def summarytest(num):
return testclass.summary(num, True)
23 changes: 23 additions & 0 deletions Tests/test.py
@@ -0,0 +1,23 @@
from os.path import dirname, abspath
import sys

# Get path of parent folder
parentpath = dirname(dirname(abspath(__file__)))

# Add to directory defined by sys
sys.path.append(parentpath)

# print(parentpath)
# print(sys.path)

# Now you can import wikiscrape
import wikiscrape

def newclass(title):
return wikiscrape.wiki(title, 'yes', 'en', True, True)

# Simple test
if __name__ == "__main__":
newclass('armin van buuren')
else:
testclass = wikiscrape.wiki('armin van buuren', 'yes', 'en', True, True)
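Tests/test.py makes `wikiscrape` importable by appending the folder above `Tests/` to `sys.path`. The same idea can be sketched with `pathlib`, as below. The function name `add_repo_root_to_path` is hypothetical, and the duplicate check is an addition that the committed file does not have.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(file_path):
    """Append the folder two levels above file_path to sys.path.

    Hypothetical helper; mirrors dirname(dirname(abspath(__file__))).
    """
    repo_root = str(Path(file_path).resolve().parent.parent)
    # Assumption: skip the append if the path is already present.
    if repo_root not in sys.path:
        sys.path.append(repo_root)
    return repo_root


# In a test module this would be called as add_repo_root_to_path(__file__)
# before running "import wikiscrape".
```

Appending (rather than inserting at position 0) keeps installed packages ahead of the repository copy in the import search order, which matches what the committed file does.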
1 change: 1 addition & 0 deletions __init__.py
@@ -0,0 +1 @@
from .wikiscrape import *
120 changes: 73 additions & 47 deletions wikiscrape.py
@@ -1,13 +1,11 @@
#!/usr/bin/env python
# coding: utf-8

import requests #Get the HTML code
from bs4 import BeautifulSoup #Tidy up the code
from collections import Counter #Counter to count occurrences of each word
import matplotlib.pyplot as plt #graph plotting
import re #regular expression to check if language setting is exactly 2 letters (for non common langs) in the argument
import os #for plotwords to tell where file is saved
import math #for calculating font size of graphs using exponential
import requests # Get the HTML code
from bs4 import BeautifulSoup # Tidy up the code
from collections import Counter # Counter to count occurrences of each word
import matplotlib.pyplot as plt # graph plotting
import re # regular expression to check if language setting is exactly 2 letters (for non common langs) in the argument
import os # for plotwords to tell where file is saved
import math # for calculating font size of graphs using exponential
import datetime # for getting current year

#var = wikiscrape.wiki('Article Search',optional arguments 2-4)
#Arg 1 is article name in string, Arg 2 is to format in proper case (default Yes), Arg 3 is language (default EN), Arg 4 is use stoplist of NLTK (default No)
@@ -43,24 +41,28 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
self.nltkrun = False
if isinstance(checknltk, str): #check for string yes, no and other permutations
if checknltk.lower().strip() in {'yes','true','y','t'}:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
self.nltkrun = True
elif checknltk.lower().strip() in {'no','false','n','f','na','n/a','nan'}:
self.nltkrun = False
else:
self.nltkrun = False
try:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
self.nltkrun = True
except:
print("stopwords and wordnet are not downloaded. To download, execute pip install nltk. Next, input nltk.download('stopwords') and nltk.download('wordnet')")
# nltk.download('stopwords')
# nltk.download('wordnet')
self.nltkrun = False
elif isinstance(checknltk, bool): #check for boolean yes/no
if checknltk == True:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet
self.nltkrun = True
try:
import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
self.nltkrun = True
except:
print("stopwords and wordnet are not downloaded. To download, execute pip install nltk. Next, input nltk.download('stopwords') and nltk.download('wordnet')")
# nltk.download('stopwords')
# nltk.download('wordnet')
self.nltkrun = False
else:
self.nltkrun = False
else: #run default if options are invalid - don't run nltk stoplist
@@ -73,10 +75,13 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
from nltk.stem import WordNetLemmatizer
self.lemmatizer = WordNetLemmatizer()
self.to_lemmatize = True
elif lemmatize.lower().strip() in {'no','false','n','f','na','n/a','nan'}:
self.to_lemmatize = False
else:
self.to_lemmatize = False
print('Lemmatizing of Wikipedia text is enabled!')
elif isinstance(lemmatize, bool):
if lemmatize == True:
from nltk.stem import WordNetLemmatizer
self.lemmatizer = WordNetLemmatizer()
self.to_lemmatize = True
print('Lemmatizing of Wikipedia text is enabled!')

#Default: Stopword list obtained from nltk
self.nltkstopword = []
@@ -527,11 +532,12 @@ def totalwords(self): #word count are all BEFORE banlist
return [self.fullcount,self.fullcount2,self.fullwords,self.fullwords2]

#Plot the most common words, 2nd argument allows you to choose number of words to plot, and 3rd arg is the Nth most common word to start plotting from
def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
'''plotwords accepts 3 optional arguments.
def plotwords(self,graphname='wordcount',wordcount2=20,startword=1,removeyear=10):
'''plotwords accepts 4 optional arguments.
The first argument is the filename to save as (default: wordcount.png).
The second argument (default: 20) is for the number of most frequent words to show as a GRAPH.
The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word).'''
The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word).
The fourth argument removes the latest N years from the most frequent words (default: remove latest 10 years)'''
if isinstance(wordcount2, int) == True and isinstance(startword, int) == True:
if startword < 1 or wordcount2 < 1:
self.notify = 2 #Error as out of range, use default
@@ -546,6 +552,11 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
self.wordcount2 = 20
self.startword = 1

if self.notify == 1:
print('Word count or start position specified is currently not an integer. Hence default of 20 words starting from 1st word is used for graph\n')
elif self.notify == 2:
print('Word count or start position specified must be 1 or greater. Default of 20 words starting from 1st word is used for graph\n')

# Change file name
if isinstance(graphname, str) == True:
self.graphname = graphname + '.png'
@@ -556,8 +567,27 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
#if start position is not modified (start from most common word, use default dict)
#otherwise, have to make a new dictionary for plotting graph by getting start th to start + wordcount th words

self.yearban = ['2019','2018','2017','2016','2015','2014','2013','2012','2011','2010','0000']
# Banlist, omit these years in graph
self.curyear = datetime.datetime.now().year
# Banlist, omit the last n years in plotwords graph
self.yearban = ['0000']

if isinstance(removeyear, int) == True:
if removeyear >= 0:
self.removeyear = removeyear
else:
self.removeyear = 10 # Error as years to remove cannot be negative
print('Number of latest years to exclude in word frequency graph cannot be negative. Excluding the most recent 10 years by default, starting from ' + str(self.curyear))
else:
self.removeyear = 10 # Error as not integer input, use default
print('Number of latest years to exclude in word frequency graph is invalid. Excluding the most recent 10 years by default, starting from ' + str(self.curyear))

for i in range(self.removeyear):
self.yearban.append(str(self.curyear - i))
if self.curyear - i == 0:
break
# print(self.yearban)

# Store words and freq in dictionary
self.topwords2 = {}
self.wordno_graph = 0
for i, (word, freq) in enumerate(dict(self.wordcounter.most_common()).items()):
@@ -583,7 +613,7 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
plt.rc('xtick', labelsize=20)
plt.style.use('ggplot')
self.localgraph = plt.barh(range(len(self.topwords2)),self.wordvalues,tick_label=self.wordnames)
plt.title('Word Frequency of Wiki Article: ' + self.graphtitle + ' for the Top ' + str(self.wordno_graph) + ' words, starting from word number ' + str(self.startword),fontsize=18)
plt.title('Word Frequency of Wiki Article: ' + self.graphtitle + ' for the Top ' + str(self.wordno_graph) + ' words, starting from word number ' + str(self.startword),fontsize=22)

#Colored bar graphs divided by green (most frequent words), orange (moderate), red (not as frequent)
for i in range(self.wordno_graph):
@@ -597,13 +627,7 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
plt.savefig(self.graphname)
plt.rcParams['figure.figsize'] = [22, 18]
plt.show()


if self.notify == 1:
print('Word count or start position specified is currently not an integer. Hence default of 20 words starting from 1st word is used for graph\n')
elif self.notify == 2:
print('Word count or start position specified must be 1 or greater. Default of 20 words starting from 1st word is used for graph\n')


self.cwd = os.getcwd()
print('Graph is saved as ' + self.graphname + ' in directory: ' + str(self.cwd))

@@ -658,7 +682,7 @@ def plotyear(self,graphname='yearcount',yearcount3=20):
plt.rc('xtick', labelsize=20)
plt.style.use('ggplot')
self.yeargraph = plt.barh(range(len(self.yearlist)),self.yearvalues,tick_label=self.yearnames)
plt.title('Interest in ' + self.graphtitle + ' over the years measured by Frequency Count of each Year',fontsize=20)
plt.title('Interest in ' + self.graphtitle + ' over the years measured by Frequency Count of each Year',fontsize=22)

for i in range(self.actualyearcount):
if i <= float(self.actualyearcount)/3:
@@ -732,12 +756,14 @@ def summary(self, paravalue=2, outsummary='no'):
def HELP(self):
'''Explains how to use the class object wiki and also retrieves a list of methods with their actions.'''
print('The wiki() class accepts 5 arguments. The first one is a compulsory title of the Wikipedia page. Second is to format the search string to proper/title case (Yes/No, default: Yes).')
print('Third is for language settings (e.g. English, de, francais, etc., default: English). Fourth and fifth is for implementing NLTK stoplist in provided languages and lemmatizing text respectively (Yes/No, default: No).\n\n')
print('Third is for language settings (e.g. English, de, francais, etc., default: English).')
print('Fourth is for implementing NLTK stoplist in provided language based on 3rd arg (Yes/No, default: standard stoplist provided).')
print('Fifth is for lemmatizing text (Yes/No, default: No).\n\n')
print('Functions/Methods of Wikipedia scraper package: \n')
print('commonwords accepts 1 optional argument (default: 100) for the number of most common words in the site and their frequencies to show.\n')
print('commonwordspct accepts 1 optional argument (default: 10) on the percentage threshold of word count to determine the most frequent words to show.\n')
print('plotwords accepts 3 optional arguments. The first argument is the filename to save as (default: wordcount.png). The second argument (default: 20) is for the number of most frequent words to show as a GRAPH. The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word). The third argument is the filename to save as.\n')
print('plotwords accepts 4 optional arguments. The first argument is the filename to save as (default: wordcount.png). The second argument (default: 20) is for the number of most frequent words to show as a GRAPH. The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word). The fourth argument removes the latest N years from the most frequent words (default: remove latest 10 years)\n')
print('plotyear accepts 2 optional arguments. The first argument is the filename to save as (default: yearcount.png). The second argument (default: 20) is the number of years to plot in the graph. The frequency count of the most common years will be plotted. This allows the user to understand the years of interest for the Wikipedia Topic.\n')
print('totalwords accepts 0 argument and shows the total word count and unique word count\n')
print('summary accepts 2 optional arguments, the first one for the number of paragraphs to show (default: 2) and the second one - Yes to output string and No to print text (default: No). It gives a summary of the Wikipedia page\n')
print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
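The new `removeyear` argument of `plotwords` replaces the hard-coded 2010-2019 ban list with one computed from the current year. A standalone sketch of that logic is given below. The function name `build_year_banlist` and its `current_year` parameter are hypothetical (added so the result can be checked against a fixed year), and the explicit `bool` rejection is an assumption that the committed code does not make.

```python
import datetime


def build_year_banlist(removeyear=10, current_year=None):
    """Return '0000' plus the latest `removeyear` years as strings.

    Hypothetical standalone version of the ban list built in plotwords.
    """
    if current_year is None:
        current_year = datetime.datetime.now().year

    # bool is a subclass of int in Python, so it is rejected explicitly here
    # (an assumption; the committed code accepts True/False as 1/0).
    is_valid = isinstance(removeyear, int) and not isinstance(removeyear, bool)
    if not is_valid or removeyear < 0:
        removeyear = 10  # fall back to the default, as the commit does

    banlist = ['0000']
    for i in range(removeyear):
        banlist.append(str(current_year - i))
        if current_year - i == 0:
            break  # stop at year 0, as the commit does
    return banlist


print(build_year_banlist(3, 2020))  # ['0000', '2020', '2019', '2018']
```

Passing `removeyear=0` leaves only `'0000'` in the list, so recent years such as the current one stay eligible for the word-frequency graph.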
Binary file modified wordcount.png
Binary file modified yearcount.png