plotwords accepts an argument to ignore the latest N years; created Python test files

kohjiaxuan · May 8, 2020 · db02129
1 parent 84a7097
Showing 17 changed files with 261 additions and 958 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,3 @@
+.idea/
+.ipynb_checkpoints/
+__pycache__/
diff --git a/...d Lemmatizing (shortening of words).ipynb → ...d Lemmatizing (shortening of words).ipynb b/...d Lemmatizing (shortening of words).ipynb → ...d Lemmatizing (shortening of words).ipynb
diff --git a/FINAL. wikiscrape package (updated 26 Nov).ipynb b/FINAL. wikiscrape package (updated 26 Nov).ipynb
diff --git a/...package (updated 26 Nov)-checkpoint.ipynb → FINAL. wikiscrape package.ipynb b/...package (updated 26 Nov)-checkpoint.ipynb → FINAL. wikiscrape package.ipynb
diff --git a/README.md b/README.md
@@ -37,20 +37,22 @@ e.g. paris = wikiscrape.wiki('PArIs','Yes','french','Yes','Yes') means to search
 Refer to images in the repository for examples. The earliest image 'bar.png' made 4 months ago was the initial design for the bar chart for word frequency. Examples of the newest images (last edit: 27 Nov 2019) are 'ColdplayWordCount2.png' and 'Donald_Trump_40words.png'.
 <br><br>
 
-#### Libraries used: requests, bs4, collections, matplotlib, re, os, nltk (optional, only if using stoplist or lemmatization)
+#### Libraries used: requests, bs4, collections, matplotlib, re, os, math, datetime, nltk (optional, only if using stoplist or lemmatization)
+Refer to requirements.txt <br>
 Package itself already has a comprehensive stoplist built inside to remove common words before text analytics <br>
 
-#### Updates: <br>
-1. 26 May 2019 - Added plotyear() function to plot the most frequent years mentioned, and removed years in the frequency count of word counter (commonwords & commonwordspct functions).
-2. 9 June 2019 - Added markdown for explanation and added comments in the code for understanding <br>
-3. 13 June 2019 - Updated documentation for plotyear, plotwords, summary and gettext function in .HELP(). <br>
-4. 25 November 2019 - Update coming very soon to patch issues and improve on Wikipedia package, stay tuned! <br>
-5. 27 November 2019 - Major update to the Python package, including: <br>
+#### Updates:
+1. <b>26 May 2019</b> - Added plotyear() function to plot the most frequent years mentioned, and removed years in the frequency count of word counter (commonwords & commonwordspct functions).
+2. <b>9 June 2019</b> - Added markdown for explanation and added comments in the code for understanding <br>
+3. <b>13 June 2019</b> - Updated documentation for plotyear, plotwords, summary and gettext function in .HELP(). <br>
+4. <b>25 November 2019</b> - Update coming very soon to patch issues and improve on Wikipedia package, stay tuned! <br>
+5. <b>27 November 2019</b> - Major update to the Python package, including: <br>
  a. Adding of lemmatization feature (using NLTK) before using text analytics functions <br>
  b. Better documentation via docstrings and updating HELP function <br>
  c. Improving of graph plotting design and font size for plotwords and plotyear <br>
  d. Fixed some bugs for graph plotting including values not showing or showing up erroneously <br>
  e. Refactored the code, provided better names for key variables for user understanding <br>
  f. Performance improvement of article search by removing unused variables and functions <br>
- g. Tested all functions and also error handling in case user puts in wrong parameters <br><br>
-6. For any questions or suggestions, please contact me at my Linkedin account - https://www.linkedin.com/in/kohjiaxuan/ <br>
+ g. Tested all functions and also error handling in case user puts in wrong parameters <br>
+6. <b>09 May 2020</b> - New feature to exclude the latest N years from plotwords (counting back from the current year, e.g. 2020, 2019, ...) and made graph titles larger
+7. For any questions or suggestions, please contact me at my LinkedIn account - https://www.linkedin.com/in/kohjiaxuan/ <br>
diff --git a/Tests/__main__.py b/Tests/__main__.py
@@ -0,0 +1 @@
+import test
diff --git a/Tests/commonwords.py b/Tests/commonwords.py
@@ -0,0 +1,7 @@
+from test import testclass
+
+if __name__ == "__main__":
+    print(testclass.commonwords(50))
+
+def commonwordstest(num):
+    return testclass.commonwords(num)
diff --git a/Tests/help1.py b/Tests/help1.py
@@ -0,0 +1,8 @@
+from test import testclass
+
+if __name__ == "__main__":
+    testclass.HELP()
+
+def help1test():
+    testclass.HELP()
+    return True
diff --git a/Tests/help2.py b/Tests/help2.py
@@ -0,0 +1,7 @@
+from test import testclass
+
+if __name__ == "__main__":
+    help(testclass)
+
+def help2test():
+    return help(testclass)
diff --git a/Tests/plotwords.py b/Tests/plotwords.py
@@ -0,0 +1,4 @@
+from test import testclass
+
+if __name__ == "__main__":
+    testclass.plotwords('wordcount', 30, 1, 20)
diff --git a/Tests/plotyear.py b/Tests/plotyear.py
@@ -0,0 +1,4 @@
+from test import testclass
+
+if __name__ == "__main__":
+    testclass.plotyear('yearcount', 30)
diff --git a/Tests/summary.py b/Tests/summary.py
@@ -0,0 +1,7 @@
+from test import testclass
+
+if __name__ == "__main__":
+    testclass.summary(4)
+
+def summarytest(num):
+    return testclass.summary(num, True)
diff --git a/Tests/test.py b/Tests/test.py
@@ -0,0 +1,23 @@
+from os.path import dirname, abspath
+import sys
+
+# Get path of parent folder
+parentpath = dirname(dirname(abspath(__file__)))
+
+# Add to directory defined by sys
+sys.path.append(parentpath)
+
+# print(parentpath)
+# print(sys.path)
+
+# Now you can import wikiscrape
+import wikiscrape
+
+def newclass(title):
+    return wikiscrape.wiki(title, 'yes', 'en', True, True)
+
+# Simple test
+if __name__ == "__main__":
+    newclass('armin van buuren')
+else:
+    testclass = wikiscrape.wiki('armin van buuren', 'yes', 'en', True, True)
diff --git a/__init__.py b/__init__.py
@@ -0,0 +1 @@
+from .wikiscrape import *
diff --git a/wikiscrape.py b/wikiscrape.py
@@ -1,13 +1,11 @@
-#!/usr/bin/env python
-# coding: utf-8
-
-import requests #Get the HTML code
-from bs4 import BeautifulSoup #Tidy up the code
-from collections import Counter #Counter to count occurances of each word
-import matplotlib.pyplot as plt #graph plotting
-import re #regular expression to check if language setting is exactly 2 letters (for non common langs) in the argument
-import os #for plotwords to tell where file is saved
-import math #for calculating font size of graphs using exponential
+import requests # Get the HTML code
+from bs4 import BeautifulSoup # Tidy up the code
+from collections import Counter # Counter to count occurrences of each word
+import matplotlib.pyplot as plt # graph plotting
+import re # regular expression to check if language setting is exactly 2 letters (for non common langs) in the argument
+import os # for plotwords to tell where file is saved
+import math # for calculating font size of graphs using exponential
+import datetime # for getting current year
 
 #var = wikiscrape.wiki('Article Search',optional arguments 2-4)
 #Arg 1 is article name in string, Arg 2 is to format in proper case (default Yes), Arg 3 is language (default EN), Arg 4 is use stoplist of NLTK (default No)
@@ -43,24 +41,28 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
         self.nltkrun = False
         if isinstance(checknltk, str): #check for string yes, no and other permutations
             if checknltk.lower().strip() in {'yes','true','y','t'}:
-                import nltk
-                nltk.download('stopwords')
-                nltk.download('wordnet')
-                from nltk.corpus import stopwords
-                from nltk.corpus import wordnet
-                self.nltkrun = True
-            elif checknltk.lower().strip() in {'no','false','n','f','na','n/a','nan'}:
-                self.nltkrun = False
-            else:
-                self.nltkrun = False
+                try:
+                    import nltk
+                    from nltk.corpus import stopwords
+                    from nltk.corpus import wordnet
+                    self.nltkrun = True
+                except:
+                    print("stopwords and wordnet are not downloaded. To download, execute pip install nltk. Next, input nltk.download('stopwords') and nltk.download('wordnet')")
+                    # nltk.download('stopwords')
+                    # nltk.download('wordnet')
+                    self.nltkrun = False
         elif isinstance(checknltk, bool): #check for boolean yes/no
             if checknltk == True:
-                import nltk
-                nltk.download('stopwords')
-                nltk.download('wordnet')
-                from nltk.corpus import stopwords
-                from nltk.corpus import wordnet
-                self.nltkrun = True
+                try:
+                    import nltk
+                    from nltk.corpus import stopwords
+                    from nltk.corpus import wordnet
+                    self.nltkrun = True
+                except:
+                    print("stopwords and wordnet are not downloaded. To download, execute pip install nltk. Next, input nltk.download('stopwords') and nltk.download('wordnet')")
+                    # nltk.download('stopwords')
+                    # nltk.download('wordnet')
+                    self.nltkrun = False
             else:
                 self.nltkrun = False
         else: #run default if options are invalid - don't run nltk stoplist
@@ -73,10 +75,13 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
  from nltk.stem import WordNetLemmatizer
  self.lemmatizer = WordNetLemmatizer()
  self.to_lemmatize = True
- elif lemmatize.lower().strip() in {'no','false','n','f','na','n/a','nan'}:
- self.to_lemmatize = False
- else:
- self.to_lemmatize = False
+ print('Lemmatizing of Wikipedia text is enabled!')
+ elif isinstance(lemmatize, bool):
+ if lemmatize == True:
+ from nltk.stem import WordNetLemmatizer
+ self.lemmatizer = WordNetLemmatizer()
+ self.to_lemmatize = True
+ print('Lemmatizing of Wikipedia text is enabled!')
 
  #Default: Stopword list obtained from nltk
  self.nltkstopword = []
@@ -527,11 +532,12 @@ def totalwords(self): #word count are all BEFORE banlist
  return [self.fullcount,self.fullcount2,self.fullwords,self.fullwords2]
 
  #Plot the most common words, 2nd argument allows you to choose number of words to plot, and 3rd arg is the Nth most common word to start plotting from
- def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
- '''plotwords accepts 3 optional arguments. 
+ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1,removeyear=10):
+ '''plotwords accepts 4 optional arguments. 
  The first argument is the filename to save as (default: wordcount.png).
  The second argument (default: 20) is for the number of most frequent words to show as a GRAPH. 
- The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word).'''
+ The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word).
+ The fourth argument removes the latest N years from the most frequent words (default: remove latest 10 years)'''
  if isinstance(wordcount2, int) == True and isinstance(startword, int) == True:
  if startword < 1 or wordcount2 < 1:
  self.notify = 2 #Error as out of range, use default
@@ -546,6 +552,11 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
  self.wordcount2 = 20
  self.startword = 1
 
+        if self.notify == 1:
+            print('Word count or start position specified is currently not an integer. Hence default of 20 words starting from 1st word is used for graph\n')
+        elif self.notify == 2:
+            print('Word count or start position specified must be 1 or greater. Default of 20 words starting from 1st word is used for graph\n')
+
  # Change file name
  if isinstance(graphname, str) == True:
  self.graphname = graphname + '.png'
@@ -556,8 +567,27 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
  #if start position is not modified (start from most common word, use default dict)
  #otherwise, have to make a new dictionary for plotting graph by getting start th to start + wordcount th words
 
- self.yearban = ['2019','2018','2017','2016','2015','2014','2013','2012','2011','2010','0000'] 
- # Banlist, omit these years in graph
+        self.curyear = datetime.datetime.now().year
+        # Banlist, omit the last n years in plotwords graph
+        self.yearban = ['0000']
+
+        if isinstance(removeyear, int) == True:
+            if removeyear >= 0:
+                self.removeyear = removeyear
+            else:
+                self.removeyear = 10 # Error as years to remove cannot be negative
+                print('Number of latest years to exclude in word frequency graph cannot be negative. Excluding the most recent 10 years by default, starting from ' + str(self.curyear))
+        else:
+            self.removeyear = 10 # Error as not integer input, use default
+            print('Number of latest years to exclude in word frequency graph is invalid. Excluding the most recent 10 years by default, starting from ' + str(self.curyear))
+
+        for i in range(self.removeyear):
+            self.yearban.append(str(self.curyear - i))
+            if self.curyear - i == 0:
+                break
+        # print(self.yearban)
+
+        # Store words and freq in dictionary
  self.topwords2 = {}
  self.wordno_graph = 0
  for i, (word, freq) in enumerate(dict(self.wordcounter.most_common()).items()):
@@ -583,7 +613,7 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
  plt.rc('xtick', labelsize=20) 
  plt.style.use('ggplot')
  self.localgraph = plt.barh(range(len(self.topwords2)),self.wordvalues,tick_label=self.wordnames)
- plt.title('Word Frequency of Wiki Article: ' + self.graphtitle + ' for the Top ' + str(self.wordno_graph) + ' words, starting from word number ' + str(self.startword),fontsize=18)
+ plt.title('Word Frequency of Wiki Article: ' + self.graphtitle + ' for the Top ' + str(self.wordno_graph) + ' words, starting from word number ' + str(self.startword),fontsize=22)
 
  #Colored bar graphs divided by green (most frequent words), orange (moderate), red (not as frequent)
  for i in range(self.wordno_graph):
@@ -597,13 +627,7 @@ def plotwords(self,graphname='wordcount',wordcount2=20,startword=1):
  plt.savefig(self.graphname)
  plt.rcParams['figure.figsize'] = [22, 18]
  plt.show()
-
-
- if self.notify == 1:
- print('Word count or start position specified is currently not an integer. Hence default of 20 words starting from 1st word is used for graph\n')
- elif self.notify == 2:
- print('Word count or start position specified must be 1 or greater. Default of 20 words starting from 1st word is used for graph\n')
-
+
  self.cwd = os.getcwd()
  print('Graph is saved as ' + self.graphname + ' in directory: ' + str(self.cwd))
 
@@ -658,7 +682,7 @@ def plotyear(self,graphname='yearcount',yearcount3=20):
  plt.rc('xtick', labelsize=20) 
  plt.style.use('ggplot')
  self.yeargraph = plt.barh(range(len(self.yearlist)),self.yearvalues,tick_label=self.yearnames)
- plt.title('Interest in ' + self.graphtitle + ' over the years measured by Frequency Count of each Year',fontsize=20)
+ plt.title('Interest in ' + self.graphtitle + ' over the years measured by Frequency Count of each Year',fontsize=22)
 
  for i in range(self.actualyearcount):
  if i <= float(self.actualyearcount)/3:
@@ -732,12 +756,14 @@ def summary(self, paravalue=2, outsummary='no'):
  def HELP(self):
  '''Explains how to use the class object wiki and also retrieves a list of methods with their actions.'''
  print('The wiki() class accepts 5 arguments. The first one is a compulsory title of the Wikipedia page. Second is to format the search string to proper/title case (Yes/No, default: Yes).')
- print('Third is for language settings (e.g. English, de, francais, etc., default: English). Fourth and fifth is for implementing NLTK stoplist in provided languages and lemmatizing text respectively (Yes/No, default: No).\n\n')
+ print('Third is for language settings (e.g. English, de, francais, etc., default: English).')
+ print('Fourth is for implementing NLTK stoplist in provided language based on 3rd arg (Yes/No, default: standard stoplist provided).')
+ print('Fifth is for lemmatizing text (Yes/No, default: No).\n\n')
  print('Functions/Methods of Wikipedia scraper package: \n')
  print('commonwords accepts 1 optional argument (default: 100) for the number of most common words in the site and their frequencies to show.\n')
  print('commonwordspct accepts 1 optional argument (default: 10) on the percentage threshold of word count to determine the most frequent words to show.\n')
- print('plotwords accepts 3 optional arguments. The first argument is the filename to save as (default: wordcount.png). The second argument (default: 20) is for the number of most frequent words to show as a GRAPH. The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word). The third argument is the filename to save as.\n')
+ print('plotwords accepts 4 optional arguments. The first argument is the filename to save as (default: wordcount.png). The second argument (default: 20) is for the number of most frequent words to show as a GRAPH. The third argument is the Nth most frequent word to start plotting from. (default: 1, starting from most frequent word). The fourth argument removes the latest N years from the most frequent words (default: remove latest 10 years)\n')
  print('plotyear accepts 2 optional argument. The first argument is the filename to save as (default: yearcount.png). The second argument (default: 20) is the number of years to plot in the graph. The frequency count of the most common years will be plotted. This allows the user to understand the years of interest for the Wikipedia Topic.\n')
  print('totalwords accepts 0 argument and shows the total word count and unique word count\n')
  print('summary accepts 2 optional arguments, the first one for the number of paragraphs to show (default: 2) and the second one - Yes to output string and No to print text (default: No). It gives a summary of the Wikipedia page\n')
- print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
+ print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
diff --git a/wordcount.png b/wordcount.png
diff --git a/yearcount.png b/yearcount.png
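
Note on the new `removeyear` argument: the year-exclusion logic added to `plotwords` in this commit can be read as a small standalone function. The block below is a minimal sketch of that logic, not the code in `wikiscrape.py`. The function name `build_year_banlist` and the `current_year` parameter are hypothetical, introduced only so the behaviour can be checked without depending on today's date.

```python
import datetime


def build_year_banlist(removeyear=10, current_year=None):
    """Return the year strings to leave out of the word-frequency graph.

    Hypothetical helper mirroring the validation and loop in this commit:
    invalid or negative input falls back to the default of 10 years.
    """
    if current_year is None:
        # Same source for the current year as the commit uses
        current_year = datetime.datetime.now().year

    # Fall back to the default when the input is not a non-negative integer
    if not isinstance(removeyear, int) or removeyear < 0:
        removeyear = 10

    banlist = ['0000']  # '0000' is always excluded, as in the commit
    for offset in range(removeyear):
        year = current_year - offset
        banlist.append(str(year))
        if year == 0:
            # Stop at year 0 rather than producing negative years
            break
    return banlist


if __name__ == "__main__":
    print(build_year_banlist(3, current_year=2020))
```

Passing `removeyear=0` leaves only `'0000'` in the list, so recent years remain in the graph. This matches the commit's check of `removeyear >= 0`.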