Commit

Major update to wikiscrape.py, tidied up files in repo and updates are listed in readme
kohjiaxuan committed Dec 2, 2019
1 parent 632a750 commit bfba605
Showing 44 changed files with 3,444 additions and 571 deletions.

Large diffs are not rendered by default.

Binary file modified ArminBuuren.png
Binary file modified Armin_top10words_25112019.png
Binary file modified Armin_top30years_25112019.png
Binary file modified Armin_top40words_25112019.png
Binary file modified ColdplayWordCount2.png
Binary file modified ColdplayYearCount3.png
Binary file modified Donald_Trump_30years.png
Binary file modified Donald_Trump_40words.png
8 changes: 4 additions & 4 deletions FINAL. wikiscrape package (updated 26 Nov).ipynb
@@ -411,8 +411,8 @@
" self.parashort.append(paragraph)\n",
" self.noofpara += 1\n",
"\n",
- " #Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful - cleantext_summary\n",
- " self.parashort2 = self.cleantext_summary(self.parashort)\n",
+ " #Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful - __cleantext_summary\n",
+ " self.parashort2 = self.__cleantext_summary(self.parashort)\n",
" \n",
" #REMOVE UNWANTED ARRAYS\n",
" self.parashort = [] \n",
@@ -453,7 +453,7 @@
" print('Other useful information: Enclose title argument with single quotes. Spaces are allowed, and title is case insensitive.')\n",
" \n",
" \n",
- " def cleantext_summary(self, corpus):\n",
+ " def __cleantext_summary(self, corpus):\n",
" '''Gets summary of the text, internal method'''\n",
" #Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful\n",
" corpus = list(str(corpus)) #chop everything into letters for usage\n",
@@ -747,7 +747,7 @@
" self.parashort.append(paragraph)\n",
" self.noofpara += 1\n",
"\n",
- " self.parashort2 = self.cleantext_summary(self.parashort)\n",
+ " self.parashort2 = self.__cleantext_summary(self.parashort)\n",
" #REMOVE UNWANTED ARRAYS\n",
" self.parashort = [] \n",
" \n",
Binary file added Images/ArminBuuren.png
Binary file added Images/Armin_top10words_25112019.png
Binary file added Images/Armin_top30years_25112019.png
Binary file added Images/Armin_top40words_25112019.png
Binary file added Images/ColdplayWordCount2.png
Binary file added Images/ColdplayYearCount3.png
Binary file added Images/Donald_Trump_30years.png
Binary file added Images/Donald_Trump_40words.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes.
File renamed without changes.
File renamed without changes.

Large diffs are not rendered by default.

222 changes: 132 additions & 90 deletions WIKI10. Added Lemmatizing (shortening of words).ipynb

Large diffs are not rendered by default.

Binary file modified __pycache__/wikiscrape.cpython-36.pyc
Binary file not shown.
384 changes: 0 additions & 384 deletions wikiscrape old.txt

This file was deleted.

13 changes: 5 additions & 8 deletions wikiscrape.py
@@ -1,9 +1,6 @@
#!/usr/bin/env python
# coding: utf-8

- # In[ ]:


import requests #Get the HTML code
from bs4 import BeautifulSoup #Tidy up the code
from collections import Counter #Counter to count occurances of each word
@@ -367,8 +364,8 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
self.parashort.append(paragraph)
self.noofpara += 1

- #Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful - cleantext_summary
- self.parashort2 = self.cleantext_summary(self.parashort)
+ #Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful - __cleantext_summary
+ self.parashort2 = self.__cleantext_summary(self.parashort)

#REMOVE UNWANTED ARRAYS
self.parashort = []
@@ -409,7 +406,7 @@ def __init__(self,title,option='Yes',lang='en',checknltk='No',lemmatize='No'):
print('Other useful information: Enclose title argument with single quotes. Spaces are allowed, and title is case insensitive.')


- def cleantext_summary(self, corpus):
+ def __cleantext_summary(self, corpus):
'''Gets summary of the text, internal method'''
#Data cleaning for printing out summary of Wikipedia (2 paragraphs) if search is successful
corpus = list(str(corpus)) #chop everything into letters for usage
@@ -703,7 +700,7 @@ def summary(self, paravalue=2, outsummary='no'):
self.parashort.append(paragraph)
self.noofpara += 1

- self.parashort2 = self.cleantext_summary(self.parashort)
+ self.parashort2 = self.__cleantext_summary(self.parashort)
#REMOVE UNWANTED ARRAYS
self.parashort = []
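
Review note: the rename from `cleantext_summary` to `__cleantext_summary` relies on Python's name mangling for double-underscore attributes. The sketch below illustrates that behaviour in isolation. The class `Page` and the method `__clean` are hypothetical stand-ins and are not taken from wikiscrape.py.

```python
class Page:
    """Hypothetical stand-in for the scraper class; not the real wikiscrape code."""

    def __init__(self, paragraphs):
        # Inside the class body, self.__clean is compiled as self._Page__clean.
        self.summary_text = self.__clean(paragraphs)

    def __clean(self, paragraphs):
        """Join paragraphs and collapse runs of whitespace (illustrative only)."""
        return " ".join(" ".join(paragraphs).split())


page = Page(["First  paragraph.", "Second paragraph."])
print(page.summary_text)

# The unmangled name is not an attribute of the instance...
print(hasattr(page, "__clean"))
# ...but the mangled name is, so this is a naming convention, not access control.
print(hasattr(page, "_Page__clean"))
```

Because mangling is applied only inside the class body, every internal call site has to use the double-underscore name, which is consistent with the call sites updated in this commit.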

@@ -743,4 +740,4 @@ def HELP(self):
print('plotyear accepts 2 optional argument. The first argument is the filename to save as (default: yearcount.png). The second argument (default: 20) is the number of years to plot in the graph. The frequency count of the most common years will be plotted. This allows the user to understand the years of interest for the Wikipedia Topic.\n')
print('totalwords accepts 0 argument and shows the total word count and unique word count\n')
print('summary accepts 2 optional arguments, the first one for the number of paragraphs to show (default: 2) and the second one - Yes to output string and No to print text (default: No). It gives a summary of the Wikipedia page\n')
- print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
+ print('gettext accepts 1 optional argument - Yes to output string and No to print text (default: No). It retrieves the full text of the Wikipedia title\n')
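
Review note: the help text above describes `plotyear` as plotting the frequency count of the most common years. A minimal sketch of that kind of count, using `collections.Counter` (which the module already imports), might look like the following. The function name `top_years`, the regular expression, and the 1000-2099 year range are assumptions for illustration, not the implementation in wikiscrape.py.

```python
import re
from collections import Counter


def top_years(text, n=20):
    """Return the n most frequent four-digit years found in text.

    Assumption: a "year" is any standalone number from 1000 to 2099.
    """
    years = re.findall(r"\b(1\d{3}|20\d{2})\b", text)
    return Counter(years).most_common(n)


sample = "Formed in 1996, the band toured in 2000 and 2005, then again in 2005."
print(top_years(sample, n=2))
```

The resulting list of `(year, count)` pairs could then be passed to a plotting routine; the default `n=20` mirrors the default stated in the help text.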
Binary file modified wordcount.png
Binary file modified yearcount.png

0 comments on commit bfba605
