Commit: Modify notebook
Haowen Ni committed Nov 2, 2017
1 parent a32d6f5 commit d49cc09
Showing 1 changed file with 31 additions and 3 deletions.
34 changes: 31 additions & 3 deletions hcds-a2-bias.ipynb
@@ -80,7 +80,9 @@
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Import population data\n",
@@ -132,7 +134,9 @@
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Based on population data, construct a dictionary as following:\n",
@@ -1296,7 +1300,9 @@
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"gafa_to_articles = pandas.DataFrame(\n",
@@ -1532,6 +1538,28 @@
"## Analysis and Write-Up\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What I learned in this project\n",
" \n",
"This project gave me an opportunity to practice basic data-preprocessing skills in Python, such as loading CSV files, constructing dictionaries, splitting a dataset into batches, and using pandas DataFrames to combine datasets and perform more advanced aggregations and calculations. I also became more familiar with using the WikiMedia API from Python. I learned that, to improve efficiency, I could pass multiple revision ids in each API call. It was an interesting experience to discover that the API sometimes fails to return the desired result and raises an error instead; I spent a large amount of time debugging this issue. In fact, several days ago I found only two revision ids without an article quality prediction, but today I found two more revision ids that lack predictions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What I found from the calculations\n",
" \n",
"For the first ratio we calculated, the number of politician articles per country population, I noticed that the top-ranked countries are relatively small in both territory and population, yet many people still write about their politicians, because small countries tend to have more political and domestic events and changes. \n",
"\n",
"In contrast, the bottom-ranked countries have much lower coverage of politician articles on English Wikipedia than the top-ranked countries (roughly a 100,000-fold difference). There are two cases. For countries like India and China, even though the number of politician articles on English Wikipedia is not small, the final coverage percentage is still tiny because those countries' populations (the denominators in the calculation) are extremely large. For North Korea, on the other hand, even though the population is not relatively large, the number of politician articles on English Wikipedia is small: North Korea's network is largely cut off from the worldwide Internet, so fewer people are able to publish about, or even know about, North Korea's politics.\n",
"\n",
"For the second ratio we calculated, the ratio of the number of 'high quality' articles to the total number of articles, North Korea ranks first. I think there is some bias in this result: articles about North Korea's politicians may not actually be written to a very high standard, but because articles about North Korea are so rare, the algorithm may treat them as unusually valuable and mark them as featured or good articles."
]
},
{
"cell_type": "code",
"execution_count": null,

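The write-up mentions improving API efficiency by passing multiple revision ids per call. A minimal sketch of that batching idea is below; the chunk size of 50 and the helper names are illustrative assumptions, not the notebook's actual code.

```python
# Sketch of batching revision ids for the ORES article-quality API,
# as described in the write-up. The chunk size (50) and the function
# names are illustrative assumptions, not the notebook's actual code.

def chunk(ids, size=50):
    """Split a list of revision ids into batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def build_ores_url(rev_ids, project="enwiki", model="wp10"):
    """Build one ORES request URL that scores many revisions at once."""
    revids = "|".join(str(r) for r in rev_ids)
    return (f"https://ores.wikimedia.org/v3/scores/{project}/"
            f"?models={model}&revids={revids}")

# Each batch becomes a single API call instead of one call per id.
batches = chunk([101, 102, 103, 104, 105], size=2)
urls = [build_ores_url(b) for b in batches]
```

Batching this way also makes the error the author describes easier to isolate: when one call fails, only that batch's revision ids need to be re-checked individually.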
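The articles-per-population ratio discussed in the write-up can be sketched in a few lines of plain Python. The population and article counts below are made-up placeholders, not the assignment's data; they only illustrate why a small country can rank orders of magnitude above a populous one.

```python
# Illustrative per-capita coverage calculation, as described in the
# write-up. The figures below are placeholder values, NOT the actual
# assignment data.

population = {"Tuvalu": 11_000, "India": 1_324_000_000}
article_count = {"Tuvalu": 55, "India": 990}

# Articles per capita, expressed as a percentage of the population.
coverage = {c: 100.0 * article_count[c] / population[c] for c in population}

# The denominator dominates: even with far more articles, India's
# coverage ends up several orders of magnitude below Tuvalu's.
ratio = coverage["Tuvalu"] / coverage["India"]
```

This is the effect the author describes: the enormous denominators for India and China drive their coverage percentage toward zero regardless of the article count.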