lzctony/data-512-a2


Data 512 A2: Bias in data

Project Goal

The goal of this assignment is to explore the concept of 'bias' through data on Wikipedia articles, specifically articles on political figures from a variety of countries. First, we use a machine learning service called ORES (Objective Revision Evaluation Service) to estimate the quality of each article in the Wikipedia dataset. Then, we combine the dataset of Wikipedia articles with a dataset of country populations by matching the country names. With the combined dataset, I analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries, identifying:

  • the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
  • the countries with the highest and lowest proportion of high quality articles about politicians.

Tool

I use Jupyter Notebook to write the Python code that accesses, manipulates, and plots the data.

You can install Python and Jupyter Notebook by downloading the Python 3.6 distribution from Anaconda.

The following Python packages are used throughout this project:

  • requests
  • json
  • csv
  • pandas

License of The Source Data

The Wikipedia dataset (Politicians by Country from the English-language Wikipedia)

The Population dataset

Copyright © 2016, Population Reference Bureau. All rights reserved.

ORES: By using the ORES (Objective Revision Evaluation Service) API, you agree to Wikimedia's Terms of Use and Privacy Policy.

Also under CC-BY-SA 4.0.

Wikipedia Dataset

The Wikipedia dataset can be found on Figshare. The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. The columns are:

Column Description
country 'containing the sanitised country name, extracted from the category name'
page 'containing the unsanitised page title'
last_edit 'containing the edit ID of the last edit to the page'

You can also download the dataset directly by clicking DOWNLOAD
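
As an illustration of the file layout described above, here is a minimal sketch of reading the article list with the csv package. The sample rows are invented, and the header names are assumed to match the column table exactly (country, page, last_edit); adjust them if the downloaded file differs.

```python
import csv
import io

# Hypothetical sample with the same three columns as page_data.csv.
SAMPLE = """country,page,last_edit
Nauru,Example Politician A,123456
Iceland,Example Politician B,234567
India,Example Politician C,345678
"""


def read_page_data(handle):
    """Return a list of dicts with keys country, page, and last_edit (as int)."""
    rows = []
    for row in csv.DictReader(handle):
        # The edit ID is numeric; convert it so it can be passed to an API later.
        row["last_edit"] = int(row["last_edit"])
        rows.append(row)
    return rows


pages = read_page_data(io.StringIO(SAMPLE))
```

To read the real file instead of the sample, pass an open file handle, for example `open("data/page_data.csv", newline="", encoding="utf-8")`.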

Getting Article Quality Predictions

Next, we get the predicted quality score for each article in the Wikipedia dataset above, using a Wikimedia API endpoint for a machine learning system called ORES (Objective Revision Evaluation Service). ORES estimates the quality of an article at a particular point in time and assigns a probability to each of 6 quality categories. The categories are, from best to worst:

Category Description
FA Featured article
GA Good article
B B-class article
C C-class article
Start Start-class article
Stub Stub-class article

The documentation can be found here.

When you query the API, you will notice that ORES returns a prediction value that contains the name of one category, as well as probability values for each of the 6 quality categories above. For this project, I only capture and use the value for prediction.

Note: ORES did not return a prediction value for four articles in the Wikipedia data.
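
The query step can be sketched as follows. This is not the notebook's code: the endpoint URL, the `wp10` model name, and the response layout are assumptions based on the ORES v3 scores API, so check them against the ORES documentation. The sample response is invented. It shows one scored revision and one revision that returned an error, which is how an article ends up with no prediction.

```python
from urllib.parse import urlencode

# Assumed endpoint and model name (not stated in this README).
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
MODEL = "wp10"


def build_ores_url(revision_ids):
    """Build a query URL that asks for scores for a batch of revision IDs."""
    query = urlencode({
        "models": MODEL,
        "revids": "|".join(str(r) for r in revision_ids),
    })
    return f"{ORES_ENDPOINT}?{query}"


def extract_predictions(response, model=MODEL):
    """Map each revision ID to its predicted category, or None if unscored."""
    scores = response.get("enwiki", {}).get("scores", {})
    predictions = {}
    for rev_id, by_model in scores.items():
        score = by_model.get(model, {}).get("score")
        # Only the prediction is kept; the per-category probabilities are ignored.
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


# Hypothetical response in the assumed v3 layout.
sample_response = {
    "enwiki": {
        "scores": {
            "123456": {"wp10": {"score": {
                "prediction": "Stub",
                "probability": {"FA": 0.01, "GA": 0.02, "B": 0.03,
                                "C": 0.04, "Start": 0.1, "Stub": 0.8},
            }}},
            "234567": {"wp10": {"error": {
                "type": "RevisionNotFound",
                "message": "revision could not be scored",
            }}},
        }
    }
}

url = build_ores_url([123456, 234567])
predictions = extract_predictions(sample_response)
```

In a live run, the JSON returned by something like `requests.get(url).json()` would be passed to `extract_predictions` in place of `sample_response`.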

Population data

The population data is available on the Population Reference Bureau website. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).

It is formatted as a CSV and saved as Population Mid-2015.csv. Columns are:

Column Description
Location 'country name'
Location Type 'country'
TimeFrame 'Mid-2015'
Data Type 'string'
Data 'population'
Footnotes ''

You can also download the dataset directly by clicking DOWNLOAD
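
A minimal sketch of loading the population file with pandas is shown below. The sample rows and their values are invented, and it is an assumption that the population figures are written with commas as thousands separators; if the real file contains extra header rows, `skiprows` may also be needed.

```python
import io

import pandas as pd

# Hypothetical sample using the column names listed above; values are made up.
SAMPLE = """Location,Location Type,TimeFrame,Data Type,Data,Footnotes
Iceland,Country,Mid-2015,Number,"330,828",
India,Country,Mid-2015,Number,"1,314,097,616",
"""

# thousands="," turns strings such as "330,828" into integers (assumed format).
raw = pd.read_csv(io.StringIO(SAMPLE), thousands=",")

# Keep only the fields needed later and give them the final dataset's names.
population = raw.rename(columns={"Location": "country", "Data": "population"})
population = population[["country", "population"]]
```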

Final Data File

After retrieving the ORES (Objective Revision Evaluation Service) prediction for each article, I merge the Wikipedia data and the population data for further analysis. Both datasets have a field containing country names for that purpose. After merging, I drop the observations that cannot be matched (either the population dataset has no entry for the Wikipedia country, or vice versa). The variables of the final dataset are shown below:

Column Description
country 'country name'
article_name 'article name'
revision_id 'last edit id'
article_quality 'article quality prediction by ORES'
population 'population of the country'

Note: revision_id here is the same value as last_edit in the Wikipedia dataset, which we used to get article predictions from the ORES API.
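
The merge described in this section can be sketched with pandas as an inner join on the country name, which drops rows that appear in only one dataset. The two small frames below are invented for illustration (the unmatched country names are placeholders), and the intermediate column names are assumptions.

```python
import pandas as pd

# Hypothetical article data after the ORES step.
articles = pd.DataFrame({
    "country": ["Iceland", "India", "Unmatched Country X"],
    "page": ["Example Politician A", "Example Politician B", "Example Politician C"],
    "last_edit": [123456, 234567, 345678],
    "article_quality": ["GA", "Stub", "C"],
})

# Hypothetical population data.
population = pd.DataFrame({
    "country": ["Iceland", "India", "Unmatched Country Y"],
    "population": [330828, 1314097616, 1000],
})

# how="inner" keeps only countries present in both datasets.
merged = articles.merge(population, on="country", how="inner")

# Rename to the final column names and fix the column order.
final = merged.rename(columns={"page": "article_name", "last_edit": "revision_id"})
final = final[["country", "article_name", "revision_id",
               "article_quality", "population"]]
```

Passing `indicator=True` with `how="outer"` to `merge` is one way to list the unmatched countries before dropping them.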

Table

From the final dataset, I made 4 tables for the two analyses below:

  • the countries with the greatest and least coverage of politicians on Wikipedia compared to their population (tables 1 and 2).
  • the countries with the highest and lowest proportion of high quality articles about politicians (tables 3 and 4).
  1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
country article_population_percentage(%)
Nauru 0.488029%
Tuvalu 0.466102%
San Marino 0.248485%
Monaco 0.105020%
Liechtenstein 0.077189%
Marshall Islands 0.067273%
Iceland 0.062268%
Tonga 0.060987%
Andorra 0.043590%
Federated States of Micronesia 0.036893%
  2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
country article_population_percentage(%)
India 0.000075%
China 0.000083%
Indonesia 0.000084%
Uzbekistan 0.000093%
Ethiopia 0.000107%
Korea, North 0.000156%
Zambia 0.000168%
Thailand 0.000172%
Congo, Dem. Rep. of 0.000194%
Bangladesh 0.000202%
  3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
country high_quality_percentage(%)
Korea, North 23.076923%
Saudi Arabia 11.764706%
Uzbekistan 10.344828%
Central African Republic 10.294118%
Romania 9.770115%
Guinea-Bissau 9.523810%
Bhutan 9.090909%
Vietnam 8.376963%
Dominica 8.333333%
Mauritania 7.692308%
  4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
country high_quality_percentage(%)
Turkmenistan 0.0%
Tajikistan 0.0%
Monaco 0.0%
Mozambique 0.0%
Nauru 0.0%
Tonga 0.0%
Cape Verde 0.0%
Guadeloupe 0.0%
Kazakhstan 0.0%
Suriname 0.0%
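
The two percentages in the tables above can be computed along the following lines. This is a sketch under stated assumptions, not the notebook's exact code: the input frame is a made-up stand-in for the final dataset, and "high quality" is taken to mean a prediction of FA or GA, as in the table titles.

```python
import pandas as pd

# Hypothetical stand-in for the final dataset (two countries, five articles).
final = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "article_quality": ["FA", "Stub", "GA", "C", "Stub"],
    "population": [1000, 1000, 1000, 500000, 500000],
})

# One row per country: article count, high-quality count, and population.
per_country = final.groupby("country").agg(
    articles=("article_quality", "size"),
    high_quality=("article_quality", lambda q: q.isin(["FA", "GA"]).sum()),
    population=("population", "first"),
)

# Articles as a percentage of population (tables 1 and 2).
per_country["article_population_percentage"] = (
    100 * per_country["articles"] / per_country["population"]
)

# GA and FA articles as a percentage of all articles (tables 3 and 4).
per_country["high_quality_percentage"] = (
    100 * per_country["high_quality"] / per_country["articles"]
)

# Ten highest and ten lowest countries by coverage.
top_coverage = per_country.sort_values(
    "article_population_percentage", ascending=False).head(10)
bottom_coverage = per_country.sort_values(
    "article_population_percentage", ascending=True).head(10)
```

Sorting on `high_quality_percentage` instead gives the rankings for tables 3 and 4.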
