Data 512 A2: Bias in data

Project Goal

The goal of this assignment is to explore the concept of 'bias' through data on Wikipedia articles, specifically articles on political figures from a variety of countries. First, we use a machine learning service called ORES (Objective Revision Evaluation Service) to estimate the quality of each article in the Wikipedia dataset. Then, we combine the dataset of Wikipedia articles with a dataset of country populations by matching the country names. After combining these two datasets, I am able to analyze how the coverage of politicians on Wikipedia and the quality of articles about politicians vary between countries, identifying:

the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
the countries with the highest and lowest proportion of high quality articles about politicians.

Tool

I use Jupyter Notebook to write Python code that accesses, manipulates, and plots the data.

You can install Python and Jupyter Notebook by downloading the Python 3.6 distribution from Anaconda.

The following Python packages are used throughout this project:

requests
json
csv
pandas

License of The Source Data

The wikipedia dataset (Politicians by Country from the English-language Wikipedia)

Terms & Conditions

The Population dataset

ORES

By using the ORES (Objective Revision Evaluation Service) API, you agree to Wikimedia's Terms of Use and Privacy Policy.

Terms of Use (Wikimedia Foundation terms of use)
Privacy Policy (Privacy policy)

See https://ores.wikimedia.org for more information on how to use the API.

Also under CC-BY-SA 4.0.

Wikipedia Dataset

The Wikipedia dataset can be found on Figshare. The data was extracted via the Wikimedia API using the associated code. It is formatted as a CSV and saved as page_data.csv in the "data" directory. The columns are:

Column	Description
country	'containing the sanitised country name, extracted from the category name'
page	'containing the unsanitised page title'
last_edit	'containing the edit ID of the last edit to the page'

You can also download the dataset directly by clicking DOWNLOAD

Getting Article Quality Predictions

Next, we need to get the predicted quality score for each article in the Wikipedia dataset above. We use a Wikimedia API endpoint for a machine learning system called ORES (Objective Revision Evaluation Service). ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in each of 6 quality categories. The categories are, from best to worst:

Category	Description
FA	Featured article
GA	Good article
B	B-class article
C	C-class article
Start	Start-class article
Stub	Stub-class article

The documentation can be found here.

When you query the API, you will notice that ORES returns a prediction value that contains the name of one category, as well as probability values for each of the 6 quality categories above. For this project, I only capture and use the value for prediction.

Note: ORES could not return a prediction value for four articles in the Wikipedia data.
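
As a rough illustration of capturing only the prediction value, the sketch below builds a scoring URL and pulls the prediction out of a response. The endpoint pattern, the `wp10` model name, and the nested response layout are assumptions based on the ORES v3 scoring API and should be checked against the ORES documentation; the function names are hypothetical and this is not necessarily how the notebook does it.

```python
from urllib.parse import urlencode

# Assumed endpoint pattern; verify against the ORES documentation before use.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/{project}/"


def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Build a scoring URL for a batch of revision IDs (joined with '|')."""
    query = urlencode({
        "models": model,
        "revids": "|".join(str(rev_id) for rev_id in revision_ids),
    })
    return ORES_ENDPOINT.format(project=project) + "?" + query


def extract_predictions(response_json, project="enwiki", model="wp10"):
    """Map each revision ID to its predicted class, or None if no score came back."""
    scores = response_json.get(project, {}).get("scores", {})
    predictions = {}
    for rev_id, models in scores.items():
        # Assumed layout: a scored revision has a "score" entry; a failed one does not.
        score = models.get(model, {}).get("score")
        predictions[rev_id] = score["prediction"] if score else None
    return predictions


def fetch_predictions(revision_ids, project="enwiki", model="wp10"):
    """Query ORES over HTTP (needs network access and the requests package)."""
    import requests  # imported here so the helpers above work without it

    response = requests.get(build_ores_url(revision_ids, project, model), timeout=30)
    response.raise_for_status()
    return extract_predictions(response.json(), project, model)
```

Revisions mapped to `None` correspond to articles for which no prediction could be retrieved, such as the four noted above.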

Population data

The population data is on the Population Reference Bureau website. Download this data as a CSV file (hint: look for the 'Microsoft Excel' icon in the upper right).

It is formatted as a CSV and saved as Population Mid-2015.csv. Columns are:

Column	Description
Location	'country name'
Location Type	'country'
TimeFrame	'Mid-2015'
Data Type	'string'
Data	'population'
Footnotes	''

You can also download the dataset directly by clicking DOWNLOAD

Final Data File

After retrieving the ORES (Objective Revision Evaluation Service) prediction for each article, I merge the Wikipedia data and the population data for further analysis. Both datasets have fields containing country names for just that purpose. After merging, I drop the observations that cannot be matched (either the population dataset does not have an entry for the equivalent Wikipedia country, or vice versa). The variables of the final dataset are shown below:

Column	Description
country	'country name'
article_name	'article name'
revision_id	'last edit id'
article_quality	'article quality prediction by ORES'
population	'population of the country'

Note: revision_id here is the same as last_edit in the Wikipedia dataset, which we used to get article predictions from the ORES API.
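
The merge described in this section amounts to an inner join on the country name. A minimal sketch follows, written with plain dictionaries so it has no dependencies; the notebook may well use a pandas merge instead. The function name is hypothetical, the input rows are assumed to be dictionaries keyed by the column names listed above, and the handling of thousands separators in the Data column is an assumption about how the population CSV is formatted.

```python
def merge_articles_with_population(articles, populations):
    """Inner-join article rows with population rows on the country name.

    articles: dicts with keys country, page, last_edit, article_quality
    populations: dicts with keys Location, Data
    """
    # Assumption: Data may be a string with thousands separators, e.g. "1,000".
    population_by_country = {
        row["Location"]: int(str(row["Data"]).replace(",", ""))
        for row in populations
    }

    merged = []
    for article in articles:
        population = population_by_country.get(article["country"])
        if population is None:
            continue  # no matching population entry, so drop the observation
        merged.append({
            "country": article["country"],
            "article_name": article["page"],
            "revision_id": article["last_edit"],
            "article_quality": article["article_quality"],
            "population": population,
        })
    return merged
```

Countries that appear only in the population data never produce a row, so unmatched observations are dropped in both directions.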

Table

From the final dataset, I produced four tables covering the two analyses below:

the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
the countries with the highest and lowest proportion of high quality articles about politicians.

Tables 1 and 2 address the first analysis; tables 3 and 4 address the second.
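
The two percentages behind these tables can be computed along the following lines. This is a sketch only: the function names are hypothetical, the input is assumed to be a list of rows shaped like the final dataset, and "high quality" is taken to mean a GA or FA prediction, as in the table titles.

```python
from collections import Counter

HIGH_QUALITY = {"FA", "GA"}


def coverage_and_quality(rows):
    """Return per-country article coverage and high-quality share, both in percent."""
    article_counts = Counter()
    high_quality_counts = Counter()
    populations = {}
    for row in rows:
        country = row["country"]
        article_counts[country] += 1
        populations[country] = row["population"]
        if row["article_quality"] in HIGH_QUALITY:
            high_quality_counts[country] += 1

    return {
        country: {
            # articles about politicians as a percentage of country population
            "article_population_percentage": 100 * count / populations[country],
            # GA and FA articles as a percentage of all articles for the country
            "high_quality_percentage": 100 * high_quality_counts[country] / count,
        }
        for country, count in article_counts.items()
    }


def top_n(stats, column, n=10, highest=True):
    """List the n countries ranked highest (or lowest) on the given column."""
    return sorted(stats, key=lambda c: stats[c][column], reverse=highest)[:n]
```

Calling `top_n(stats, "article_population_percentage")` would give the ranking for the first table, and passing `highest=False` gives the lowest-ranked countries.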

10 highest-ranked countries in terms of number of politician articles as a proportion of country population

country	article_population_percentage(%)
Nauru	0.488029%
Tuvalu	0.466102%
San Marino	0.248485%
Monaco	0.105020%
Liechtenstein	0.077189%
Marshall Islands	0.067273%
Iceland	0.062268%
Tonga	0.060987%
Andorra	0.043590%
Federated States of Micronesia	0.036893%

10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

country	article_population_percentage(%)
India	0.000075%
China	0.000083%
Indonesia	0.000084%
Uzbekistan	0.000093%
Ethiopia	0.000107%
Korea, North	0.000156%
Zambia	0.000168%
Thailand	0.000172%
Congo, Dem. Rep. of	0.000194%
Bangladesh	0.000202%

10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

country	high_quality_percentage(%)
Korea, North	23.076923%
Saudi Arabia	11.764706%
Uzbekistan	10.344828%
Central African Republic	10.294118%
Romania	9.770115%
Guinea-Bissau	9.523810%
Bhutan	9.090909%
Vietnam	8.376963%
Dominica	8.333333%
Mauritania	7.692308%

10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

country	high_quality_percentage(%)
Turkmenistan	0.0%
Tajikistan	0.0%
Monaco	0.0%
Mozambique	0.0%
Nauru	0.0%
Tonga	0.0%
Cape Verde	0.0%
Guadeloupe	0.0%
Kazakhstan	0.0%
Suriname	0.0%
