HWNi/data-512-a2

DATA512 - A2: Bias in data

About the Project

In this assignment, the author explores potential bias in English Wikipedia articles about politicians. Specifically, the analysis compares, across countries, how extensively politicians are covered on Wikipedia and the quality of the articles about them. Article quality estimates are collected from ORES, a machine learning API provided by Wikimedia.

Getting Started

You will need Python 3.x and Jupyter Notebook installed to reproduce this project by running hcds-a2-biasn.ipynb. To install Python 3, see the download page and the beginner's guide. To install Jupyter Notebook, follow the installation instructions.

Additionally, you will need the following packages:

Datasets used in this project

Wikipedia article data

The dataset can be found here.

Note: Please read through the documentation for this repository, then download and unzip it. You will need page_data.csv in the 'data' directory. Alternatively, there is a copy of this dataset in this GitHub repo, which can be found here.

page_data.csv contains the following columns:

  1. 'country', country that relates to the article
  2. 'page', title of the article
  3. 'rev_id', revision ID, which identifies the revision of the article
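As a minimal sketch of how these columns might be read, the snippet below parses a CSV with the three columns listed above using only the standard library. The function name `read_page_data` and the sample rows are hypothetical and made up for illustration; the notebook itself may load the file differently.

```python
import csv
import io


def read_page_data(handle):
    """Return a list of dicts with the keys 'country', 'page', and 'rev_id'."""
    reader = csv.DictReader(handle)
    expected = {"country", "page", "rev_id"}
    missing = expected - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {"country": row["country"], "page": row["page"], "rev_id": int(row["rev_id"])}
        for row in reader
    ]


# Made-up sample rows for illustration; not taken from the real dataset.
sample = io.StringIO(
    "page,country,rev_id\n"
    "Example Politician A,Exampleland,1001\n"
    "Example Politician B,Exampleland,1002\n"
)
rows = read_page_data(sample)
```

To read the real file, the same function could be called on `open("data/page_data.csv", newline="", encoding="utf-8")`, assuming the file sits in the 'data' directory as described in the note above.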

Population Mid 2015 data

The dataset can be found on the Population Reference Bureau website.

Note: Please look for the 'Microsoft Excel' icon in the upper right and download this data as a CSV file.

Population Mid-2015.csv contains the following columns:

  1. 'Location', name of country
  2. 'Location Type', type of location; all rows are listed as country
  3. 'TimeFrame', the time when the data were collected
  4. 'Data Type', data type of the population value; all rows are listed as number
  5. 'Data', population
  6. 'Footnotes', applicable footnotes, all blank in this dataset
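The sketch below shows one way to turn the 'Location' and 'Data' columns into a country-to-population lookup. It assumes the 'Data' values may contain thousands separators (for example "1,234,000") and that the header row is the first line of the file; both are assumptions to verify against the downloaded CSV. The function name `read_population` and the sample rows are hypothetical.

```python
import csv
import io


def read_population(handle):
    """Map each 'Location' to its population parsed from the 'Data' column."""
    population = {}
    for row in csv.DictReader(handle):
        # Assumption: values may contain thousands separators such as "1,234,000".
        population[row["Location"]] = int(row["Data"].replace(",", ""))
    return population


# Made-up sample rows for illustration; not real population figures.
sample = io.StringIO(
    "Location,Location Type,TimeFrame,Data Type,Data,Footnotes\n"
    'Exampleland,Country,Mid-2015,Number,"1,234,000",\n'
    "Samplestan,Country,Mid-2015,Number,567000,\n"
)
population = read_population(sample)
```

A lookup keyed by country name makes it straightforward to join against the 'country' column of page_data.csv, provided the two datasets spell country names the same way.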

Using Wikimedia ORES API

In this project, we use a Wikimedia API called ORES ("Objective Revision Evaluation Service"; see the documentation). ORES estimates the quality of an article at a particular point in time and assigns a series of probabilities that the article belongs to each of 6 quality categories. The categories are, from best to worst:

  1. FA - Featured article
  2. GA - Good article
  3. B - B-class article
  4. C - C-class article
  5. Start - Start-class article
  6. Stub - Stub-class article
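The following sketch illustrates how a request for a batch of revision IDs might be built and how the predicted category might be pulled out of a response. The endpoint URL, the model name `wp10`, and the nested response layout are assumptions based on the ORES v3 API and are not confirmed by this README; the service may have changed, so check the ORES documentation before relying on them. The helper names and the sample response are hypothetical, and no network call is made here.

```python
from urllib.parse import urlencode

# Assumption: ORES v3 scoring endpoint for English Wikipedia.
ORES_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/"
QUALITY_ORDER = ["FA", "GA", "B", "C", "Start", "Stub"]  # best to worst


def build_ores_url(rev_ids, model="wp10"):
    """Build a scoring URL for several revision IDs joined with '|'."""
    query = urlencode({"models": model, "revids": "|".join(str(r) for r in rev_ids)})
    return f"{ORES_ENDPOINT}?{query}"


def extract_predictions(response, model="wp10", wiki="enwiki"):
    """Map each revision ID to its predicted category, or None if unscored."""
    predictions = {}
    for rev_id, models in response[wiki]["scores"].items():
        score = models[model].get("score")
        # Assumption: unscored revisions carry an "error" entry instead of "score".
        predictions[int(rev_id)] = score["prediction"] if score else None
    return predictions


# Made-up response in the assumed layout; the probabilities are illustrative.
sample_response = {
    "enwiki": {
        "scores": {
            "1001": {"wp10": {"score": {
                "prediction": "Stub",
                "probability": {"FA": 0.01, "GA": 0.02, "B": 0.03,
                                "C": 0.04, "Start": 0.1, "Stub": 0.8},
            }}},
            "1002": {"wp10": {"error": {"message": "illustrative error"}}},
        }
    }
}
url = build_ores_url([1001, 1002])
predictions = extract_predictions(sample_response)
```

Keeping URL construction and response parsing separate from the HTTP call lets both be checked offline; the actual request could then be made with any HTTP client.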

License

  • The Wikipedia article data, along with the code used to generate that data, are released under the CC-BY-SA 4.0 license.

  • See About PRB for the license and copyright information for the population data from the Population Reference Bureau.

  • By using Wikimedia ORES API, you agree to Wikimedia's Terms of Use and Privacy Policy.

This project is licensed under the MIT License. See the LICENSE.md file for details.
