
labeling-challenge

About

The goal of this challenge is to get experience doing quality control for labeled data.

This data and project originally come from a UPenn crowdsourcing class: https://crowdsourcing-class.org/. We highly recommend checking it out to learn more about crowdsourcing and data labeling!

Getting started

0. Set up a pyenv virtual environment (recommended)

Note: you do not have to use pyenv, but you do need Python 3.6.

Follow this tutorial to install pyenv and learn about it:

https://amaral.northwestern.edu/resources/guides/pyenv-tutorial

Then create a virtual env for this project:

pyenv virtualenv 3.6.5 labeling-challenge
pyenv activate 3.6.5/envs/labeling-challenge

1. Install requirements

Run the following:

pip install -r requirements.txt

2. Get familiar with the data

This data is from a real mTurk project for Adjectives and Attribute Matching.

First, take a close look at the instructions that were provided to the labelers:

Instructions, part 1 and part 2 (images)

Now inspect the raw data file, raw_data.csv.

A few things to note:

  • The WorkerId column gives a unique id for each worker
  • Lifetime approval rate: the percentage of this worker's submissions, across all of their mTurk tasks, that requesters have approved
  • Input.attr_id is the unique id for the attribute; the Input.adj_* columns hold the adjectives and the Answer.adj_* columns hold the labeler's answers (see the loading sketch below)
  • Answers of 'No' and 'not an adj' are both recorded as no in the dataset
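
If you prefer to poke at the file in code, here is a minimal sketch using pandas (assuming pandas is available in your environment, that each row is one worker's submission for one attribute, and using only the column names listed above):

```python
import pandas as pd

# Load the raw mTurk output (assumes raw_data.csv is in the repo root).
df = pd.read_csv("raw_data.csv")

# Assumed layout: each row is one worker's submission for one attribute.
print(df["WorkerId"].nunique(), "workers,", df["Input.attr_id"].nunique(), "attributes")

# The adjective prompts and the workers' answers share a numeric suffix.
adj_cols = [c for c in df.columns if c.startswith("Input.adj_")]
ans_cols = [c for c in df.columns if c.startswith("Answer.adj_")]
print(len(adj_cols), "adjective columns,", len(ans_cols), "answer columns")
```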

3. Run the starter code

Open summarize_labels.py and take a look.

Then try to run it:

python summarize_labels.py

And inspect the answers it produces by opening summarized_data.csv.

How good are the labels? Run

python evaluate_results.py
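
The starter code's actual logic lives in summarize_labels.py; as a hypothetical illustration of the kind of simple aggregation a summarizer might do, here is a sketch of per-attribute majority voting (assuming 'yes'/'no' string answers and the column names above; this is not the repo's implementation):

```python
import pandas as pd

def summarize_by_majority(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Aggregate answers per attribute with a simple majority vote (hypothetical sketch)."""
    ans_cols = [c for c in df.columns if c.startswith("Answer.adj_")]
    # Turn 'yes'/'no' answers into 1/0 so they can be averaged.
    votes = df[ans_cols].apply(lambda col: (col.str.lower() == "yes").astype(int))
    votes["Input.attr_id"] = df["Input.attr_id"]
    # Fraction of workers that answered 'yes' for each attribute/adjective pair.
    yes_rate = votes.groupby("Input.attr_id").mean()
    # Final label: 'yes' whenever more than `threshold` of the workers agreed.
    return yes_rate.gt(threshold).apply(lambda col: col.map({True: "yes", False: "no"}))

if __name__ == "__main__":
    raw = pd.read_csv("raw_data.csv")
    print(summarize_by_majority(raw).head())
```

A fixed 50% cutoff like this weights every worker's vote equally, which is exactly what the next step asks you to improve on.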

4. Can you do better?

Write your own label summarization algorithm in summarize_labels.py.

Want a hint?
Think about how you can assess whether certain labelers are reliable or not.

You could also think about whether 50% is the right threshold to use.

Want another hint?
Columns Input.adj_11 through Input.adj_16 have known ground truth: 11-15 are True and 16 is False. How can you use this to evaluate the labelers?
One last hint!
Come up with a "reliability score" for each labeler by assessing their performance on columns 11-16. Then predict each label from an average of answers weighted by reliability, rather than a simple average. You can also consider dropping unreliable labelers altogether.
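
Putting the hints together, here is a hypothetical sketch of a reliability-weighted summarizer (again assuming 'yes'/'no' string answers and the column names described above; the scoring scheme and thresholds are illustrative, not the intended solution):

```python
import pandas as pd

# Ground-truth columns from the hint: adj_11 through adj_15 are True, adj_16 is False.
GOLD = {f"adj_{i}": "yes" for i in range(11, 16)}
GOLD["adj_16"] = "no"

def worker_reliability(df: pd.DataFrame) -> pd.Series:
    """Score each worker by their accuracy on the known-ground-truth columns."""
    correct = pd.DataFrame({
        adj: (df[f"Answer.{adj}"].str.lower() == truth).astype(int)
        for adj, truth in GOLD.items()
    })
    correct["WorkerId"] = df["WorkerId"]
    # Average accuracy per worker across all gold columns and all of their rows.
    return correct.groupby("WorkerId").mean().mean(axis=1)

def summarize_weighted(df: pd.DataFrame, min_reliability: float = 0.5,
                       threshold: float = 0.5) -> pd.DataFrame:
    """Weighted vote: each worker's 'yes' counts in proportion to their reliability."""
    rel = worker_reliability(df)
    # Drop clearly unreliable workers, then weight the rest by their score.
    df = df[df["WorkerId"].map(rel) >= min_reliability].copy()
    weights = df["WorkerId"].map(rel)
    ans_cols = [c for c in df.columns if c.startswith("Answer.adj_")]
    votes = df[ans_cols].apply(lambda col: (col.str.lower() == "yes").astype(float))
    weighted_yes = votes.mul(weights, axis=0)
    weighted_yes["Input.attr_id"] = df["Input.attr_id"]
    # Reliability-weighted fraction of 'yes' votes per attribute/adjective pair.
    frac_yes = weighted_yes.groupby("Input.attr_id").sum().div(
        weights.groupby(df["Input.attr_id"]).sum(), axis=0)
    return frac_yes.gt(threshold).apply(lambda col: col.map({True: "yes", False: "no"}))
```

Weighting by accuracy on the gold columns and dropping low scorers is only one possible scheme; you could also fold in the lifetime approval rate or tune the decision threshold.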
