forked from Maluuba/newsqa

Tools for accessing Maluuba's NewsQA Dataset (public version)

HosikCho/newsqa

 
 

Maluuba NewsQA

Tools for using Maluuba's news question and answer data.

You can find more information about the dataset here.

Data Description

The combined dataset has the following columns, which hold the story text and the answers derived from several crowdsourcers.

| Column Name | Description |
| --- | --- |
| story_id | The identifier for the story. Comes from the member name in the CNN stories package. |
| story_text | The text for the story. |
| question | A question about the story. |
| answer_char_ranges | Character-based indices to answers in story_text. E.g. 196:228 |
| is_answer_absent | Proportion of crowdsourcers that thought there was no answer to the question in the story. |
| is_question_bad | Proportion of crowdsourcers that thought the question does not make sense. |
| validated_answers | After crowdsourcing, we validated some answers when consensus was required. This shows how crowdsourcers voted during validation. E.g. {"none": 1, "294:297": 2} means that 1 crowdsourcer thought that none of the answers were good and 2 crowdsourcers thought that 294:297 was the best answer. |
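As a rough illustration of how the validated_answers column could be consumed, the sketch below parses the vote mapping and returns the story text for the range with the most votes. The helper names are hypothetical and not part of this repository. It assumes the field holds a JSON string like the example above, and that the end index of a range is exclusive (Python slice semantics), which is not confirmed here. Ties are broken arbitrarily by `max`.

```python
import json


def parse_char_range(char_range):
    """Split a "start:end" string such as "294:297" into two integers."""
    start, end = char_range.split(":")
    return int(start), int(end)


def best_validated_answer(story_text, validated_answers):
    """Return the text of the most-voted range, or None.

    None is returned when the field is empty or when "none" got the most votes.
    """
    if not validated_answers:
        return None
    votes = json.loads(validated_answers)
    winner = max(votes, key=votes.get)
    if winner == "none":
        return None
    start, end = parse_char_range(winner)
    # Assumption: the end index is exclusive, as in a Python slice.
    return story_text[start:end]


# Made-up story text, padded so that characters 294..296 spell "CNN".
example_story = "a" * 294 + "CNN" + " reports"
example_votes = '{"none": 1, "294:297": 2}'
print(best_validated_answer(example_story, example_votes))
```

If the real ranges turn out to be inclusive, only the slice in `best_validated_answer` would need to change.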

PEP8

The code in this repository complies with PEP8 standards with a maximum line length of 99 characters.

Requirements

  • Download the CNN stories from here to the maluuba/newsqa folder (for legal reasons, we can't automatically download these for you)
  • Download the questions and answers from here to the maluuba/newsqa folder
  • Extract the downloaded tar.gz contents into the maluuba/newsqa folder (we'll automate this step in the future)
  • Use Python 2
  • Run pip install --requirement requirements.txt
  • Run python maluuba/newsqa/example.py --help to see instructions

Package the Dataset

Run

python maluuba/newsqa/example.py
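Once the dataset is packaged, the proportion columns from the Data Description can be used to drop questionable rows. The following is only a sketch under assumptions: it supposes the packaged output is a CSV file whose header uses the column names listed above, and the 0.5 thresholds and function names are hypothetical choices, not something this repository prescribes. Rows whose proportions cannot be parsed as numbers are skipped.

```python
import csv
import io


def to_proportion(value):
    """Convert a cell to a float, or return None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def usable_rows(rows, max_bad=0.5, max_absent=0.5):
    """Yield rows where both proportions are below the (hypothetical) thresholds."""
    for row in rows:
        bad = to_proportion(row.get("is_question_bad"))
        absent = to_proportion(row.get("is_answer_absent"))
        if bad is None or absent is None:
            continue
        if bad < max_bad and absent < max_absent:
            yield row


def read_usable_rows(csv_file, **thresholds):
    """Read an open CSV file object and return the rows that pass the filter."""
    return list(usable_rows(csv.DictReader(csv_file), **thresholds))


# In-memory stand-in for the packaged file; the rows are made up.
sample = io.StringIO(
    "story_id,question,is_answer_absent,is_question_bad\n"
    "a,Q1,0.0,0.0\n"
    "b,Q2,1.0,0.0\n"
    "c,Q3,0.0,0.67\n"
)
for row in read_usable_rows(sample):
    print(row["story_id"], row["question"])
```

With a real file, `read_usable_rows` would be called on the object returned by `open(...)` for whatever path the packaging script reports.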

Split the Dataset

To split the dataset into train, dev, and test sets, run

python maluuba/newsqa/split_dataset.py

The file to check will be printed.

Legal

Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Maluuba or its activities.

Terms: See LICENSE.pdf.
