Tools for using Maluuba's news question and answer data.
You can find more information about the dataset here.
The combined dataset has several columns that hold the story text and the answers derived from several crowdsourcers.
Column Name | Description |
---|---|
story_id | The identifier for the story. Comes from the member name in the CNN stories package. |
story_text | The text for the story. |
question | A question about the story. |
answer_char_ranges | Character-based indices to answers in story_text. E.g. `196:228` |
is_answer_absent | Proportion of crowdsourcers that thought there was no answer to the question in the story. |
is_question_bad | Proportion of crowdsourcers that thought the question does not make sense. |
validated_answers | After crowdsourcing, we validated some answers when consensus was required. This shows how crowdsourcers voted during validation. E.g. {"none": 1, "294:297": 2} means that 1 crowdsourcer thought that none of the answers were good and 2 crowdsourcers thought that 294:297 was the best answer. |
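As a rough illustration of how the `validated_answers` column could be read, here is a minimal sketch that parses the JSON votes, picks the option with the most votes, and slices the matching text out of the story. The helper names `parse_char_range` and `best_validated_answer` and the sample story are hypothetical and not part of this repository. The sketch also assumes that the start index is inclusive and the end index is exclusive, like a Python slice.

```python
import json


def parse_char_range(char_range):
    """Turn a 'start:end' string such as '294:297' into a (start, end) tuple of ints."""
    start, end = char_range.split(":")
    return int(start), int(end)


def best_validated_answer(validated_answers, story_text):
    """Return (answer_text, votes) for the option with the most votes.

    answer_text is None when "none" received the most votes.
    """
    votes = json.loads(validated_answers)
    # Iterate over sorted keys so that ties are broken deterministically.
    best_key = max(sorted(votes), key=lambda key: votes[key])
    if best_key == "none":
        return None, votes[best_key]
    start, end = parse_char_range(best_key)
    # Assumption: start is inclusive and end is exclusive, like a Python slice.
    return story_text[start:end], votes[best_key]


# Hypothetical story text, made up for this example.
story = "The quick brown fox jumps over the lazy dog."
print(best_validated_answer('{"none": 1, "4:9": 2}', story))
```

The same `parse_char_range` helper could be applied to a single range in `answer_char_ranges`, under the same slicing assumption.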
The code in this repository complies with PEP8 standards with a maximum line length of 99 characters.
- Download the CNN stories from here to the maluuba/newsqa folder (for legal reasons, we can't automatically download these for you)
- Download the questions and answers from here to the maluuba/newsqa folder
- Extract the downloaded tar.gz contents into the maluuba/newsqa folder (we'll automate this step in the future)
- Use Python 2
- Run
pip install --requirement requirements.txt
- Run
python maluuba/newsqa/example.py --help
to see instructions
Run
python maluuba/newsqa/example.py
To split the dataset into train, dev, and test, run
python maluuba/newsqa/split_dataset.py
The file to check will be printed.
Notice: CNN articles are used here by permission from The Cable News Network (CNN). CNN does not waive any rights of ownership in its articles and materials. CNN is not a partner of, nor does it endorse, Maluuba or its activities.
Terms: See LICENSE.pdf.