Toloka/CrowdSpeech

About

This repository provides the data and code for the paper "CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription".

The collected transcriptions are stored in data/*-crowd.tsv, and the ground-truth transcriptions are stored in data/*-gt.txt. We also provide code for the annotation process and for speech synthesis in the annotation and speech_sythesis folders, respectively.

Citation

@inproceedings{CrowdSpeech,
  author    = {Pavlichenko, Nikita and Stelmakh, Ivan and Ustalov, Dmitry},
  title     = {{CrowdSpeech and Vox~DIY: Benchmark Dataset for Crowdsourced Audio Transcription}},
  year      = {2021},
  booktitle = {Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks},
  eprint    = {2107.01091},
  eprinttype = {arxiv},
  eprintclass = {cs.SD},
  url       = {https://openreview.net/forum?id=3_hgF1NAXU7},
  language  = {english},
  pubstate  = {forthcoming},
}

Data

The CrowdSpeech and VoxDIY datasets are stored in the data folder. Each dataset is associated with two files: <dataset>-<split>-crowd.tsv and <dataset>-<split>-gt.txt. The first file contains three columns: INPUT:audio (the audio file given to crowd workers), OUTPUT:transcription (the worker's transcription), and ASSIGNMENT:worker_id (a unique worker identifier). The second file contains two tab-separated columns without a header: the audio file and its ground-truth transcription.
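The file layout described above can be read with the Python standard library alone. The following is a minimal sketch, not code from this repository: the function names load_crowd and load_ground_truth are hypothetical, and it assumes that the crowd file is tab-separated with a header row holding the three column names and that fields are not quoted.

```python
import csv
import io
from collections import defaultdict


def load_crowd(lines):
    """Group (worker_id, transcription) pairs by audio file from a *-crowd.tsv stream."""
    # Assumption: tab-separated, header row present, no quoting.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    by_audio = defaultdict(list)
    for row in reader:
        by_audio[row["INPUT:audio"]].append(
            (row["ASSIGNMENT:worker_id"], row["OUTPUT:transcription"])
        )
    return dict(by_audio)


def load_ground_truth(lines):
    """Map audio file -> ground-truth transcription from a header-less *-gt.txt stream."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Tiny in-memory stand-ins for the two files (made-up rows, for illustration only).
crowd = load_crowd(io.StringIO(
    "INPUT:audio\tOUTPUT:transcription\tASSIGNMENT:worker_id\n"
    "a.mp3\thello world\tw1\n"
    "a.mp3\thello word\tw2\n"
    "b.mp3\tgood day\tw1\n"
))
truth = load_ground_truth(io.StringIO("a.mp3\thello world\nb.mp3\tgood day\n"))
```

With real data, pass a file object opened with encoding="utf-8" and newline="" instead of the io.StringIO stand-ins.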

You can also download the CrowdSpeech dataset from HuggingFace.

Evaluation

First, you may need to install some dependencies:

pip3 install crowd-kit toloka-kit jiwer

Then, you can evaluate all our baseline aggregation methods with a single command:

python3 baselines.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv

To get the Oracle result, run:

python3 oracle.py data/<dataset>-gt.txt data/<dataset>-crowd.tsv
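As a rough illustration of what the two commands above measure, the sketch below scores a naive aggregation and an oracle-style selection by word error rate (WER). It is not the code in baselines.py or oracle.py. It assumes that the Oracle picks, for each recording, the worker transcription closest to the ground truth, it uses a simple majority vote in place of the actual baselines, and it computes WER at the corpus level, which may differ from how the scripts average. All function names and the sample data are hypothetical.

```python
from collections import Counter


def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcriptions."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by the total number of reference words."""
    errors = sum(word_errors(r, h) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words


def majority(candidates):
    """Naive baseline: the most frequent transcription (ties go to the first seen)."""
    return Counter(candidates).most_common(1)[0][0]


def oracle(candidates, reference):
    """Assumed Oracle: the candidate with the fewest word errors against the ground truth."""
    return min(candidates, key=lambda c: word_errors(reference, c))


# Made-up example data: audio id -> worker transcriptions / ground truth.
crowd = {
    "a": ["the cat sat", "the cat sat", "the cat sad"],
    "b": ["a dog bark", "a dog bark", "a dog barked"],
}
truth = {"a": "the cat sat", "b": "a dog barked"}

ids = sorted(truth)
refs = [truth[k] for k in ids]
majority_wer = corpus_wer(refs, [majority(crowd[k]) for k in ids])
oracle_wer = corpus_wer(refs, [oracle(crowd[k], truth[k]) for k in ids])
```

Because the oracle sees the ground truth, its WER is a lower bound for any method that selects one of the workers' transcriptions unchanged.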

You can also compute the Inter-Rater Agreement by running:

python3 agreement.py data/<dataset>-crowd.tsv

VoxDIY

You can find an IPython notebook with the code for the VoxDIY data collection process. For quality control, we use a special class, TaskProcessor, that fetches all submissions that have not yet been accepted or rejected, calculates the workers' skills, and decides whether each submission should be accepted or rejected.

T5 Model

Our data is also available on the HuggingFace Hub, along with a T5 model trained on the train-clean, dev-clean, and dev-other parts of CrowdSpeech.

The following snippet shows an example of model inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the aggregation model from the Hub.
mname = "toloka/t5-large-for-text-aggregation"
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForSeq2SeqLM.from_pretrained(mname)

# Candidate transcriptions of one recording, separated by " | ".
text = "samplee text | sampl text | sample textt"
input_ids = tokenizer.encode(text, return_tensors="pt")
outputs = model.generate(input_ids)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(decoded)  # sample text
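The inference example above passes the candidate transcriptions as one string joined by " | ". A small helper along the following lines could build that string from a list of worker transcriptions. This is a sketch: the name build_model_input is hypothetical, the separator is inferred from the example rather than documented, and dropping empty transcriptions is an assumption of this sketch.

```python
def build_model_input(transcriptions, separator=" | "):
    """Join the non-empty transcriptions of one recording into a single input string."""
    # Assumption: blank or whitespace-only transcriptions carry no signal and are skipped.
    cleaned = [t.strip() for t in transcriptions if t and t.strip()]
    return separator.join(cleaned)


text = build_model_input(["samplee text", "sampl text", "sample textt"])
print(text)  # samplee text | sampl text | sample textt
```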

License

Code

© YANDEX LLC, 2021. Licensed under the Apache License, Version 2.0. See LICENSE file for more details.

Data

© YANDEX LLC, 2021. Licensed under the Creative Commons Attribution 4.0 license. See data/LICENSE file for more details.

Acknowledgements

The LibriSpeech dataset is used under the Creative Commons Attribution 4.0 license.

The CrowdWSA2019 dataset is used under the Creative Commons Attribution 4.0 license.
