JTubeSpeech: Corpus of Japanese speech collected from YouTube

This repository provides 1) a list of YouTube videos with Japanese subtitles and 2) scripts for making new lists of new languages.

Description

data/{lang}/{YYYYMM}.csv lists as follows. See step4 for download.

	videoid	auto	sub	channelid
0	0017RsBbUHk	True	True	UCTW2tw0Mhho72MojB1L48IQ
1	00PqfZgiboc	False	True	UCzoghTgl4dvIW9GZF6UC-BA
---	---	---	---	---

lang: Language ID (ja [Japanese], en [English], ...)
YYYYMM: Year and month when we collect data
videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
auto: The video has an automatic subtitle or not.
sub: The video has a manual (i.e., human-generated) subtitle or not.
channelid: YouTube Channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.

Statistics

	ja/202103.csv	{lang}/{YYYYMM}.csv
#videos-sub-true	110,000 (10,000 hours)	(TBA)
#videos-auto-true	4,960,000	(TBA)

Contributors

Shinnosuke Takamichi (The University of Tokyo, Japan) [main contributor]
Ludwig Kürzinger (Technical University of Munich, Germany)
Takaaki Saeki (The University of Tokyo, Japan)
Sayaka Shiota (Tokyo Metropolitan University, Japan)
Shinji Watanabe (Carnegie Mellon University, USA)

Scripts for data collection

scripts/*.py are scripts for data collection from YouTube. Since processes of the scripts are language independent, users can collect data of their favorite langauges. youtube-dl and ffmpeg are required.

step1: making search words

The script scripts/make_search_word.py downloads the wikipedia dump file and finds words for searching videos. {lang} is the languag code, e.g., ja (Japanese) and en (English).

$ python scripts/make_search_word.py {lang}

step2: obtaining video IDs

The script scripts/obtain_video_id.py obtains YouTube video IDs by searching by words. {filename_word_list} is a word list file made in step1. After this step, the process will take a long time. It is recommended to split the files (e.g., {filename_word_list}) and run them in parallel.

$ python scripts/obtain_video_id.py {lang} {filename_word_list}

step3: checking if subtitles are available

The script scripts/retrieve_subtitle_exists.py retrieves whether the video has subtitles or not. {filename_videoid_list} is a videoID list file made in step2. This process will make a CSV file.

$ python scripts/retrieve_subtitle_exists.py {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The script scripts/download_video.py downloads audio and manual subtitles. Note that, this process requires a very large amount of storage.{filename_subtitle_list} is a subtitle list file made in step3. The audio and subtitles will be saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python scripts/download_video.py {lang} {filename_subtitle_list}

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio and in some cases, subtitles not fit to the audio. The script scripts/align.py aligns subtitles and audio with CTC segmentation using an ESPnet 2 ASR model:

$ python scripts/align.py {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

The result is written into a segments file segments.txt and a log file segments.log in the output directory. Using the segments file, bad utterances or audio files can be sorted-out:

min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt

step5 (ASV): speaker variation scoring

There are three types of videos: text-to-speech (a.k.a., TTS) video, single-speaker (i.e., monologue) video, and multi-speaker (e.g., dialogue) video. The script scripts/xxx.py obtains scores of speaker variation within a video to classify videos into three types.

$ python scripts/xxx.py

Reference

coming soon

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
data		data
scripts		scripts
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JTubeSpeech: Corpus of Japanese speech collected from YouTube

Description

Statistics

Contributors

Scripts for data collection

step1: making search words

step2: obtaining video IDs

step3: checking if subtitles are available

step4: downloading videos with manual subtitles

step5 (ASR): alignment and scoring

step5 (ASV): speaker variation scoring

Reference

Link

About

Releases

Packages

Languages

License

supikiti/jtubespeech

Folders and files

Latest commit

History

Repository files navigation

JTubeSpeech: Corpus of Japanese speech collected from YouTube

Description

Statistics

Contributors

Scripts for data collection

step1: making search words

step2: obtaining video IDs

step3: checking if subtitles are available

step4: downloading videos with manual subtitles

step5 (ASR): alignment and scoring

step5 (ASV): speaker variation scoring

Reference

Link

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages