Published October 5, 2023 | Version v1

Pretrained models and data scripts for Partial Rank Similarity MOS Prediction

Affiliations: IIIT Delhi; NII; Microsoft

Description

This repository contains pretrained models and scripts/instructions for obtaining data for our paper accepted to ASRU 2023:
"Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-shot and Semi-supervised Setting," by Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, and Rajiv Ratn Shah.
Please cite this paper if you use any of these pretrained models.

These pretrained models go with the code found here:
https://github.com/nii-yamagishilab/partial_rank_similarity

See that codebase's README for more information about usage.

These models were finetuned starting from a self-supervised checkpoint released by the Fairseq project:
https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec
(Wav2Vec 2.0 Base, no finetuning)
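
For reference, below is a minimal sketch of loading that checkpoint and extracting SSL features. This is generic fairseq usage, not this project's training code; wav2vec_small.pt is the file name of the Base (no finetuning) checkpoint in the Fairseq release.

    import torch
    from fairseq import checkpoint_utils

    # Load the wav2vec 2.0 Base checkpoint released by the Fairseq project.
    models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
        ["wav2vec_small.pt"]
    )
    model = models[0]
    model.eval()

    # Extract frame-level features from one second of 16 kHz audio.
    wav = torch.randn(1, 16000)
    with torch.no_grad():
        feats = model(wav, mask=False, features_only=True)["x"]
    print(feats.shape)  # (batch, frames, 768) for the Base model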

The pretrained models were finetuned in the first stage using the BVCC dataset:
"How do Voices from Past Speech Synthesis Challenges Compare Today?"
Erica Cooper and Junichi Yamagishi, SSW 2021.
https://www.isca-speech.org/archive/ssw_2021/cooper21_ssw.html
https://doi.org/10.5281/zenodo.6572573

COPYING

Please see LICENSE-wav2vec2.txt for the terms and conditions of the pretrained models.

DATA

The terms of use for most of the BVCC dataset audio samples do not permit redistribution.  Therefore, please see the instructions in README.txt for how to obtain all of the data samples.

This project also uses data from ASVspoof 2019, which has been redistributed in a derivative form for use with this project here: https://zenodo.org/record/8412617
Please see the instructions in the README for more information about how to combine that data with this repository.
"ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech"
Xin Wang, Junichi Yamagishi, Massimiliano Todisco, Héctor Delgado, Andreas Nautsch, Nicholas Evans, Md Sahidullah, Ville Vestman, Tomi Kinnunen, Kong Aik Lee, Lauri Juvela, Paavo Alku, Yu-Huai Peng, Hsin-Te Hwang, Yu Tsao, Hsin-Min Wang, Sébastien Le Maguer, Markus Becker, Fergus Henderson, Rob Clark, Yu Zhang, Quan Wang, Ye Jia, Kai Onuma, Koji Mushika, Takashi Kaneda, Yuan Jiang, Li-Juan Liu, Yi-Chiao Wu, Wen-Chin Huang, Tomoki Toda, Kou Tanaka, Hirokazu Kameoka, Ingmar Steiner, Driss Matrouf, Jean-François Bonastre, Avashna Govender, Srikanth Ronanki, Jing-Xuan Zhang, Zhen-Hua Ling.
Computer Speech & Language, vol. 64, 2020.
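
As a convenience, the files attached to that Zenodo record can be enumerated through Zenodo's public REST API before downloading. The sketch below is generic Zenodo API usage, not a script from this repository, and assumes the record serialization exposes key/size/checksum fields:

    import json
    import urllib.request

    # Query Zenodo's REST API for record 8412617 (the derivative
    # ASVspoof 2019 data linked above) and list its attached files.
    with urllib.request.urlopen("https://zenodo.org/api/records/8412617") as resp:
        record = json.load(resp)
    for f in record["files"]:
        print(f["key"], f["size"], f["checksum"])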

ACKNOWLEDGMENTS

This study is supported by JST CREST Grant Number JPMJCR18A6 and by MEXT KAKENHI Grant Number 21K11951.  RR Shah is partly supported by the CAI and CDNM at IIIT Delhi, India.  Hemant Yadav is supported by the Microsoft Research India PhD Fellowship program.  We thank the organizers of the Blizzard Challenge, the Voice Conversion Challenge, ESPnet-TTS, and ASVspoof 2019 for making their audio samples and listening test data available for research.

Files (1.6 GB)

LICENSE-wav2vec2.txt

Size     Checksum
1.6 GB   md5:90fdcae29239c9364071f0eb7cdb27b2
1.5 kB   md5:0805c95399118cccaea727606d7df2ab
2.4 kB   md5:c8681f4616b98dbab833824f6fb8a2c8
3.1 kB   md5:f6b1ee02e26fff14597b297d8f9271b2