
soundstorm-speechtokenizer

Introduction

Implementation of SoundStorm built upon SpeechTokenizer. We employ RVQ-1 of SpeechTokenizer as the semantic tokens described in the paper, using it as a condition to generate tokens for the subsequent RVQ layers.
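Concretely, each SpeechTokenizer frame carries a stack of RVQ codes: the first layer (RVQ-1) serves as the semantic condition, and the remaining layers are the acoustic targets SoundStorm generates. A minimal sketch of that split, using dummy nested lists rather than real tokenizer output:

```python
# Illustrative only: assumes 8 RVQ layers per frame, as in SpeechTokenizer.
num_frames = 4
num_quantizers = 8

# codes[t][q] is the codebook id for frame t, quantizer q (dummy values here)
codes = [[t * num_quantizers + q for q in range(num_quantizers)]
         for t in range(num_frames)]

semantic_tokens = [frame[0] for frame in codes]   # RVQ-1 -> condition
acoustic_tokens = [frame[1:] for frame in codes]  # RVQ-2..8 -> 7 target layers

print(len(semantic_tokens), len(acoustic_tokens[0]))  # 4 7
```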

This repository is a modification of lucidrains/soundstorm-pytorch. While the Conformer implementation remains intact from the original, I've rewritten the SoundStorm model and its training components.

Samples

We used two RTX 3090 GPUs to train a toy model on LibriSpeech-960. Zero-shot TTS samples are available on our demo page. Voice conversion samples and unprompted samples are provided in samples.

Objective Metrics

Zero-shot TTS

| Model | Speaker Similarity |
| --- | --- |
| VALL-E (ours) | 0.7593 |
| USLM | 0.8381 |
| USLM (SoundStorm) | 0.8827 |

Voice Conversion

| Model | Speaker Similarity |
| --- | --- |
| SoundStorm | 0.8985 |
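Speaker similarity in tables like these is typically computed as the cosine similarity between speaker-verification embeddings of the reference and the generated speech; the specific embedding model behind these numbers is not stated here, so treat this as a generic sketch of the metric, not the repo's evaluation code:

```python
def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

# identical embeddings give the maximum similarity of 1.0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```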

Release

  • [9/25] 🔥 We released a checkpoint trained on LibriSpeech.

Model storage

| Model | Dataset | Description |
| --- | --- | --- |
| soundstorm_speechtokenizer | LibriSpeech | conformer={'dim': 1024, 'depth': 12, 'heads': 8, 'dim_head': 128, 'attn_flash': False} |

Installation

soundstorm-speechtokenizer requires Python >= 3.8 and a reasonably recent version of PyTorch. To install soundstorm_speechtokenizer, clone and install from this repository:

```shell
git clone https://github.com/ZhangXInFD/soundstorm-speechtokenizer.git
cd soundstorm-speechtokenizer
pip install .
```

Usage

```python
import torch, torchaudio
from soundstorm_speechtokenizer import SoundStorm, ConformerWrapper
from speechtokenizer import SpeechTokenizer
from einops import rearrange

conformer = ConformerWrapper(codebook_size=1024,
                             num_quantizers=7,
                             conformer={'dim': 1024,
                                        'depth': 12,
                                        'heads': 8,
                                        'dim_head': 128,
                                        'attn_flash': False})

soundstorm = SoundStorm(net=conformer,
                        num_semantic_token_ids=1024,
                        semantic_pad_id=1024,
                        pad_id=1024,
                        schedule='cosine')

# get your pre-encoded codebook ids from SpeechTokenizer on a lot of raw audio

codes = torch.randint(0, 1024, (2, 1024, 7))  # (batch, seq, num RVQ)

# do the below in a loop for a ton of data

loss, acc, generated = soundstorm(codes)
loss.backward()
```
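The `schedule='cosine'` argument refers to the MaskGIT-style masking schedule SoundStorm inherits: at progress t in [0, 1], roughly cos(t·π/2) of the acoustic tokens remain masked, so early steps commit few tokens and later steps fill in the rest. A standalone sketch of that schedule (the function name is illustrative, not this repo's API):

```python
import math

def cosine_mask_fraction(t: float) -> float:
    """MaskGIT-style cosine schedule: fraction of tokens still masked at progress t."""
    return math.cos(t * math.pi / 2)

# masking decays from 1.0 (everything masked) to 0.0 (fully decoded)
for step in range(9):
    t = step / 8
    print(f"t={t:.3f}  masked={cosine_mask_fraction(t):.3f}")
```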

Train

We provide a trainer for SoundStorm that supports both audio input and token-sequence input. A training example is shown in train.py. Before training, you should generate text files that record the files used for training and validation. An example script for processing LibriSpeech-960 is provided in ls_preprocess.py.
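Assuming the file lists hold one audio path per line (an assumption based on the description above; see ls_preprocess.py for the script actually used), generating them amounts to walking the dataset root and splitting the paths. A hedged sketch:

```python
import os, random

def write_file_lists(root, train_path, valid_path, valid_ratio=0.01, ext=".flac"):
    """Collect audio files under root and split them into train/valid list files,
    one path per line. Illustrative only; not the repo's ls_preprocess.py."""
    files = []
    for dirpath, _, names in os.walk(root):
        files.extend(os.path.join(dirpath, n) for n in names if n.endswith(ext))
    random.shuffle(files)
    n_valid = max(1, int(len(files) * valid_ratio))
    with open(valid_path, "w") as f:
        f.write("\n".join(files[:n_valid]))
    with open(train_path, "w") as f:
        f.write("\n".join(files[n_valid:]))
```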