PhoneLM

About

UPDATE (04/09/2023): Model overfitting on single and multiple samples works. Generalisation seems to be harder, probably because the original MegaByte model from lucidrains is trained to predict the entire sequence from token 1 to token n-1. The training method needs to be changed to only predict the response and not the prompt; otherwise the model has to pointlessly learn to predict the prompt as well.

UPDATE: Generalisation training seems somewhat promising. The model consistently outputs the correct number of audio tokens and handles the temporal context reasonably well. However, the main issue seems to be with the "spatial" component of predicting the sequence, i.e., predicting the correct codebook codes per timestep.

Text to speech using phonemes as inputs and audio codec codes as outputs. Loosely based on MegaByte, VALL-E and Encodec.

Method

  • Use G2P to encode text.
  • Use encodec to encode and decode audio.
  • Custom LJSpeech dataloader to include phonemes and encodec audio codes (a minimal encoding sketch follows below).
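
A minimal sketch of the encoding side of this pipeline, assuming the g2p_en and encodec Python packages (the file path and chosen bandwidth are placeholders, not necessarily what this repo uses):

```python
import torch
import torchaudio
from g2p_en import G2p                       # grapheme-to-phoneme conversion
from encodec import EncodecModel
from encodec.utils import convert_audio

# Text -> phoneme symbols
g2p = G2p()
phonemes = g2p("And attracted attention by their")
# phonemes is a list of ARPAbet symbols (with ' ' separating words)

# Audio -> encodec codebook codes
codec = EncodecModel.encodec_model_24khz()
codec.set_target_bandwidth(3.0)              # 3 kbps -> 4 codebooks of 1024 entries each

wav, sr = torchaudio.load("path/to/LJ003-0259.wav")
wav = convert_audio(wav, sr, codec.sample_rate, codec.channels).unsqueeze(0)

with torch.no_grad():
    frames = codec.encode(wav)
codes = torch.cat([code for code, _ in frames], dim=-1)  # (batch, n_codebooks, timesteps)
```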

LJSpeech

  • Overfit model on one sample from LJSpeech
    • Combine token space of text and audio codec codes
    • LJ016-0073-synth.wav: the initial "Mr. Cope" can just about be made out.
    • Using only 2 codebooks seems to be too aggressive.
    • LJ003-0259-synth.wav: "And attracted attention by their". Using 2 codebooks is possible; the main issue is sequence length.
    • Scaling up sequence length is easier than scaling up the number of codebooks. This applies to the per-timestep interleaved arrangement [time_1_code_1, time_1_code_2, ...]. Perhaps the codebook-major arrangement [time_1_code_1, time_2_code_1, ...] might perform better, i.e. synthesise all of codebook 1, then all of codebook 2 (see the flattening sketch after this list).
    • Longer duration prompts and audio targets seem to perform worse. Will experiment with shorter prompts (sticking to roughly 3-second audio snippets).
  • [-] Generalise (using either a 1 second prompt + clip, or a 1.5 second prompt and clip)
    • Get any prompt to audio working (even if unintelligible and using clamping)
    • [-] Get any coherent output
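
A small sketch of the two flattening arrangements discussed above (function and variable names are illustrative, not the repo's actual code):

```python
import torch

def flatten_time_major(codes: torch.Tensor) -> torch.Tensor:
    """codes: (n_codebooks, timesteps) -> [t1_c1, t1_c2, ..., t2_c1, t2_c2, ...]"""
    return codes.transpose(0, 1).reshape(-1)

def flatten_codebook_major(codes: torch.Tensor) -> torch.Tensor:
    """codes: (n_codebooks, timesteps) -> [t1_c1, t2_c1, ..., t1_c2, t2_c2, ...],
    i.e. all of codebook 1 first, then all of codebook 2, and so on."""
    return codes.reshape(-1)

codes = torch.tensor([[11, 12, 13],    # codebook 1 over 3 timesteps
                      [21, 22, 23]])   # codebook 2 over 3 timesteps
print(flatten_time_major(codes))       # tensor([11, 21, 12, 22, 13, 23])
print(flatten_codebook_major(codes))   # tensor([11, 12, 13, 21, 22, 23])
```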

Inspiration

This model is loosely based on the VALL-E paper by Microsoft. It uses the MegaByte-inspired model from lucidrains as the Transformer decoder. Just as in VALL-E, a user's text prompt is converted into phonemes using G2P (grapheme-to-phoneme) conversion, and then the encodec audio codec codes are predicted. However, unlike VALL-E, only an autoregressive (AR) model is used. The VALL-E paper uses an AR model that accepts phonemes and audio codec code snippets of a source audio and predicts the first codebook's codes. The remaining codebook codes are then predicted by a separate non-autoregressive (NAR) model: once the AR model has finished, the NAR model accepts the entire sequence and predicts all of the codebook 2 to codebook N codes. However, this increases the complexity of the approach, as two models are now required, and raises the possibility that the NAR model cannot attend to all past inputs the way the AR model can, which can reduce output audio quality and may lead to repeated outputs. In practice, the use of phonemes as input into VALL-E may alleviate this; this approach instead explores predicting the entire sequence autoregressively (across all codebooks at once).
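
As a concrete illustration of the combined token space used for a single autoregressive model (the vocabulary sizes, offset and helper below are assumptions for illustration, not necessarily the repo's actual layout):

```python
import torch

# Assumed layout of the shared vocabulary: phoneme ids first, then audio codes offset past them.
N_PHONEMES    = 100                       # illustrative phoneme vocabulary size
CODEBOOK_SIZE = 1024
N_CODEBOOKS   = 4
AUDIO_OFFSET  = N_PHONEMES                # audio code c maps to token AUDIO_OFFSET + c

def build_sequence(phoneme_ids: torch.Tensor, codes: torch.Tensor) -> torch.Tensor:
    """phoneme_ids: (n_phonemes,), codes: (n_codebooks, timesteps).
    Returns one flat sequence [phoneme tokens..., audio tokens...] so a single AR
    model can predict across all codebooks at once."""
    audio_tokens = codes.transpose(0, 1).reshape(-1) + AUDIO_OFFSET  # interleave per timestep
    return torch.cat([phoneme_ids, audio_tokens])

phoneme_ids = torch.randint(0, N_PHONEMES, (12,))
codes = torch.randint(0, CODEBOOK_SIZE, (N_CODEBOOKS, 75))   # roughly 1 second of 24 kHz encodec codes
seq = build_sequence(phoneme_ids, codes)                     # length 12 + 4 * 75 = 312
```

During training the loss would then be masked so that only the audio-token portion of the sequence is predicted, per the update note above.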

This is inspired by the fact that the authors of the original MegaByte paper perform autoregressive audio prediction on raw audio data. They treat the audio files as raw byte sequences and train a model on 2TB worth of audio, finding that, compared to vanilla Transformer or Perceiver architectures, it achieves a better (lower) bits-per-byte (bpb). In principle, this means the model is more efficient and accurate at modelling raw audio byte sequences than other approaches. Another benefit of the method is that the patch-based autoregressive generation may be well suited to the codebooks used by encodec. As the patch size can be set to 4 (for 4 codebooks, each of which can be 1 of 1024 values), the local model of MegaByte can focus on modelling individual audio codec elements while the global model focuses on the larger context. Hopefully this greatly improves audio quality compared to VALL-E while being much simpler to train.
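
A sketch of how the patch structure could line up with 4 encodec codebooks, using lucidrains' MEGABYTE-pytorch package (constructor arguments are recalled from that package's README and are assumptions here; check this repo for the configuration actually used):

```python
import torch
from MEGABYTE_pytorch import MEGABYTE

# Global/local decoder with a local patch size of 4: one patch per encodec timestep,
# one token per codebook code (1024 possible values), plus room for phoneme tokens.
model = MEGABYTE(
    num_tokens = 1024 + 100,      # audio codes plus an assumed phoneme vocabulary
    dim = (768, 256),             # global and local model dimensions
    depth = (6, 4),               # global and local depths
    max_seq_len = (1024, 4),      # up to 1024 patches, 4 tokens (codebook codes) per patch
)

x = torch.randint(0, 1024, (1, 1024, 4))   # (batch, patches, tokens per patch)
loss = model(x, return_loss = True)        # autoregressive next-token loss
loss.backward()
```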
