
Applio


VITS-based Voice Conversion focused on simplicity, quality, and performance.

🌐 Website • 📚 Documentation • ☎️ Discord

🛒 Plugins • 📦 Compiled • 🎮 Playground • 🔎 Google Colab (UI) • 🔎 Google Colab (No UI)

Table of Contents

  • Installation
  • Usage
  • Technical Information
  • Repository Enhancements
  • Commercial Usage
  • References
  • Contributors

Installation

Download the latest version from GitHub Releases or use the Compiled Versions.

Windows

./run-install.bat

macOS

For macOS, install the requirements inside a Python environment (version 3.9 to 3.11). Here are the steps:

python3 -m venv .venv
source .venv/bin/activate
chmod +x run-install.sh
./run-install.sh

Linux

Certain Linux distributions may run into issues with the installer. In such cases, we suggest installing requirements.txt inside a Python environment (version 3.9 to 3.11).

chmod +x run-install.sh
./run-install.sh

Makefile

For platforms such as Paperspace:

make run-install

Usage

Visit Applio Documentation for a detailed UI usage explanation.

Windows

./run-applio.bat

macOS

chmod +x run-applio.sh
./run-applio.sh

Linux

chmod +x run-applio.sh
./run-applio.sh

Makefile

For platforms such as Paperspace:

make run-applio

Technical Information

Applio uses an enhanced version of the Retrieval-based Voice Conversion (RVC) model, a powerful technique for transforming the voice of an audio signal to sound like another person. This advanced implementation of RVC in Applio enables high-quality voice conversion while maintaining simplicity and performance.

0. Pre-Learning: Key Concepts in Speech Processing and Voice Conversion

This section introduces fundamental concepts in speech processing and voice conversion, paving the way for a deeper understanding of the RVC pipeline:

1. Speech Representation

  • Phoneme: The smallest unit of sound in a language that distinguishes one word from another. Examples: /k/, /æ/, /t/.
  • Spectrogram: A visual representation of the frequency content of a sound over time, showing how the intensity of different frequencies changes over the duration of the audio.
  • Mel-Spectrogram: A type of spectrogram that mimics human auditory perception, emphasizing frequencies that are more important to human hearing (see the sketch after this list).
  • Speaker Embedding: A vector representation that captures the unique acoustic characteristics of a speaker's voice, encoding information about pitch, tone, timbre, and other vocal qualities.
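
To make these representations concrete, here is a minimal sketch of computing a mel-spectrogram with librosa; the file path and parameter values (n_fft, hop_length, n_mels) are illustrative choices, not the settings Applio uses internally:

```python
import librosa
import numpy as np

# Load an audio file (path is illustrative); sr=16000 resamples to 16 kHz.
waveform, sample_rate = librosa.load("example.wav", sr=16000)

# Power spectrogram projected onto a mel frequency scale.
mel = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=1024,      # STFT window size
    hop_length=256,  # step between successive frames
    n_mels=80,       # number of mel bands (80 is a common choice in TTS/VC systems)
)

# Convert power to decibels, which is closer to perceived loudness.
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, n_frames)
```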

2. Text-to-Speech (TTS)

  • TTS Model: A machine learning model that generates artificial speech from written text.
  • Encoder-Decoder Architecture: A common architecture in TTS models, where an encoder processes the text and pitch information to create a latent representation, and a decoder uses this representation to synthesize the audio signal.
  • Transformer Architecture: A powerful neural network architecture particularly well-suited for sequence modeling, allowing the model to handle long sequences of text or audio and capture relationships between elements.

3. Voice Conversion

  • Voice Conversion (VC): The process of transforming the voice of a speaker in an audio signal to sound like another speaker.
  • Speaker Adaptation: The process of adapting a TTS model to a specific speaker, often by training on a small dataset of the speaker's voice.
  • Retrieval-Based VC (RVC): A voice conversion approach where speaker embeddings are retrieved from a database and used to guide the TTS model in synthesizing audio with the target speaker's voice.

4. Additional Concepts

  • ContentVec: A powerful self-supervised learning model for speech representation, excelling at capturing speaker-specific information.
  • FAISS: A library for efficient similarity search, used to retrieve speaker embeddings that are similar to the extracted ContentVec embedding (see the sketch after this list).
  • Neural Source Filter (NSF): A module that models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
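
As a concrete illustration of the FAISS retrieval mentioned above, the sketch below builds a small index of embedding vectors and queries it for nearest neighbours; the dimensionality and the random data are placeholders, since Applio builds its real index files during training:

```python
import faiss
import numpy as np

dim = 256  # embedding dimensionality (illustrative)
database = np.random.rand(1000, dim).astype("float32")  # stand-in for stored embeddings

# Build a flat (exact) L2 index and add the database vectors.
index = faiss.IndexFlatL2(dim)
index.add(database)

# Query with one extracted embedding and retrieve the 8 closest entries.
query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 8)
print(indices[0])  # positions of the most similar embeddings in the database
```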

5. Why are these concepts important?

Understanding these concepts is essential for appreciating the mechanics and capabilities of the RVC pipeline:

  • Speech Representation: Different representations capture different aspects of speech, allowing for effective analysis and manipulation.
  • TTS Models: The TTS model forms the foundation of RVC, providing the ability to synthesize audio from text and pitch.
  • Voice Conversion: Voice conversion aims to transfer a speaker's identity to a different audio signal.
  • ContentVec and Speaker Embeddings: ContentVec provides a powerful way to extract speaker-specific information, which is crucial for accurate voice conversion.
  • FAISS: This library enables efficient speaker embedding retrieval, facilitating the selection of appropriate target voices.
  • NSF: The NSF is a critical component of the TTS model, contributing to the generation of realistic and high-quality audio.

1. Model Architecture

The RVC model comprises two main components:

A. Encoder-Decoder Network

This network synthesizes audio based on text and pitch information while incorporating speaker characteristics from the ContentVec embedding.

Encoder:

  • Input: Phoneme sequences (text representation) and pitch information (optional).

  • Embeddings:

    • Phonemes are represented as vectors using linear layers, creating a dense representation of the text input.
    • Pitch is usually converted to a one-hot encoding or a continuous value and embedded similarly.
  • Transformer Encoder: Processes the embedded features in a highly parallel manner.

    It employs:

    • Self-Attention: Allows the encoder to attend to different parts of the input sequence to understand the relationships between words and their context.
    • Feedforward Networks (FFN): Apply non-linear transformations to further refine the features captured by self-attention.
    • Layer Normalization: Stabilizes training and improves performance by normalizing the outputs of each layer.
    • Dropout: A regularization technique to prevent overfitting.
  • Output: Produces a latent representation of the input text and pitch, capturing their relationships and serving as the input for the decoder.
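
A minimal PyTorch sketch of an encoder along these lines is shown below; the module names, layer sizes, and the use of nn.TransformerEncoder are illustrative simplifications rather than Applio's actual model definition:

```python
import torch
import torch.nn as nn

class PhonemePitchEncoder(nn.Module):
    """Toy encoder: embeds phonemes and coarse pitch, then runs a Transformer encoder."""

    def __init__(self, n_phonemes=178, n_pitch_bins=256, d_model=192, n_layers=4):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.pitch_emb = nn.Embedding(n_pitch_bins, d_model)  # quantized pitch ids
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=2, dim_feedforward=768,
            dropout=0.1, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phonemes, pitch):
        # phonemes, pitch: (batch, seq_len) integer tensors of equal length
        x = self.phoneme_emb(phonemes) + self.pitch_emb(pitch)
        return self.encoder(x)  # (batch, seq_len, d_model) latent representation

# Example: a batch of 2 sequences of length 50.
encoder = PhonemePitchEncoder()
latent = encoder(torch.randint(0, 178, (2, 50)), torch.randint(0, 256, (2, 50)))
print(latent.shape)  # torch.Size([2, 50, 192])
```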

Decoder:

  • Input: The latent representation from the encoder.
  • Transformer Decoder: Receives the encoder output and utilizes:
    • Self-Attention: Allows the decoder to attend to different parts of the generated sequence to maintain consistency and coherence in the output audio.
    • Encoder-Decoder Attention: Enables the decoder to incorporate information from the input text and pitch into the audio generation process.
  • Neural Source Filter (NSF): A powerful component for generating audio, modeling the generation process as a filter applied to a source signal. It uses:
    • Upsampling: Increases the resolution of the latent representation to match the desired length of the audio signal.
    • Residual Blocks: Learn complex and non-linear relationships between input features and the output audio, contributing to realistic and detailed waveforms.
    • Source Module: Generates the excitation signal (often harmonic) that drives the NSF. It combines sine waves (for voiced sounds) and noise (for unvoiced sounds) to create a natural source signal.
    • Noise Convolution: Convolves noise with the harmonic signal to introduce additional variation and realism.
    • Final Convolutional Layer: Converts the filtered output to a single-channel audio waveform.
  • Output: Synthesized audio signal.
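
To make the NSF's source module more concrete, here is a minimal sketch that builds a harmonic-plus-noise excitation signal from a per-sample F0 contour; the amplitudes, noise level, and sampling rate are illustrative, and the real module learns and combines these components rather than using fixed constants:

```python
import torch

def source_excitation(f0, sample_rate=40000, voiced_amp=0.1, noise_std=0.003):
    """Build a simple sine-plus-noise excitation from a per-sample F0 contour.

    f0: tensor of shape (n_samples,) in Hz, with 0 marking unvoiced samples.
    """
    # Integrate instantaneous frequency to obtain phase, then take the sine.
    phase = 2 * torch.pi * torch.cumsum(f0 / sample_rate, dim=0)
    harmonic = voiced_amp * torch.sin(phase)

    # Voiced regions keep the sine; unvoiced regions fall back to noise alone.
    voiced_mask = (f0 > 0).float()
    noise = noise_std * torch.randn_like(f0)
    return voiced_mask * harmonic + noise

# Example: half a second of a 220 Hz tone followed by half a second of silence (unvoiced).
f0 = torch.cat([torch.full((20000,), 220.0), torch.zeros(20000)])
excitation = source_excitation(f0)
print(excitation.shape)  # torch.Size([40000])
```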

B. ContentVec Speaker Embedding Extractor

Extracts speaker-specific information from the input audio.

  • Input: The preprocessed audio signal.
  • Processing: The ContentVec model, trained on a massive dataset of speech data, processes the input audio and extracts a speaker embedding vector, capturing the unique acoustic properties of the speaker's voice.
  • Output: A speaker embedding vector representing the voice of the speaker.
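
For illustration, the sketch below extracts utterance-level features with a self-supervised speech encoder; it uses torchaudio's HuBERT bundle purely as a stand-in for ContentVec, and the mean-pooling at the end is a simplification (the real pipeline keeps frame-level features for conversion):

```python
import torch
import torchaudio

# HuBERT here is only a stand-in for ContentVec; this is not Applio's actual loader.
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("speaker.wav")  # path is illustrative
waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    features, _ = model.extract_features(waveform)  # list of per-layer feature tensors
    frame_features = features[-1]                   # (1, n_frames, 768), last layer

# Mean-pool the frames into a single utterance-level vector.
embedding = frame_features.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```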

2. Training Stage

The RVC model is trained using a combination of two key losses:

  • Generative Loss:
    • Mel-Spectrogram: The Mel-spectrogram is computed for both the target audio and the generated audio.
    • L1 Loss: Measures the absolute difference between the Mel-spectrograms of the target and generated audio, encouraging the decoder to produce audio with a similar spectral profile.
  • Discriminative Loss:
    • Multi-Period Discriminator: Tries to distinguish between real and generated audio at different time scales, using convolution layers to capture long-term dependencies in the audio.
    • Adversarial Training: The generator tries to fool the discriminator by producing audio that sounds real, while the discriminator is trained to correctly identify generated audio.
  • Optional KL Divergence Loss: Measures the difference between the distributions of latent variables generated by the encoder and a posterior encoder (which infers the latent representation from the target audio). Encourages the model to learn a more efficient and stable latent representation.
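
A heavily simplified PyTorch sketch of how the generative and adversarial terms can be combined is shown below; the discriminator outputs, loss weights, and least-squares formulation are placeholders standing in for Applio's actual training code:

```python
import torch
import torch.nn.functional as F

def generator_loss(mel_real, mel_fake, disc_fake_scores, mel_weight=45.0):
    """Reconstruction (L1 mel) plus adversarial loss for the generator.

    disc_fake_scores: list of discriminator outputs for the generated audio.
    mel_weight: relative weight of the mel term (value is illustrative).
    """
    mel_loss = F.l1_loss(mel_fake, mel_real)
    # Least-squares adversarial term: push each fake score toward 1 ("real").
    adv_loss = sum(torch.mean((1.0 - score) ** 2) for score in disc_fake_scores)
    return mel_weight * mel_loss + adv_loss

def discriminator_loss(disc_real_scores, disc_fake_scores):
    """Least-squares GAN loss: real scores toward 1, fake scores toward 0."""
    loss = 0.0
    for real_score, fake_score in zip(disc_real_scores, disc_fake_scores):
        loss = loss + torch.mean((1.0 - real_score) ** 2) + torch.mean(fake_score ** 2)
    return loss
```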

3. Inference Stage

The inference stage utilizes the trained model to convert the voice of an audio input to sound like a target speaker. Here's a breakdown:

Input:

  • Phoneme sequences (text representation).
  • Pitch information (optional).
  • Target speaker ID (identifies the desired voice).

Steps:

  • ContentVec Embedding Extraction:
    • The ContentVec model processes the input audio and extracts a speaker embedding vector, capturing the voice characteristics of the speaker.
  • Optional Embedding Retrieval:
    • FAISS Index: Used to efficiently search for speaker embeddings similar to the extracted ContentVec embedding. It helps guide the voice conversion process toward a specific speaker when multiple speakers are available.
    • Embedding Retrieval: The FAISS index is queried using the extracted ContentVec embedding, and similar embeddings are retrieved.
  • Embedding Manipulation:
    • Blending: The extracted ContentVec embedding can be blended with retrieved embeddings using the index_rate parameter, allowing control over how much the target speaker's voice influences the conversion.
  • Encoder-Decoder Processing:
    • Encoder: Encodes the phoneme sequences and pitch into a latent representation, capturing the relationships between them.
    • Decoder: Synthesizes the audio signal, incorporating the speaker characteristics from the ContentVec embedding (potentially blended with retrieved embeddings).
  • Post-Processing:
    • Resampling: Adjusts the sampling rate of the generated audio if needed.
    • RMS Adjustment: Adjusts the volume (RMS) of the output audio to match the input audio.
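
The embedding blending and RMS matching steps can be illustrated with a short NumPy sketch; index_rate mirrors the parameter described above, while the function names and the single global RMS scale are illustrative simplifications:

```python
import numpy as np

def blend_embeddings(extracted, retrieved, index_rate):
    """Mix the extracted ContentVec features with retrieved ones.

    index_rate = 0 keeps only the extracted features; 1 uses only the retrieved ones.
    """
    return index_rate * retrieved + (1.0 - index_rate) * extracted

def match_rms(output_audio, input_audio, eps=1e-8):
    """Scale the converted audio so its overall RMS matches the input's."""
    rms_in = np.sqrt(np.mean(input_audio ** 2) + eps)
    rms_out = np.sqrt(np.mean(output_audio ** 2) + eps)
    return output_audio * (rms_in / rms_out)

# Example with dummy data.
extracted = np.random.rand(100, 256).astype("float32")
retrieved = np.random.rand(100, 256).astype("float32")
blended = blend_embeddings(extracted, retrieved, index_rate=0.5)
converted = match_rms(np.random.randn(16000), np.random.randn(16000))
```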

4. Key Techniques

  • Transformer Architecture: The Transformer architecture is a powerful tool for sequence modeling, enabling the encoder and decoder to efficiently process long sequences and capture complex relationships within the data.
  • Neural Source Filter (NSF): Models audio generation as a filtering process, allowing the model to produce high-quality and realistic audio signals by learning complex relationships between the source signal and the output waveform.
  • Flow-Based Generative Model: Enables the model to learn complex probability distributions for the audio signal, leading to more realistic and diverse generated speech.
  • Multi-period Discriminator: Helps improve the quality and realism of the generated audio by evaluating it at different temporal scales and providing feedback to the generator (see the sketch after this list).

  • Relative Positional Encoding: Helps the model understand the relative positions of elements within the input sequences, improving the model's ability to handle long sequences and maintain context.
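
As a small illustration of how a multi-period discriminator views audio at several time scales, the sketch below folds a 1-D waveform into a 2-D grid for a given period; in the full model, one sub-discriminator per period applies 2-D convolutions to these grids (the period set and shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

def reshape_for_period(waveform, period):
    """Fold a (batch, 1, time) waveform into (batch, 1, time // period, period).

    Each sub-discriminator applies this with its own period, so its 2-D
    convolutions see periodic structure at that time scale.
    """
    batch, channels, time = waveform.shape
    if time % period != 0:
        pad = period - (time % period)
        waveform = F.pad(waveform, (0, pad), mode="reflect")
        time += pad
    return waveform.view(batch, channels, time // period, period)

audio = torch.randn(1, 1, 16000)
for period in (2, 3, 5, 7, 11):  # a typical period set in HiFi-GAN-style discriminators
    grid = reshape_for_period(audio, period)
    print(period, tuple(grid.shape))
```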

5. Future Challenges

Despite the advancements in Retrieval-Based Voice Conversion, several challenges and areas for future research remain:

  • Speaker Generalization: Improving the ability of models to generalize to unseen speakers with minimal data.
  • Real-time Processing: Enhancing the efficiency of models to support real-time voice conversion applications.
  • Emotional Expression: Better capturing and transferring emotional nuances in voice conversion.
  • Noise Robustness: Improving the robustness of voice conversion models to handle noisy and low-quality input audio.

Repository Enhancements

This repository has undergone significant enhancements to improve its functionality and maintainability:

  • Modular Codebase: Restructured codebase for better organization, readability, and maintenance.
  • Hop Length Implementation: Improved efficiency and performance, especially on Crepe (formerly Mangio-Crepe), thanks to @Mangio621.
  • Translations in 30+ Languages: Added support for over 30 languages.
  • Cross-Platform Compatibility: Ensured seamless operation across various platforms.
  • Optimized Requirements: Fine-tuned project requirements for enhanced performance.
  • Streamlined Installation: Simplified installation process for a user-friendly setup.
  • Hybrid F0 Estimation: Introduced a personalized 'hybrid' F0 estimation method utilizing nanmedian.
  • Easy-to-Use UI: Implemented an intuitive user interface.
  • Plugin System: Introduced a plugin system for extending functionality.
  • Overtraining Detector: Implemented a detector to prevent excessive training.
  • Model Search: Integrated model search feature for easy discovery.
  • Pretrained Models: Added support for custom pretrained models.
  • Voice Blender: Developed a feature to combine two trained models to create a new one.
  • Accessibility Improvements: Enhanced with descriptive tooltips for UI elements.
  • New F0 Extraction Methods: Introduced methods like FCPE or Hybrid for pitch extraction.
  • Output Format Selection: Added feature to choose audio file formats.
  • Hashing System: Assigned unique IDs to models to prevent unauthorized duplication.
  • Model Download System: Supported downloads from various platforms.
  • TTS Enhancements: Improved Text-to-Speech functionality.
  • Split Audio: Implemented audio splitting for faster processing.
  • Discord Presence: Displayed usage status on Discord.
  • Flask Integration: Enabled automatic model downloads via Flask.
  • Support Tab: Added a tab for screen recording to report issues.

These enhancements contribute to a more robust and scalable codebase, making the repository more accessible for contributors and users alike.

Commercial Usage

For commercial purposes, please adhere to the guidelines outlined in the MIT license governing this project. Prior to integrating Applio into your application, we kindly request that you contact us at [email protected] to ensure ethical use.

Please note, the use of Applio-generated audio files falls under your own responsibility and must always respect applicable copyrights. We encourage you to consider supporting the continuous development and maintenance of Applio through a donation.

Your cooperation and support are greatly appreciated. Thank you!

References

Applio is made possible by these projects and those cited in their references.

Contributors