DOC Update README (PABannier#55)
PABannier committed Aug 12, 2023
1 parent 8ec7c72 commit 00ff99b
Showing 5 changed files with 260 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
ggml_weights/*
*.dSYM/
build/
models/

bark
encodec
219 changes: 207 additions & 12 deletions README.md
@@ -1,19 +1,214 @@
# bark.cpp

![bark.cpp](./assets/banner.jpeg)

[![Actions Status](https://github.com/PABannier/bark.cpp/actions/workflows/build.yml/badge.svg)](https://github.com/PABannier/bark.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

[Roadmap](https://github.com/users/PABannier/projects/1) / [encodec.cpp](https://github.com/PABannier/encodec.cpp) / [ggml](https://github.com/ggerganov/ggml)

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.

## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model, efficiently and using only the CPU.

- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [ ] Mixed F16 / F32 precision
- [ ] 4-bit, 5-bit and 8-bit integer quantization

The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple languages in the future, as well as other vocoders (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)).
This project is for educational purposes.

**Supported platforms:**

- [X] Mac OS
- [X] Linux
- [X] Windows

**Supported models:**

- [X] Bark's 24kHz model
- [ ] Bark's 48kHz model
- [ ] Multiple voices

---

Here is a typical run using Bark:

```
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms



bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms

Number of frames written = 51840.


main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
```
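
For a rough sense of scale, the `Number of frames written = 51840` line above maps directly to an audio duration, assuming each written frame is one PCM sample of the 24kHz model:

```python
# Back-of-the-envelope check: duration of the clip generated in the run above.
# Assumption: one written frame == one PCM sample at 24 kHz.
SAMPLE_RATE_HZ = 24_000
frames_written = 51_840

duration_s = frames_written / SAMPLE_RATE_HZ
print(f"{duration_s:.2f} s of audio")  # 2.16 s
```

In other words, this particular run spends roughly 36 s of end-to-end compute (see `main: total time`) to produce about 2 s of audio on an Apple Silicon Mac.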

## Usage

Here are the steps to run the Bark model.

### Get the code

```bash
git clone https://github.com/PABannier/bark.cpp.git
cd bark.cpp
```

### Build

To build `bark.cpp` you have two options. We recommend using `CMake` on Windows.

- Using `make`:
- On Linux or MacOS:

```bash
make
```

- Using `CMake`:

```bash
mkdir build
cd build
cmake ..
cmake --build . --config Release
```

### Prepare data & Run

```bash
# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml format
python3 convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./models \
--out-dir ./ggml_weights/

# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
```
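
If you prefer to drive the run from a script, here is a minimal sketch: it checks that the converted weights are in place and then launches `./main` with the same flags as above. The `./ggml_weights/` layout and the `-m`/`-p` flags come from the commands above; the rest (binary location, error handling) is an assumption, not part of `bark.cpp` itself.

```python
# Minimal sketch: verify converted ggml weights exist, then run the example command.
# Assumes the repository root as working directory and that `make` produced ./main.
import subprocess
import sys
from pathlib import Path

weights_dir = Path("./ggml_weights")
if not weights_dir.is_dir() or not any(weights_dir.iterdir()):
    sys.exit("No converted weights found; run convert.py first.")

# Same invocation as in the README: -m points at the ggml weights, -p is the prompt.
subprocess.run(
    ["./main", "-m", str(weights_dir), "-p", "this is an audio"],
    check=True,
)
```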

### Seminal papers and background on models

- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

### Contributing

`bark.cpp` is a continuous endeavour that relies on community efforts to last and evolve. Your contribution is welcome and highly valuable.

Bark has essentially four components (sketched below):

- [x] Semantic model to encode the text input
- [x] Coarse model
- [x] Fine model
- [ ] Encodec (quantizer + decoder) to generate the waveform from the tokens

Your contribution can be:

- a bug report: you may encounter a bug while using `bark.cpp`. Don't hesitate to report it in the issue section.
- a feature request: you want to add a new model or support a new platform. You can use the issue section to make suggestions.
- a pull request: you may have fixed a bug, added a feature, or even fixed a small typo in the documentation; you can submit a pull request and a reviewer will reach out to you.
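
For orientation, here is a rough sketch of how those four components chain together. It is illustrative only: the function names are hypothetical stand-ins, not the `bark.cpp` API; only the stage order (text → semantic tokens → coarse tokens → fine tokens → waveform) reflects the description above.

```python
# Illustrative sketch of the Bark pipeline; stand-in stubs, not the bark.cpp API.
from typing import List


def text_to_semantic_tokens(prompt: str) -> List[int]:
    """Semantic model: text prompt -> semantic token ids."""
    return []  # stub


def semantic_to_coarse_tokens(semantic: List[int]) -> List[int]:
    """Coarse model: semantic tokens -> first audio codebooks."""
    return []  # stub


def coarse_to_fine_tokens(coarse: List[int]) -> List[int]:
    """Fine model: fills in the remaining audio codebooks."""
    return []  # stub


def tokens_to_waveform(fine: List[int]) -> List[float]:
    """Encodec decoder: audio tokens -> 24kHz PCM samples."""
    return []  # stub


def synthesize(prompt: str) -> List[float]:
    semantic = text_to_semantic_tokens(prompt)
    coarse = semantic_to_coarse_tokens(semantic)
    fine = coarse_to_fine_tokens(coarse)
    return tokens_to_waveform(fine)
```

The first three stages are GPT-style autoregressive models (hence the three `gpt_model_load` blocks in the log above), while the last stage is handled by the Encodec codec model.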

### Coding guidelines

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures.
- Avoid fancy-looking modern STL constructs; keep it simple.
- Clean up any trailing whitespace, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`.

## Roadmap

- [ ] Quantization
- [ ] FP16
- [ ] Swift package for iOS devices
Binary file added assets/banner.jpeg
49 changes: 49 additions & 0 deletions download_weights.py
@@ -0,0 +1,49 @@
import argparse
from pathlib import Path

from huggingface_hub import hf_hub_download
import torch


# Keep the checkpoint location as a plain string: wrapping the URL in
# pathlib.Path would collapse the "//" after the scheme.
ENCODEC_URL = "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th"

REMOTE_MODEL_PATHS = {
    "text": {
        "repo_id": "suno/bark",
        "file_name": "text_2.pt",
    },
    "coarse": {
        "repo_id": "suno/bark",
        "file_name": "coarse_2.pt",
    },
    "fine": {
        "repo_id": "suno/bark",
        "file_name": "fine_2.pt",
    },
}

parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", type=str, required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    out_dir = Path(args.download_dir)

    out_dir.mkdir(parents=True, exist_ok=True)

    print(" ### Downloading bark encoders...")
    for model_k in REMOTE_MODEL_PATHS.keys():
        model_details = REMOTE_MODEL_PATHS[model_k]
        repo_id, filename = model_details["repo_id"], model_details["file_name"]
        hf_hub_download(repo_id=repo_id, filename=filename, local_dir=out_dir)

    print(" ### Downloading EnCodec weights...")
    # Downloads to the torch hub cache and verifies the hash embedded in the file name.
    state_dict = torch.hub.load_state_dict_from_url(
        ENCODEC_URL,
        map_location="cpu",
        check_hash=True
    )
    with open(out_dir / Path(ENCODEC_URL).name, "wb") as fout:
        torch.save(state_dict, fout)

    print("Done.")
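
As a quick sanity check after running the script, the sketch below verifies that the four checkpoints referenced above (the three Bark models from `REMOTE_MODEL_PATHS` plus the EnCodec checkpoint) landed in the download directory; the `./models` path mirrors the README example and is otherwise an assumption.

```python
# Confirm the four checkpoints fetched by download_weights.py are present.
from pathlib import Path

download_dir = Path("./models")  # same value as --download-dir in the README example
expected = ["text_2.pt", "coarse_2.pt", "fine_2.pt", "encodec_24khz-d7cc33bc.th"]

for name in expected:
    path = download_dir / name
    status = f"{path.stat().st_size / 1e6:.1f} MB" if path.exists() else "MISSING"
    print(f"{name}: {status}")
```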
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
huggingface-hub>=0.14.1
numpy
torch
