DOC Update README (PABannier#55)
PABannier committed Aug 12, 2023
1 parent 8ec7c72 commit 00ff99b
Showing 5 changed files with 260 additions and 12 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,6 +1,7 @@
ggml_weights/*
*.dSYM/
build/
models/

bark
encodec
219 changes: 207 additions & 12 deletions README.md
@@ -1,19 +1,214 @@
# bark.cpp

![bark.cpp](./assets/banner.jpeg)

[![Actions Status](https://github.com/PABannier/bark.cpp/actions/workflows/build.yml/badge.svg)](https://github.com/PABannier/bark.cpp/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)

[Roadmap](https://github.com/users/PABannier/projects/1) / [encodec.cpp](https://github.com/PABannier/encodec.cpp) / [ggml](https://github.com/ggerganov/ggml)

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.

## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model, efficiently and using only the CPU.

- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
- [ ] Optimized via ARM NEON, Accelerate and Metal frameworks
- [ ] iOS on-device deployment using CoreML
- [ ] Mixed F16 / F32 precision
- [ ] 4-bit, 5-bit and 8-bit integer quantization

The original implementation of `bark.cpp` targets Bark's 24kHz English model. We expect to support multiple languages in the future, as well as other vocoders (see [this](https://github.com/PABannier/bark.cpp/issues/36) and [this](https://github.com/PABannier/bark.cpp/issues/6)).
This project is for educational purposes.

**Supported platforms:**

- [X] Mac OS
- [X] Linux
- [X] Windows

**Supported models:**

- [X] Bark's 24kHz model
- [ ] Bark's 48kHz model
- [ ] Multiple voices

---

Here is a typical run using Bark:

```
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms



bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms

Number of frames written = 51840.


main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
```
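
For a rough sense of scale, the `Number of frames written = 51840` line above maps directly to an audio duration, assuming each written frame is one PCM sample of the 24kHz model:

```python
# Back-of-the-envelope check: duration of the clip generated in the run above.
# Assumption: one written frame == one PCM sample at 24 kHz.
SAMPLE_RATE_HZ = 24_000
frames_written = 51_840

duration_s = frames_written / SAMPLE_RATE_HZ
print(f"{duration_s:.2f} s of audio")  # 2.16 s
```

In other words, this particular run spends roughly 36 s of end-to-end compute (see `main: total time`) to produce about 2 s of audio on an Apple Silicon Mac.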

## Usage

Here are the steps to run the Bark model.

### Get the code

```bash
git clone https://github.com/PABannier/bark.cpp.git
cd bark.cpp
```

### Build

To build `bark.cpp` you have two options. We recommend using `CMake` on Windows.

- Using `make`:
- On Linux or MacOS:

```bash
make
```

- Using `CMake`:

```bash
mkdir build
cd build
cmake ..
cmake --build . --config Release
```

### Prepare data & Run

```bash
# obtain the original bark and encodec weights and place them in ./models
python3 download_weights.py --download-dir ./models

# install Python dependencies
python3 -m pip install -r requirements.txt

# convert the model to ggml format
python3 convert.py \
--dir-model ./models \
--codec-path ./models \
--vocab-path ./models \
--out-dir ./ggml_weights/

# run the inference
./main -m ./ggml_weights/ -p "this is an audio"
```
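
If you prefer to drive the run from a script, here is a minimal sketch: it checks that the converted weights are in place and then launches `./main` with the same flags as above. The `./ggml_weights/` layout and the `-m`/`-p` flags come from the commands above; the rest (binary location, error handling) is an assumption, not part of `bark.cpp` itself.

```python
# Minimal sketch: verify converted ggml weights exist, then run the example command.
# Assumes the repository root as working directory and that `make` produced ./main.
import subprocess
import sys
from pathlib import Path

weights_dir = Path("./ggml_weights")
if not weights_dir.is_dir() or not any(weights_dir.iterdir()):
    sys.exit("No converted weights found; run convert.py first.")

# Same invocation as in the README: -m points at the ggml weights, -p is the prompt.
subprocess.run(
    ["./main", "-m", str(weights_dir), "-p", "this is an audio"],
    check=True,
)
```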

### Seminal papers and background on models

- Bark
- [Text Prompted Generative Audio](https://github.com/suno-ai/bark)
- Encodec
- [High Fidelity Neural Audio Compression](https://arxiv.org/abs/2210.13438)
- GPT-3
- [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)

### Contributing

`bark.cpp` is a continuous endeavour that relies on community efforts to last and evolve. Your contribution is welcome and highly valuable.

Bark has essentially four components (sketched below):

- [x] Semantic model to encode the text input
- [x] Coarse model
- [x] Fine model
- [ ] Encodec (quantizer + decoder) to generate the waveform from the tokens

Your contribution can be:

- a bug report: you may encounter a bug while using `bark.cpp`. Don't hesitate to report it in the issue section.
- a feature request: you want to add a new model or support a new platform. You can use the issue section to make suggestions.
- a pull request: you may have fixed a bug, added a feature, or even fixed a small typo in the documentation; you can submit a pull request and a reviewer will reach out to you.
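
For orientation, here is a rough sketch of how those four components chain together. It is illustrative only: the function names are hypothetical stand-ins, not the `bark.cpp` API; only the stage order (text → semantic tokens → coarse tokens → fine tokens → waveform) reflects the description above.

```python
# Illustrative sketch of the Bark pipeline; stand-in stubs, not the bark.cpp API.
from typing import List


def text_to_semantic_tokens(prompt: str) -> List[int]:
    """Semantic model: text prompt -> semantic token ids."""
    return []  # stub


def semantic_to_coarse_tokens(semantic: List[int]) -> List[int]:
    """Coarse model: semantic tokens -> first audio codebooks."""
    return []  # stub


def coarse_to_fine_tokens(coarse: List[int]) -> List[int]:
    """Fine model: fills in the remaining audio codebooks."""
    return []  # stub


def tokens_to_waveform(fine: List[int]) -> List[float]:
    """Encodec decoder: audio tokens -> 24kHz PCM samples."""
    return []  # stub


def synthesize(prompt: str) -> List[float]:
    semantic = text_to_semantic_tokens(prompt)
    coarse = semantic_to_coarse_tokens(semantic)
    fine = coarse_to_fine_tokens(coarse)
    return tokens_to_waveform(fine)
```

The first three stages are GPT-style autoregressive models (hence the three `gpt_model_load` blocks in the log above), while the last stage is handled by the Encodec codec model.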

### Coding guidelines

- Avoid adding third-party dependencies, extra files, extra headers, etc.
- Always consider cross-compatibility with other operating systems and architectures.
- Avoid fancy-looking modern STL constructs; keep it simple.
- Clean up any trailing whitespace, use 4 spaces for indentation, brackets on the same line, `void * ptr`, `int & ref`.

## Roadmap

- [ ] Quantization
- [ ] FP16
- [ ] Swift package for iOS devices
Binary file added assets/banner.jpeg
49 changes: 49 additions & 0 deletions download_weights.py
@@ -0,0 +1,49 @@
import argparse
from pathlib import Path

from huggingface_hub import hf_hub_download
import torch


# Keep the checkpoint location as a plain string: wrapping the URL in
# pathlib.Path would collapse the "//" after the scheme.
ENCODEC_URL = "https://dl.fbaipublicfiles.com/encodec/v0/encodec_24khz-d7cc33bc.th"

REMOTE_MODEL_PATHS = {
    "text": {
        "repo_id": "suno/bark",
        "file_name": "text_2.pt",
    },
    "coarse": {
        "repo_id": "suno/bark",
        "file_name": "coarse_2.pt",
    },
    "fine": {
        "repo_id": "suno/bark",
        "file_name": "fine_2.pt",
    },
}

parser = argparse.ArgumentParser()
parser.add_argument("--download-dir", type=str, required=True)

if __name__ == "__main__":
    args = parser.parse_args()
    out_dir = Path(args.download_dir)

    out_dir.mkdir(parents=True, exist_ok=True)

    print(" ### Downloading bark encoders...")
    for model_k in REMOTE_MODEL_PATHS.keys():
        model_details = REMOTE_MODEL_PATHS[model_k]
        repo_id, filename = model_details["repo_id"], model_details["file_name"]
        hf_hub_download(repo_id=repo_id, filename=filename, local_dir=out_dir)

    print(" ### Downloading EnCodec weights...")
    # Downloads to the torch hub cache and verifies the hash embedded in the file name.
    state_dict = torch.hub.load_state_dict_from_url(
        ENCODEC_URL,
        map_location="cpu",
        check_hash=True
    )
    with open(out_dir / Path(ENCODEC_URL).name, "wb") as fout:
        torch.save(state_dict, fout)

    print("Done.")
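
As a quick sanity check after running the script, the sketch below verifies that the four checkpoints referenced above (the three Bark models from `REMOTE_MODEL_PATHS` plus the EnCodec checkpoint) landed in the download directory; the `./models` path mirrors the README example and is otherwise an assumption.

```python
# Confirm the four checkpoints fetched by download_weights.py are present.
from pathlib import Path

download_dir = Path("./models")  # same value as --download-dir in the README example
expected = ["text_2.pt", "coarse_2.pt", "fine_2.pt", "encodec_24khz-d7cc33bc.th"]

for name in expected:
    path = download_dir / name
    status = f"{path.stat().st_size / 1e6:.1f} MB" if path.exists() else "MISSING"
    print(f"{name}: {status}")
```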
3 changes: 3 additions & 0 deletions requirements.txt
@@ -0,0 +1,3 @@
huggingface-hub>=0.14.1
numpy
torch
