sync: use encodec's latest version as a submodule #124

Merged (69 commits) on Feb 13, 2024
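
This PR drops the vendored encodec sources and tracks encodec.cpp as a git submodule instead (see the `rm encodec`, `add git submodules` and `added encodec submodule` commits below). As a hedged sketch only, the usual submodule workflow looks like the following; the exact upstream URL and checkout path are assumptions, not stated on this page:

```bash
# Hypothetical sketch of the submodule workflow; the URL and the
# encodec.cpp/ path are assumptions, check .gitmodules for the real values.
git submodule add https://github.com/PABannier/encodec.cpp encodec.cpp
git submodule update --init --recursive

# Syncing the submodule to the latest upstream commit later on:
git submodule update --remote encodec.cpp
git add encodec.cpp
git commit -m "sync: bump encodec.cpp submodule"
```

Downstream users then need `git clone --recursive`, or `git submodule update --init --recursive` after a plain clone, for the encodec sources to be present at build time.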

Commits
2302881  rm encodec (PABannier, Oct 26, 2023)
450a606  add git submodules (PABannier, Oct 26, 2023)
4ece5fa  removed bark util (PABannier, Oct 26, 2023)
9c9f7e8  updated CMakeLists (PABannier, Oct 26, 2023)
e2b350a  rm build scripts (PABannier, Oct 26, 2023)
d491cc7  mv dr_wav in examples (PABannier, Oct 26, 2023)
38c2e49  common cpp (PABannier, Oct 26, 2023)
6b32b3b  moved def constants (PABannier, Oct 26, 2023)
ab9b528  text encoder loaded with the latest ggml API (PABannier, Oct 27, 2023)
6a9b50a  pulled ggml upstream (PABannier, Oct 27, 2023)
efbdd56  temporarily removed subdirectory encodec.cpp (PABannier, Oct 27, 2023)
753d5cf  clean forward pass text encoder (PABannier, Oct 27, 2023)
e4e712f  compiling (PABannier, Oct 27, 2023)
d8fc378  fix issue definition (PABannier, Oct 27, 2023)
b258c08  clean (PABannier, Oct 28, 2023)
6642e75  remove codec parsing functions (PABannier, Oct 28, 2023)
33d186e  kinda works (PABannier, Oct 28, 2023)
83a21ec  bias is stored in hparams (PABannier, Oct 28, 2023)
6cad888  working text encoder (PABannier, Oct 28, 2023)
242e7c5  cln tests (PABannier, Oct 28, 2023)
c1d0edd  coarse working? (PABannier, Oct 28, 2023)
94cd5e2  override bias (PABannier, Oct 29, 2023)
acf9dfa  working fine encoder (PABannier, Oct 29, 2023)
c1def75  rename quantize.cpp into main.cpp (PABannier, Oct 29, 2023)
cfaa59c  included quantize as a target (PABannier, Oct 29, 2023)
05ef89d  exposed quantization function (PABannier, Oct 29, 2023)
6172381  minor (PABannier, Oct 29, 2023)
6d0db93  update CIs (PABannier, Oct 29, 2023)
a978908  updated CIs (PABannier, Oct 29, 2023)
8ae7dc5  passing tokenizer test (PABannier, Oct 29, 2023)
7ad8cd5  Merge branch 'main' of https://github.com/PABannier/bark.cpp into enc… (PABannier, Oct 29, 2023)
7c2ae84  fast text encoder (PABannier, Dec 11, 2023)
d3971c2  Merge branch 'main' of https://github.com/PABannier/bark.cpp into enc… (PABannier, Dec 11, 2023)
5874a87  `bark.cpp` -> `bark` (PABannier, Dec 11, 2023)
5312577  server abides by latest API (PABannier, Dec 12, 2023)
e7b7d75  rm fast-text-encoder example (PABannier, Dec 12, 2023)
2aaf7b2  pass `-O3` release flag (PABannier, Dec 12, 2023)
79ed551  rm fast_text_encoder from CMakeLists (PABannier, Dec 12, 2023)
4f72d56  restructured (PABannier, Dec 12, 2023)
f13498a  CMakeLists arranged (PABannier, Dec 13, 2023)
f517570  update CIs (PABannier, Dec 13, 2023)
b8bdd76  add encodec.cpp in the loop (PABannier, Dec 14, 2023)
5319d26  add verbosity level (PABannier, Dec 15, 2023)
11c3f9a  Fix CIs (#128) (AlexHayton, Dec 30, 2023)
da3cc56  Merge branch 'encodec_as_submodule' of https://github.com/PABannier/b… (PABannier, Jan 2, 2024)
07a322c  fix coarse encoder internal pass (PABannier, Jan 3, 2024)
3002698  `VerbosityLevel` -> `bark_verbosity_level` (PABannier, Jan 5, 2024)
747345c  updated examples (PABannier, Jan 5, 2024)
a3e3e92  populated time per token (PABannier, Jan 5, 2024)
19e1683  remove whitespace (PABannier, Jan 5, 2024)
fa6975c  BarkProgressBar implemented (PABannier, Jan 6, 2024)
b9e2109  verbosity level controlled for cleaner output (PABannier, Jan 6, 2024)
4401975  removed params as macros and moved them into default constructor (PABannier, Jan 6, 2024)
38846ec  updated README (PABannier, Jan 6, 2024)
59d5352  removed useless `n_predict` in params (PABannier, Jan 6, 2024)
07e92de  removed old tests (PABannier, Jan 6, 2024)
ec677fb  fix wrong return type, quantization works again (PABannier, Jan 6, 2024)
035ef16  Added Metal and CUDA backend (PABannier, Jan 6, 2024)
6e4ac9a  updated docs (PABannier, Jan 6, 2024)
ac327a9  cosmit (PABannier, Jan 6, 2024)
d347134  rm submodule (PABannier, Jan 7, 2024)
d7e9661  added encodec submodule (PABannier, Jan 7, 2024)
1fbe29d  remove mem_per_token (PABannier, Jan 7, 2024)
b3d9179  more verbose errors (PABannier, Jan 7, 2024)
94fea82  clean (PABannier, Jan 7, 2024)
bec8547  reset allocr to reduce memory footprint (PABannier, Jan 7, 2024)
df7c22a  add tests (PABannier, Jan 7, 2024)
6fbc184  expose forward passes (PABannier, Jan 7, 2024)
87a102b  enhanced README.md (PABannier, Feb 12, 2024)

Changes from 1 commit: updated README
PABannier committed Jan 6, 2024
commit 38846ecfba9da6c81a575da2a215fd798e044d7f
README.md: 142 changes, 41 additions and 101 deletions
@@ -9,13 +9,10 @@

Inference of [SunoAI's bark model](https://github.com/suno-ai/bark) in pure C/C++.

**Disclaimer: there remain bugs in the inference code. bark is able to generate audio for some prompts or seeds, but it does not work for most prompts. The community's current effort is to fix those bugs in order to release v0.0.2.**

## Description

The main goal of `bark.cpp` is to synthesize audio from a textual input with the [Bark](https://github.com/suno-ai/bark) model efficiently, using only the CPU.
With `bark.cpp`, our goal is to bring **real-time realistic** text-to-speech generation to the community.
Currently, we are focused on porting the [Bark](https://github.com/suno-ai/bark) model to C++.

- [X] Plain C/C++ implementation without dependencies
- [X] AVX, AVX2 and AVX512 for x86 architectures
@@ -42,113 +39,56 @@ Demo on [Google Colab](https://colab.research.google.com/drive/1JVtJ6CDwxtKfFmEd

---

Here are typical audio pieces generated by `bark.cpp`:
Here is a typical run using `bark.cpp`:

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff
```java
make -j && ./main -p "This is an audio generated by bark.cpp"

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1
__ __
/ /_ ____ ______/ /__ _________ ____
/ __ \/ __ `/ ___/ //_/ / ___/ __ \/ __ \
/ /_/ / /_/ / / / ,< _ / /__/ /_/ / /_/ /
/_.___/\__,_/_/ /_/|_| (_) \___/ .___/ .___/
/_/ /_/

Here is a typical run using Bark:

```java
make -j && ./main -p "this is an audio"
I bark.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX: Apple clang version 14.0.0 (clang-1400.0.29.202)

bark_model_load: loading model from './ggml_weights'
bark_model_load: reading bark text model
gpt_model_load: n_in_vocab = 129600
gpt_model_load: n_out_vocab = 10048
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1894.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1701.69 MB
bark_model_load: reading bark vocab

bark_model_load: reading bark coarse model
gpt_model_load: n_in_vocab = 12096
gpt_model_load: n_out_vocab = 12096
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 1
gpt_model_load: n_wtes = 1
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1443.87 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1250.69 MB

bark_model_load: reading bark fine model
gpt_model_load: n_in_vocab = 1056
gpt_model_load: n_out_vocab = 1056
gpt_model_load: block_size = 1024
gpt_model_load: n_embd = 1024
gpt_model_load: n_head = 16
gpt_model_load: n_layer = 24
gpt_model_load: n_lm_heads = 7
gpt_model_load: n_wtes = 8
gpt_model_load: ggml tensor size = 272 bytes
gpt_model_load: ggml ctx size = 1411.25 MB
gpt_model_load: memory size = 192.00 MB, n_mem = 24576
gpt_model_load: model size = 1218.26 MB

bark_model_load: reading bark codec model
encodec_model_load: model size = 44.32 MB

bark_model_load: total model size = 74.64 MB

bark_generate_audio: prompt: 'this is an audio'
bark_generate_audio: number of tokens in prompt = 513, first 8 tokens: 20579 20172 20199 33733 129595 129595 129595 129595
bark_forward_text_encoder: ...........................................................................................................

bark_forward_text_encoder: mem per token = 4.80 MB
bark_forward_text_encoder: sample time = 7.91 ms
bark_forward_text_encoder: predict time = 2779.49 ms / 7.62 ms per token
bark_forward_text_encoder: total time = 2829.35 ms

bark_forward_coarse_encoder: .................................................................................................................................................................
..................................................................................................................................................................

bark_forward_coarse_encoder: mem per token = 8.51 MB
bark_forward_coarse_encoder: sample time = 3.08 ms
bark_forward_coarse_encoder: predict time = 10997.70 ms / 33.94 ms per token
bark_forward_coarse_encoder: total time = 11036.88 ms

bark_forward_fine_encoder: .....

bark_forward_fine_encoder: mem per token = 5.11 MB
bark_forward_fine_encoder: sample time = 39.85 ms
bark_forward_fine_encoder: predict time = 19773.94 ms
bark_forward_fine_encoder: total time = 19873.72 ms



bark_forward_encodec: mem per token = 760209 bytes
bark_forward_encodec: predict time = 528.46 ms / 528.46 ms per token
bark_forward_encodec: total time = 663.63 ms
bark_tokenize_input: prompt: 'this is a dog barking.'
bark_tokenize_input: number of tokens in prompt = 513, first 8 tokens: 20579 20172 10217 27883 28169 25677 10167 129595

Number of frames written = 51840.
Generating semantic tokens: [========> ] (17%)

bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 9.90 ms / 138 tokens
bark_print_statistics: predict time = 3163.78 ms / 22.92 ms per token
bark_print_statistics: total time = 3188.37 ms

Generating coarse tokens: [==================================================>] (100%)

bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 3.96 ms / 410 tokens
bark_print_statistics: predict time = 14303.32 ms / 34.89 ms per token
bark_print_statistics: total time = 14315.52 ms

Generating fine tokens: [==================================================>] (100%)

bark_print_statistics: mem per token = 0.00 MB
bark_print_statistics: sample time = 41.93 ms / 6144 tokens
bark_print_statistics: predict time = 15234.38 ms / 2.48 ms per token
bark_print_statistics: total time = 15282.15 ms

Number of frames written = 51840.

main: load time = 1436.36 ms
main: eval time = 34520.53 ms
main: total time = 35956.92 ms
main: total time = 32786.04 ms
```

Here are typical audio pieces generated by `bark.cpp`:

https://github.com/PABannier/bark.cpp/assets/12958149/f9f240fd-975f-4d69-9bb3-b295a61daaff

https://github.com/PABannier/bark.cpp/assets/12958149/c0caadfd-bed9-4a48-8c17-3215963facc1

## Usage

Here are the steps for the bark model.
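
The concrete steps are not shown in this truncated diff view. Purely as a hedged sketch (not the README's actual instructions), a run consistent with the commands and paths that do appear above might look like this; the recursive clone is an assumption that follows from the new encodec submodule, and `./ggml_weights` is simply the path printed by `bark_model_load` in the log above:

```bash
# Sketch only, not the README's actual steps.
# Clone with submodules (encodec.cpp is now a git submodule) and build.
git clone --recursive https://github.com/PABannier/bark.cpp
cd bark.cpp
make -j

# Run generation; ./main -p is the invocation shown in the example run above.
# Model weights are assumed to already be in ./ggml_weights,
# the directory printed by bark_model_load.
./main -p "This is an audio generated by bark.cpp"
```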