
Fix the decoding issues #1768

Open · wants to merge 58 commits into master
Conversation

@bobqianic (Collaborator) commented Jan 14, 2024

  • Basic functionality
  • Rewrite whisper_wrap_segment
  • Rewrite L5717-L5805
  • Remove print_realtime (too tricky)
  • Remove hallucination by using token_nosp (see the sketch after this list)
  • Heuristic hallucination detection (basic implementation)
  • Disable beam search when temperature > 0
  • Fix tokenizer
  • Fix audio feature seeking mechanism
  • Use compression ratio instead of entropy (will be addressed in separate PRs)
  • Code cleanup
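
For context, suppressing non-speech tokens generally means masking their logits before sampling, so the decoder can never emit tokens like "(music)" or "♪". A minimal sketch of that idea (the standard technique, not necessarily this PR's exact code; `non_speech_tokens` is a hypothetical list that would come from the vocabulary):

```cpp
#include <cmath>
#include <vector>

// Push the logits of all non-speech tokens to -inf so the sampler can never
// pick them. `non_speech_tokens` is hypothetical here; in whisper.cpp it
// would be derived from the tokenizer's vocabulary.
static void suppress_non_speech(std::vector<float> & logits,
                                const std::vector<int> & non_speech_tokens) {
    for (int id : non_speech_tokens) {
        logits[id] = -INFINITY;
    }
}
```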

@bobqianic added the decoding (Decoding related issues) label on Jan 17, 2024
@bobqianic (Collaborator, Author) commented Feb 10, 2024

> I looked at the code and this is unrelated to the non-speech token changes, right?

Yes. In situations where the model hallucinates with high confidence (high avg_log_probs), the non-speech token approach will not be effective. The heuristic repetition check that I've implemented serves as a workaround for the compression-ratio check: implementing compression in C++ is challenging without third-party libraries. The official OpenAI implementation uses both anti-hallucination mechanisms, the compression ratio and the non-speech tokens.
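
To illustrate the idea (an illustrative sketch only, not this PR's actual implementation): a repetition heuristic can flag a segment when a single n-gram dominates it, which approximates what a compression-ratio check would catch without pulling in a compression library.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Flag a segment as likely hallucinated repetition when one n-gram of tokens
// accounts for more than `thold` of all n-grams in the segment.
static bool looks_repetitive(const std::vector<std::string> & tokens,
                             const size_t n = 3, const double thold = 0.5) {
    if (tokens.size() < 2*n) {
        return false;
    }
    std::map<std::string, int> counts;
    int total = 0;
    for (size_t i = 0; i + n <= tokens.size(); ++i) {
        std::string key;
        for (size_t j = 0; j < n; ++j) {
            key += tokens[i + j];
            key += '\x1f'; // unit separator, so token boundaries can't merge
        }
        ++counts[key];
        ++total;
    }
    int max_count = 0;
    for (const auto & kv : counts) {
        max_count = std::max(max_count, kv.second);
    }
    return (double) max_count / total > thold;
}
```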

[00:57:11.700 --> 00:57:14.700] (c) 2014 University of Georgia College of Agricultural and Environmental Sciences UGA Extension Office of Communications and Creative Services

Which branch are you using? I can't find the hallucinations you mentioned.

large-v2

[screenshot]

@jettoblack

> Which branch are you using? I can't find the hallucinations you mentioned.

I was using this PR @ 476dff4, unless I did something wrong, but this was on a Mac using the Metal GPU backend, so that could make a difference. I'll retest on CPU and CUDA shortly and let you know.

@ukolovda commented Feb 16, 2024

Hi!

@bobqianic the new version is very robust!

On my test files, the main branch emits 10 hallucinations across 26 WAV files (model ggml-large-v2, Russian language). With this PR it gives only 2 hallucinations. A very good result!

But examples/server doesn't work at all, in both the CPU and CUDA builds. It returns empty text without any errors. I tried patching it and adding the new parameters (heuristic and others), but that didn't help. With --print-progress it prints the progress, but no result text.

It also gives an error on one specific file:
500 Internal Server Error map::at

What can we do to fix it, what do you think?

Run server command:

/usr/src/whisper.cpp-bobqianic/server -m ../../models/ggml-large-v2.bin -l ru --print-progress --print-realtime -nt -nf

whisper_init_from_file_with_params_no_state: loading model from '../../models/ggml-large-v2.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094.49 MB (3 buffers)
whisper_model_load: model size    = 3093.99 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220.20 MB
whisper_init_state: kv cross size =  245.76 MB
whisper_init_state: compute buffer (conv)   =   33.91 MB
whisper_init_state: compute buffer (encode) =  233.50 MB
whisper_init_state: compute buffer (cross)  =   10.15 MB
whisper_init_state: compute buffer (decode) =  108.99 MB

whisper server listening at http://127.0.0.1:8080

Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav' (168960 samples, 10.6 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-14.wav
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav' (235200 samples, 14.7 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-15.wav

whisper_print_progress_callback: progress = 204%
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav' (512000 samples, 32.0 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-16.wav

whisper_print_progress_callback: progress =  93%

whisper_print_progress_callback: progress = 187%
Received request: 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav
Successfully loaded 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav

system_info: n_threads = 4 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

operator(): processing '0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav' (115520 samples, 7.2 sec), 4 threads, 1 processors, lang = ru, task = transcribe, timestamps = 0 ...

Running whisper.cpp inference on 0f3657ce-6352-4cbb-a88f-b39dc6a37a34-1-18.wav

whisper_print_progress_callback: progress = 416%
...

Send file command:

curl localhost:8080/inference -H "Content-Type: multipart/form-data" -F file="@${filename}"

git diff result:

diff --git a/examples/server/server.cpp b/examples/server/server.cpp
index cf0157d..5030e87 100644
--- a/examples/server/server.cpp
+++ b/examples/server/server.cpp
@@ -64,6 +64,7 @@ struct whisper_params {
     float word_thold      =  0.01f;
     float entropy_thold   =  2.40f;
     float logprob_thold   = -1.00f;
+    float no_speech_thold =  0.60f;
     float temperature     =  0.00f;
     float temperature_inc =  0.20f;
 
@@ -78,6 +79,8 @@ struct whisper_params {
     bool print_realtime  = false;
     bool print_progress  = false;
     bool no_timestamps   = false;
+    bool suppress_nst    = true;  // suppress non speech tokens
+    bool heuristic       = true;
     bool use_gpu         = true;
 
     std::string language        = "en";
@@ -183,7 +186,10 @@ bool whisper_params_parse(int argc, char ** argv, whisper_params & params, serve
         else if (arg == "-wt"   || arg == "--word-thold")      { params.word_thold      = std::stof(argv[++i]); }
         else if (arg == "-et"   || arg == "--entropy-thold")   { params.entropy_thold   = std::stof(argv[++i]); }
         else if (arg == "-lpt"  || arg == "--logprob-thold")   { params.logprob_thold   = std::stof(argv[++i]); }
+        else if (arg == "-nst"  || arg == "--nospeech-thold")  { params.no_speech_thold = std::stof(argv[++i]); }
         // else if (arg == "-su"   || arg == "--speed-up")        { params.speed_up        = true; }
+        else if (arg == "-nsnst"|| arg == "--no-suppress-nst") { params.suppress_nst    = false; }
+        else if (arg == "-nh"   || arg == "--no-heuristic")    { params.heuristic       = false; }
         else if (arg == "-tr"   || arg == "--translate")       { params.translate       = true; }
         else if (arg == "-di"   || arg == "--diarize")         { params.diarize         = true; }
         else if (arg == "-tdrz" || arg == "--tinydiarize")     { params.tinydiarize     = true; }
@@ -726,6 +732,7 @@ int main(int argc, char ** argv) {
             wparams.max_len          = params.max_len == 0 ? 60 : params.max_len;
 
             wparams.speed_up         = params.speed_up;
+            wparams.heuristic        = params.heuristic;
 
             wparams.tdrz_enable      = params.tinydiarize; // [TDRZ]
 
@@ -738,8 +745,11 @@ int main(int argc, char ** argv) {
             wparams.temperature_inc  = params.temperature_inc;
             wparams.entropy_thold    = params.entropy_thold;
             wparams.logprob_thold    = params.logprob_thold;
+            wparams.no_speech_thold  = params.no_speech_thold;
 
             wparams.no_timestamps    = params.no_timestamps;
+            wparams.suppress_non_speech_tokens = params.suppress_nst;
+
             wparams.token_timestamps = !params.no_timestamps && params.response_format == vjson_format;
 
             whisper_print_user_data user_data = { &params, &pcmf32s, 0 };

Thank you!

@felrock (Collaborator) commented Feb 17, 2024

Hello @ukolovda, I took a look at this yesterday evening. What's missing in server.cpp is what you mentioned:

  • heuristic
  • suppress_nst
  • no_speech_thold

I got output in the terminal by circumventing the print_realtime flag (using a segment callback instead). So the model does in fact generate the output string, but for some unknown reason whisper_full_n_segments(ctx) returns 0. I'll try to check this a bit more tomorrow.
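
For reference, these are roughly the two paths involved (whisper_full_n_segments, whisper_full_get_segment_text, and new_segment_callback are real whisper.cpp API; the wiring shown is a simplified sketch, not server.cpp as it stands):

```cpp
#include "whisper.h"
#include <cstdio>

// "Real-time" path: print each segment from the new-segment callback as it
// is produced during decoding.
static void on_new_segment(struct whisper_context * ctx, struct whisper_state * /*state*/,
                           int n_new, void * /*user_data*/) {
    const int n = whisper_full_n_segments(ctx);
    for (int i = n - n_new; i < n; ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i));
    }
}

// Batch path, as used by the server: read the segments back after
// whisper_full() returns. This is the path that is unexpectedly empty here.
//
//     wparams.new_segment_callback = on_new_segment;
//     if (whisper_full(ctx, wparams, pcmf32.data(), pcmf32.size()) == 0) {
//         for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
//             printf("%s", whisper_full_get_segment_text(ctx, i));
//         }
//     }
```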

@ukolovda

> I got output in the terminal by circumventing the print_realtime flag (using a segment callback instead). So the model does in fact generate the output string, but for some unknown reason whisper_full_n_segments(ctx) returns 0.

Hello, @felrock !

Thank you!

@ukolovda commented Feb 20, 2024

Adding a related issue with a zero-filled WAV: #1881

@ukolovda commented Feb 20, 2024

The file from #1881 (zero-filled WAV) gives a hallucination in this version too ("Продолжение следует..." means "To be continued..."):

$ ../whisper.cpp-bobqianic/main -m ./models/ggml-large-v3.bin -l ru --threads 8 -mc 0 samples/zeroes.wav
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-large-v3.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51866
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 1280
whisper_model_load: n_audio_head  = 20
whisper_model_load: n_audio_layer = 32
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 1280
whisper_model_load: n_text_head   = 20
whisper_model_load: n_text_layer  = 32
whisper_model_load: n_mels        = 128
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 5 (large v3)
whisper_model_load: adding 1609 extra tokens
whisper_model_load: n_langs       = 100
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
whisper_backend_init: using CUDA backend
whisper_model_load:    CUDA0 total size =  3094,86 MB (3 buffers)
whisper_model_load: model size    = 3094,36 MB
whisper_backend_init: using CUDA backend
whisper_init_state: kv self size  =  220,20 MB
whisper_init_state: kv cross size =  245,76 MB
whisper_init_state: compute buffer (conv)   =   35,50 MB
whisper_init_state: compute buffer (encode) =  233,50 MB
whisper_init_state: compute buffer (cross)  =   10,15 MB
whisper_init_state: compute buffer (decode) =  108,99 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | CUDA = 1 | COREML = 0 | OPENVINO = 0 | 

run: processing 'samples/zeroes.wav' (19200 samples, 1,2 sec), 8 threads, 1 processors, 5 beams + best of 5, lang = ru, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:29.980]   Продолжение следует...


whisper_print_timings:     load time =   781,61 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     4,81 ms
whisper_print_timings:   sample time =    28,10 ms /    79 runs (    0,36 ms per run)
whisper_print_timings:   encode time =   162,31 ms /     1 runs (  162,31 ms per run)
whisper_print_timings:   decode time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:   batchd time =   482,89 ms /    77 runs (    6,27 ms per run)
whisper_print_timings:   prompt time =     0,00 ms /     1 runs (    0,00 ms per run)
whisper_print_timings:    total time =  1502,74 ms

@linmi commented Feb 21, 2024

--output-json-full has problems with the output format.

  • Language: Chinese
    [screenshot: CleanShot 2024-02-21 at 16 41 35@2x]

@thewh1teagle (Contributor)

What's the status of this PR? Is it safe to use?
I'm experiencing decoding issues: thewh1teagle/vibe#34

@jwijffels (Contributor) commented Apr 5, 2024

I'm thinking about including this pull request in the R wrapper audio.whisper. There, the current approach to handling some of the hallucinations is to use the R packages audio.vadwebrtc or audio.vadsilero to detect silences or generally non-voiced signals and then either

  • instead of looping over different files in the main loop, loop over the detected non-silence sections in the audio, or
  • create a new audio file with only the voiced audio and recompute the timestamps later by adding back what was left out (a sketch of this timestamp bookkeeping follows below).

I haven't looked into the finer details of this pull request (I only skimmed the logic changed in main.cpp and whisper.cpp), but would it already make sense to incorporate it into audio.whisper, or are a lot of changes still to be expected here, or is this pull request going to be split into a BPE change (#1854) and a change regarding how to handle non-speech?
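
As a sketch of the second bullet's timestamp bookkeeping (hypothetical helper, not audio.whisper or whisper.cpp code): keep, for each voiced span, its start in the original audio and its start in the concatenated voiced-only audio; mapping a decoded timestamp back is then a lookup plus an offset.

```cpp
#include <cstdint>
#include <vector>

// One voiced span: where it starts in the original audio, where it starts in
// the concatenated voiced-only audio, and its length. All hypothetical names.
struct voiced_span {
    int64_t t_orig_ms; // start in the original file
    int64_t t_cut_ms;  // start in the voiced-only file fed to whisper
    int64_t len_ms;
};

// Map a timestamp from the voiced-only timeline back to the original one.
// `spans` must be sorted by t_cut_ms.
static int64_t to_original_time(int64_t t_cut, const std::vector<voiced_span> & spans) {
    for (const auto & s : spans) {
        if (t_cut < s.t_cut_ms + s.len_ms) {
            return s.t_orig_ms + (t_cut - s.t_cut_ms);
        }
    }
    return t_cut; // past the last span; return unchanged as a fallback
}
```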

@ronyfadel

@bobqianic are you pursuing this at the moment?

@bobqianic (Collaborator, Author)

> @bobqianic are you pursuing this at the moment?

No, at least not in May. I'm really tied up with a lot of things this month.

@bygreencn

> I'm thinking about including this pull request in the R wrapper audio.whisper. There, the current approach to handling some of the hallucinations is to use the R packages audio.vadwebrtc or audio.vadsilero to detect silences or generally non-voiced signals and then either
>
>   • instead of looping over different files in the main loop, loop over the detected non-silence sections in the audio, or
>   • create a new audio file with only the voiced audio and recompute the timestamps later by adding back what was left out.
>
> I haven't looked into the finer details of this pull request (I only skimmed the logic changed in main.cpp and whisper.cpp), but would it already make sense to incorporate it into audio.whisper, or are a lot of changes still to be expected here, or is this pull request going to be split into a BPE change (#1854) and a change regarding how to handle non-speech?

The best way to include Silero voice activity detection in whisper.cpp is to add a third-party dependency on the onnxruntime 1.12.1 DLL and then call the Silero ONNX model. My branch has added it. But even with VAD, hallucinations on silent intervals still happen.

@IntendedConsequence

> The best way to include Silero voice activity detection in whisper.cpp is to add a third-party dependency on the onnxruntime 1.12.1 DLL and then call the Silero ONNX model. My branch has added it. But even with VAD, hallucinations on silent intervals still happen.

I recommend considering a previous Silero VAD version, namely v3.1. The current version, v4 (at the time of writing), often hallucinates speech on lengthy chunks of silent or near-silent audio:
snakers4/silero-vad#369
snakers4/silero-vad#396

But you have to add a heavyweight dependency like onnxruntime just to run a 750 KB model. The smallest I could reduce onnxruntime.dll to was about 2.2 MB, which is still 3x the size of the Silero weights, and that requires a lengthy custom build of onnxruntime from source with reduced operator-set configs and other size-reduction options. Prebuilt redistributables are easily 5-9 MB or more.

I have a working Silero v3.1 implementation in pure C, but as much as I would like to suggest it as an option, the code is quite bad; I wrote it as a personal project for learning low-level neural nets.
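
Just to illustrate the shape of such an integration without the ONNX dependency, here is a deliberately naive energy gate (not Silero, and the threshold is made up): skip chunks whose RMS falls below a floor before handing them to whisper_full().

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Return true when a chunk of float PCM has enough energy to be worth
// transcribing. `rms_thold` is an arbitrary illustrative value.
static bool chunk_has_speech(const std::vector<float> & pcmf32,
                             size_t off, size_t len, float rms_thold = 0.01f) {
    const size_t end = std::min(off + len, pcmf32.size());
    if (end <= off) {
        return false;
    }
    double acc = 0.0;
    for (size_t i = off; i < end; ++i) {
        acc += (double) pcmf32[i] * pcmf32[i];
    }
    return std::sqrt(acc / (double)(end - off)) > rms_thold;
}
```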

@ziegenberg (Contributor)

@bobqianic, could you rebase your changes? I'd like to test those improvements of yours with production data on our setup.

(Review comment on examples/main/main.cpp: outdated, resolved.)
@bobqianic (Collaborator, Author)

@ziegenberg I did some testing, and it LGTM. If the CI is mostly green, you can proceed with your testing now.

@ziegenberg (Contributor)

I already did some testing and fixed some of the errors on my own. It looks promising: I see fewer hallucinations, but I need to do some more statistics. I will switch to your branch for the next tests.

Is your PR #1854 also related to this improvement?

@bobqianic (Collaborator, Author)

> Is your PR #1854 also related to this improvement?

PR #1854 is a subset of this PR, meaning this PR includes everything in PR #1854.

@ziegenberg (Contributor)

What data/statistics would you need from my side to consider this PR validated and get it merged?

@bobqianic (Collaborator, Author)

> What data/statistics would you need from my side to consider this PR validated and get it merged?

Thank you. If you have the ground truth text, please calculate the WER.
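
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A self-contained sketch, with whitespace tokenization as a deliberate simplification:

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Word error rate: word-level Levenshtein distance between reference and
// hypothesis, divided by the number of reference words.
static double wer(const std::string & ref, const std::string & hyp) {
    auto split = [](const std::string & s) {
        std::istringstream iss(s);
        std::vector<std::string> w;
        for (std::string t; iss >> t; ) w.push_back(t);
        return w;
    };
    const auto r = split(ref);
    const auto h = split(hyp);
    std::vector<std::vector<size_t>> d(r.size() + 1, std::vector<size_t>(h.size() + 1));
    for (size_t i = 0; i <= r.size(); ++i) d[i][0] = i;
    for (size_t j = 0; j <= h.size(); ++j) d[0][j] = j;
    for (size_t i = 1; i <= r.size(); ++i) {
        for (size_t j = 1; j <= h.size(); ++j) {
            const size_t sub = d[i-1][j-1] + (r[i-1] == h[j-1] ? 0 : 1);
            d[i][j] = std::min({ d[i-1][j] + 1, d[i][j-1] + 1, sub });
        }
    }
    return r.empty() ? 0.0 : (double) d[r.size()][h.size()] / (double) r.size();
}
```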

@Makememo

I tested the output on anime using medium.en and found a problem with timestamp recognition in the middle.

file: https://dropover.cloud/f7

[screenshot]

@ziegenberg (Contributor)

Hi @Makememo,
was this a singular incident or does this happen regularly?

@Makememo

> Hi @Makememo,
> was this a singular incident or does this happen regularly?

I tested three videos and hit this problem in all of them.

What these videos have in common is a music section, and that is where the timeline starts to get messed up.

Labels: decoding (Decoding related issues), research🔬