
Improve the seeking algorithm #934

Closed (wants to merge 1 commit)

Conversation

@jumon (Contributor) commented Feb 7, 2023

Problem

The current implementation of the transcribe function does not add the last segment to the result when the window contains multiple segments but no partially included segment at the end. This is inefficient (and can cause hallucinations), because that portion of the audio is decoded again in the next iteration.

For example, the current implementation transcribes the audio file (attached at the end of this PR) as follows.
Note that I added `print(f"line 185: tokenizer.decode_with_timestamps(tokens) = {tokenizer.decode_with_timestamps(tokens)}")` at whisper/transcribe.py#L185 to inspect the decoded tokens.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?<|24.00|><|24.00|>
[00:08.520 --> 00:32.520]  So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?

We can see that the decoding result of the first iteration was <|0.00|> And do you ....... to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>, but the seek position only advanced to <|8.52|>, so the audio after that timestamp was decoded again in the next iteration. That second window starts at 8.52 s and contains almost no remaining speech, which led to the hallucinated segment spanning 00:08.520 --> 00:32.520.

Solution

This PR fixes the issue by sliding the seek position by the full length of the decoded window when there is no partial segment at the end of the current window (a sketch of the logic follows the output below). With this fix, the output is as follows; only a single decoding iteration happens, without hallucination.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
[00:08.520 --> 00:09.020]  So.
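
For illustration, here is a minimal, self-contained sketch of the seek rule in question. The names (`timestamp_begin`, `input_stride`) mirror those used in whisper/transcribe.py, but the function below is a simplified reconstruction of the idea, not the actual diff.

```python
def next_seek_increment(tokens, timestamp_begin, window_frames, input_stride):
    """How far to advance the seek pointer after decoding one 30 s window.

    tokens          : decoded token ids for the window (text + timestamp tokens)
    timestamp_begin : id of the <|0.00|> timestamp token
    window_frames   : number of mel frames in the current window
    input_stride    : mel frames per timestamp step
    """
    is_ts = [t >= timestamp_begin for t in tokens]
    # A single timestamp at the very end means the last segment is complete
    # and no partial segment follows it (e.g. "... So.<|9.02|>").
    single_timestamp_ending = len(is_ts) >= 2 and is_ts[-1] and not is_ts[-2]
    # Positions of the second token in each consecutive timestamp pair,
    # i.e. the boundaries between complete segments.
    consecutive = [i for i in range(1, len(is_ts)) if is_ts[i - 1] and is_ts[i]]

    if consecutive and not single_timestamp_ending:
        # A partial segment remains at the end: seek to the last complete
        # timestamp and re-decode the tail in the next iteration.
        last_ts_position = tokens[consecutive[-1] - 1] - timestamp_begin
        return last_ts_position * input_stride
    # No partial segment at the end: slide by the full window length,
    # instead of stopping at the last consecutive timestamp and
    # re-decoding the final segment as before.
    return window_frames
```

In the example above, the first window ends with the single timestamp <|9.02|>, so the seek pointer jumps past the whole decoded window instead of stopping at <|8.52|>.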

This is the sample audio file taken from the TEDLIUM2 corpus (found at https://www.openslr.org/19/, licensed under CC BY-NC-ND 3.0).

test_audio.mp4

@glangford commented Feb 7, 2023

Looking at the existing behaviour (without this proposed fix), the results differ depending on the model used. For example, the large model transcribes only a portion of test_audio, but doesn't hallucinate:

1
00:00:00,000 --> 00:00:06,500
And do you know what the answer to this question now is?

Have you tested with models of other sizes?

(edit: just wondering if this proposed change fixes multiple problems, such as missed speech, which I see more often than hallucination)

@jumon (Contributor, Author) commented Feb 8, 2023

I used the large-v2 model and got the same outcome as you did. Unfortunately, that is a performance issue with the model itself, which my PR cannot resolve; the PR mainly aims to remove the redundant decoding and reduce the decoding time.

@jongwook (Collaborator) commented Mar 7, 2023

Hi! I realized I fixed the same issue in #1033 without reviewing this PR. Sorry! Please feel free to reopen if I missed anything in that fix.

@jongwook closed this Mar 7, 2023
@jumon (Contributor, Author) commented Mar 7, 2023

No need to worry! I've checked #1033, and it looks all good; it does the same thing as this PR.
