
Improve the seeking algorithm #934

Closed (wants to merge 1 commit)

Conversation

@jumon (Contributor) commented Feb 7, 2023

Problem

The current implementation of the transcribe function does not add the last segment to the result when the window contains multiple segments but no partially included segment at the end. This is inefficient (and can cause hallucinations), because that portion of the audio is decoded again in the next iteration.

For example, the current implementation transcribes the audio file (attached at the end of this PR) as follows.
Note that I added `print(f"line 185: tokenizer.decode_with_timestamps(tokens) = {tokenizer.decode_with_timestamps(tokens)}")` at whisper/transcribe.py#L185 to inspect the decoded tokens.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?<|24.00|><|24.00|>
[00:08.520 --> 00:32.520]  So, you know what the answer to this question now is, is it possible to buy a cell phone that doesn't do too much?

We can see that the decoding result of the first iteration was <|0.00|> And do you ....... to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>, but the seek position only advanced to <|8.52|>, so the audio after that timestamp was decoded again in the next iteration. That second window starts at 8.52 s and contains almost no remaining speech, which led to the hallucinated segment spanning 00:08.520 --> 00:32.520.

Solution

This PR fixes the issue by sliding the seek position by the full length of the decoded window when there is no partial segment at the end of the current window (a sketch of the logic follows the output below). With this fix, the output is as follows; only a single decoding iteration happens, without hallucination.

> whisper test_audio.mp4 --output_dir output
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
line 185: tokenizer.decode_with_timestamps(tokens) = <|0.00|> And do you know what the answer to this question now is?<|3.24|><|3.24|> The answer is no.<|5.30|><|5.30|> It is not possible to buy a cell phone that doesn't do too much.<|8.52|><|8.52|> So.<|9.02|>
[00:00.000 --> 00:03.240]  And do you know what the answer to this question now is?
[00:03.240 --> 00:05.300]  The answer is no.
[00:05.300 --> 00:08.520]  It is not possible to buy a cell phone that doesn't do too much.
[00:08.520 --> 00:09.020]  So.
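
For illustration, here is a minimal, self-contained sketch of the seek rule in question. The names (`timestamp_begin`, `input_stride`) mirror those used in whisper/transcribe.py, but the function below is a simplified reconstruction of the idea, not the actual diff.

```python
def next_seek_increment(tokens, timestamp_begin, window_frames, input_stride):
    """How far to advance the seek pointer after decoding one 30 s window.

    tokens          : decoded token ids for the window (text + timestamp tokens)
    timestamp_begin : id of the <|0.00|> timestamp token
    window_frames   : number of mel frames in the current window
    input_stride    : mel frames per timestamp step
    """
    is_ts = [t >= timestamp_begin for t in tokens]
    # A single timestamp at the very end means the last segment is complete
    # and no partial segment follows it (e.g. "... So.<|9.02|>").
    single_timestamp_ending = len(is_ts) >= 2 and is_ts[-1] and not is_ts[-2]
    # Positions of the second token in each consecutive timestamp pair,
    # i.e. the boundaries between complete segments.
    consecutive = [i for i in range(1, len(is_ts)) if is_ts[i - 1] and is_ts[i]]

    if consecutive and not single_timestamp_ending:
        # A partial segment remains at the end: seek to the last complete
        # timestamp and re-decode the tail in the next iteration.
        last_ts_position = tokens[consecutive[-1] - 1] - timestamp_begin
        return last_ts_position * input_stride
    # No partial segment at the end: slide by the full window length,
    # instead of stopping at the last consecutive timestamp and
    # re-decoding the final segment as before.
    return window_frames
```

In the example above, the first window ends with the single timestamp <|9.02|>, so the seek pointer jumps past the whole decoded window instead of stopping at <|8.52|>.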

This is the sample audio file taken from the TEDLIUM2 corpus (found at https://www.openslr.org/19/, licensed under CC BY-NC-ND 3.0).

test_audio.mp4

@glangford commented Feb 7, 2023

Looking at the existing behaviour (without this proposed fix), the results differ depending on the model used. For example, the large model transcribes only a portion of test_audio, but doesn't hallucinate:

1
00:00:00,000 --> 00:00:06,500
And do you know what the answer to this question now is?

Have you tested with models of other sizes?

(edit: just wondering if this proposed change fixes multiple problems, such as missed speech, which I see more often than hallucination)

@jumon (Contributor, Author) commented Feb 8, 2023

I used the large-v2 model and got the same outcome as you did. Unfortunately, that is a performance issue with the model itself, which my PR cannot resolve; the PR mainly aims to remove the redundant decoding and reduce the decoding time.

@jongwook (Collaborator) commented Mar 7, 2023

Hi! I realized I fixed the same issue in #1033 without reviewing this PR. Sorry! Please feel free to reopen if I missed anything in that fix.

@jongwook closed this Mar 7, 2023
@jumon (Contributor, Author) commented Mar 7, 2023

No need to worry! I've checked #1033, and it looks all good; it does the same thing as this PR.
