
attempt to fix the repetition/hallucination issue identified in #1046 #1052

Merged: 5 commits, Mar 8, 2023

Conversation

jongwook (Collaborator) commented Mar 7, 2023

No description provided.

ryanheise (Contributor)

Hi @jongwook, not sure if you saw the comment below, but it includes a reproduction case that might be useful:

#869 (comment)

The repetition persists with this PR.

jongwook (Collaborator, Author) commented Mar 8, 2023

@ryanheise Thanks! Will look into it...

glangford commented Mar 8, 2023

The problem triggered by @ryanheise's test data is model-sensitive. I see the problem with small, but either small.en or medium.en looks OK, although the timing of the last few words is off. Below is the mp3 fragment converted to video to show the English subtitles.

ryan-test-sub.mp4

jongwook (Collaborator, Author) commented Mar 8, 2023

Thanks all! The incorrect zero-padding of the Mel spectrogram, as identified in #730 and #838, was contributing to this error. The fix in 477f0be appears to resolve the repetition issue.
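To illustrate why the padding domain matters: a zero in a log-Mel spectrogram corresponds to unit power, not silence, so zero-padding the spectrogram feeds the model loud-looking frames, while padding the raw audio yields frames at the log floor. A toy sketch (toy_log_mel is hypothetical, not Whisper's actual log_mel_spectrogram, which uses an STFT and mel filterbank):

```python
import numpy as np

# Hypothetical stand-in for a log-Mel computation: mean power per frame,
# then log10 with a floor to avoid log(0).
def toy_log_mel(audio, frame=4):
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    power = (frames ** 2).mean(axis=1)
    return np.log10(np.maximum(power, 1e-10))

audio = np.random.default_rng(0).normal(size=16)

# Correct: pad the raw audio with silence, then take the spectrogram.
mel_from_padded_audio = toy_log_mel(np.concatenate([audio, np.zeros(16)]))

# Incorrect: take the spectrogram first, then pad it with zeros.
mel_zero_padded = np.concatenate([toy_log_mel(audio), np.zeros(4)])

# Silence maps to the log floor (about -10 here), while a literal 0 in
# log space means unit power, which looks like loud audio to the model.
print(mel_from_padded_audio[-1])  # about -10.0
print(mel_zero_padded[-1])        # 0.0
```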

ryanheise (Contributor)

> The fix in 477f0be appears to fix the repetition issue.

I can confirm this fixed my example, thanks! 👍

> Below is the mp3 fragment converted to video to show the English subtitles.

@glangford FYI the subtitles didn't show in your video.

glangford

@ryanheise Inline (on Mac at least), you may need to click the >> on the right to turn on subtitles. Or download it and view it with VLC, QuickTime, or another player and enable subtitles in the viewer.

jongwook merged commit 919a713 into main on Mar 8, 2023
ryanheise (Contributor) commented Mar 8, 2023

Ah, I see. Firefox doesn't show any options, but downloading it and opening it in VLC works. You can also do hard subs this way: #435 (reply in thread)

> using either small.en or medium.en looks ok although the timing of the last few words is off.

Here is the base model for comparison, which appears more accurate on the last few words:

69-whiskey-clip.mp4

m-bain commented Mar 8, 2023

Btw, have you guys tried longer audio, e.g. 5 minutes? I am still getting a lot of repetition even with this fix, e.g. on the TEDLIUM test set file "AimeeMullins_2009P.wav":

[02:10.440 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.
[02:14.720 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.
[02:15.460 --> 02:18.580] I mean from this entry, it would seem that
[02:18.580 --> 02:22.800] I was born into a world that perceived someone like me
[02:22.800 --> 02:23.340] I was born into a world that perceived someone like me
[02:23.340 --> 02:27.540] to have nothing positive, whatsoever, going for them
[02:27.540 --> 02:27.540] to have nothing positive, whatsoever, going for them
[02:27.540 --> 02:35.340] When in fact today, I'm celebrated for the opportunities and adventures my life has procured
[02:35.340 --> 02:35.960] When in fact today, I'm celebrated for the opportunities and adventures my life has procured
[02:35.960 --> 02:42.140] So I immediately went to look up the 2009 online edition
[02:42.140 --> 02:42.160] So I immediately went to look up the 2009 online edition
[02:42.160 --> 02:42.160] So I immediately went to look up the 2009 online edition

I was hoping to update the word segmentation results for Whisper-only word timestamps in our paper https://arxiv.org/abs/2303.00747

But currently I am getting better results with our implementation, which is similar to https://github.com/linto-ai/whisper-timestamped

glangford

> Btw have you guys tried with longer audio, e.g. 5 mins long? I am still getting a lot of repetition even with this fix.

I am testing a longer audio file now (running on CPU, larger model, transcript+transcribe, so it is taking a while). For clarity:

  • Are you running the 20230307 release version? With, or without --word_timestamps?
  • The repetitions from "AimeeMullins_2009P.wav" above: are they from the verbose print to the console?

It seems like there are several possible sources of error across the different discussions:

  • model hallucination
  • new repetition introduced or magnified by --word_timestamps True
  • (hand waving) segmentation issues

m-bain commented Mar 8, 2023

> are you running the 20230307 release version? with, or without --word_timestamps?

yes

> the repetitions from "AimeeMullins_2009P.wav" above, are they from verbose print to the console?

yes

glangford

@jongwook Note from @m-bain's example above the repetition occurring with the verbose print. The repetitions in this example are all "instantaneous", i.e. the same start and end time:

[02:14.720 --> 02:14.720] and needless to say, thank God, I wasn't using a thesaurus back then.

They are printed but then immediately cleared by this code, which looks like a bug unique to --verbose True:

# if a segment is instantaneous or does not contain text, clear it
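A minimal sketch of that ordering (hypothetical code, not Whisper's actual implementation; the segment fields mirror its output dicts):

```python
# Hypothetical sketch of the ordering bug: verbose printing happens before
# the pass that clears instantaneous or empty segments, so a cleared
# segment has already reached the console.
segments = [
    {"start": 130.44, "end": 134.72,
     "text": "and needless to say, thank God, I wasn't using a thesaurus back then."},
    {"start": 134.72, "end": 134.72,  # instantaneous: zero duration
     "text": "and needless to say, thank God, I wasn't using a thesaurus back then."},
]

# 1) Verbose printing comes first ...
for seg in segments:
    print(f"[{seg['start']:.2f} --> {seg['end']:.2f}] {seg['text']}")

# 2) ... and the clearing pass runs afterwards, so the duplicate was
# printed even though it never reaches the .srt/.vtt writers.
for seg in segments:
    if seg["end"] - seg["start"] <= 0 or not seg["text"].strip():
        seg["text"] = ""
```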

glangford

@m-bain Given this, could you rerun and see whether the formal output formats are messed up or not, using --verbose False?

m-bain commented Mar 8, 2023

This is not a verbose error, and the start and end times of the repetitions are not always instantaneous; see the output for the .srt file without verbose:

271
00:02:14,440 --> 00:02:14,720
and needless to say, thank God, I wasn't using a thesaurus back then.

272
00:02:14,720 --> 00:02:14,720

273
00:02:15,460 --> 00:02:16,180
I mean from this entry, it would seem that

274
00:02:16,180 --> 00:02:16,360
I mean from this entry, it would seem that

275
00:02:16,360 --> 00:02:16,960
I mean from this entry, it would seem that

276
00:02:16,960 --> 00:02:17,220
I mean from this entry, it would seem that

277
00:02:17,220 --> 00:02:17,620
I mean from this entry, it would seem that

278
00:02:17,620 --> 00:02:17,800
I mean from this entry, it would seem that
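As a stopgap for output like the above, consecutive repeats could be collapsed in post-processing; a hedged sketch (collapse_repeats is hypothetical and not part of Whisper, though the segment field names mirror its output):

```python
# Hypothetical workaround: collapse consecutive segments whose text
# repeats, keeping the combined time span of the run.
def collapse_repeats(segments):
    merged = []
    for seg in segments:
        if merged and seg["text"].strip() == merged[-1]["text"].strip():
            merged[-1]["end"] = seg["end"]  # extend the previous span
        else:
            merged.append(dict(seg))
    return merged

subs = [
    {"start": 135.46, "end": 136.18, "text": "I mean from this entry, it would seem that"},
    {"start": 136.18, "end": 136.36, "text": "I mean from this entry, it would seem that"},
    {"start": 136.36, "end": 142.14, "text": "So I immediately went to look up the 2009 online edition"},
]
out = collapse_repeats(subs)
print(len(out))       # 2
print(out[0]["end"])  # 136.36
```

This hides the symptom in the subtitle file but does not address the underlying decoding issue.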

glangford

So there are at least two problems, then:

  • verbose mode can print cleared segments
  • something else triggered by word_timestamps

Given how close the start/end times are, it feels like something related to seek_shift is still off:

seek = previous_seek + seek_shift

@m-bain Do the same repetitions happen with --word_timestamps False, or not?
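To put rough numbers on the seek_shift concern (the frame counts below are illustrative assumptions, not values pulled from the code):

```python
# If seek_shift is tiny, the next decoding window starts almost where the
# previous one did, so nearly the same audio is re-transcribed, which is
# one way repeated segments can arise.
FRAMES_PER_SECOND = 100                   # assumed mel frames per second
WINDOW_FRAMES = 30 * FRAMES_PER_SECOND    # each decode sees a 30 s window

previous_seek = 13444   # window started at ~134.44 s
seek_shift = 28         # decoder only advanced 0.28 s

# The advance rule quoted above:
seek = previous_seek + seek_shift

overlap = WINDOW_FRAMES - seek_shift
print(seek)                     # 13472
print(overlap / WINDOW_FRAMES)  # ~0.99: almost the whole window is re-decoded
```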

m-bain commented Mar 9, 2023

Update: I realise there is some specific underline formatting in the word_timestamps; I was able to get it working in the end. See here for a comparison of word-level timestamp accuracy:

[image]

@jongwook Could you share the evaluation setup for long-form transcription WER? I am unable to reproduce the Whisper results; right now I report the vanilla setting: greedy/beam-5 decoding without the heuristic tricks.

jongwook deleted the fix-decoding-repetition-degradation branch March 14, 2023
zackees pushed a commit to zackees/whisper that referenced this pull request May 5, 2023
…i#1046 (openai#1052)

* attempt to fix the repetition/hallucination issue identified in openai#1046

* zero-pad the audio instead of spectrogram

* formatting fix

* delete debug print
ilanit1997 pushed a commit to ilanit1997/whisper that referenced this pull request May 16, 2023
abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023