Word level timestamps and hallucinations #813

mu4farooqi · 2023-01-05T09:03:45Z

I integrated the amazing work that @jongwook did in the multilingual ASR notebook into the transcribe function of my local repository. You can clone it and try it.

I always remove silences using silero-vad before feeding the input to Whisper. This helps avoid hallucinations in every chunk except the last one. The word-level timestamp algorithm also removes hallucinations in the last chunk. For example, if the last chunk is 6 s long, hallucinated words would have the same begin and end timestamps.
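The observation about hallucinated words can be sketched as a small filter. This is only an illustration: the word dicts with text, start and end keys and the helper name drop_zero_length_words are assumptions for the example, not code from the repository.

```python
def drop_zero_length_words(words, min_duration=0.0):
    """Keep only words whose duration (seconds) exceeds min_duration.

    Hypothetical helper: assumes each word is a dict with 'text',
    'start' and 'end' keys, which may differ from the repository's format.
    """
    return [w for w in words if w["end"] - w["start"] > min_duration]


words = [
    {"text": "hello", "start": 0.50, "end": 0.92},
    {"text": "world", "start": 0.92, "end": 1.40},
    # Same begin and end timestamp: treated as a likely hallucination.
    {"text": "thanks", "start": 6.00, "end": 6.00},
]
kept = drop_zero_length_words(words)
```

Raising min_duration slightly above zero would also drop words that are implausibly short, at the risk of removing real ones.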

Bumicom · 2023-01-05T16:15:36Z

When I try the ASR notebook, I get an error at the last step.

The returned tensor has only two dimensions where four are expected.
Did you forget to check in a file?

torch.Size([1152, 1024])

IndexError Traceback (most recent call last)
Cell In[37], line 21
19 weights = torch.concatenate(QKs) # layers * heads * tokens * frames
20 print(weights.shape)
---> 21 weights = weights[:, :, :, : duration // AUDIO_SAMPLES_PER_TOKEN].cpu()
22 # weights = medfilt(weights, (1, 1, 1, medfilt_width))
23 # weights = torch.tensor(weights * qk_scale).softmax(dim=-1)
24
(...)
86 # display(pd.DataFrame(data))
87 # display(HTML("

"))

IndexError: too many indices for tensor of dimension 2

1 reply

mu4farooqi Jan 5, 2023
Author

I didn’t write the notebook. I took the concepts introduced in the notebook and integrated them into the transcribe function of my forked Whisper repo. You can try that if you want.

eschmidbauer · 2023-01-05T17:01:33Z

Hi @mu4farooqi
I tested out your branch and it's fantastic, nice work!
One problem I noticed is that it sometimes skips the first segment of audio.
I found that the line causing it is here

I understand you are doing work to prevent hallucinations there, but it may be causing an edge case with the first segment. Any ideas?

20 replies

skanda1005 Feb 21, 2023

If you pre-process the file using VAD, the word timestamps will no longer correspond to the original audio. You would need to also do some post-processing to offset the word timestamps by the length of audio that was deleted before each timestamp using the retained speech_timestamps. However, beware that when you compact all the speech segments together with no gaps in between, if the first word of a segment has a timestamp that is slightly early (since the word timestamps are still approximate), such an approach could incorrectly infer that word to belong to the previous speech segment because all the neighboring segments are tightly packed together. You might be able to avoid this by adding a bit more padding around each segment in speech_timestamps before you slice up the audio. This might also improve the transcription results in cases where the VAD is clipping things a bit too tightly. I would also suggest lowering the speech threshold for VAD from 0.5 to say 0.3 to ensure that the VAD doesn't cut out audio where there was actually some low level speech.
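The padding suggestion could look roughly like the sketch below. It assumes speech_timestamps is a list of dicts with start and end in seconds; pad_segments, its pad default and the total argument are hypothetical names chosen for the example. Overlapping segments are merged so the padded regions never double-count audio.

```python
def pad_segments(speech_timestamps, pad=0.2, total=None):
    """Widen each segment by `pad` seconds on both sides and merge overlaps.

    Hypothetical helper: assumes segments are sorted dicts with 'start'
    and 'end' in seconds. `total` is the audio length in seconds, if known.
    """
    padded = []
    for seg in speech_timestamps:
        start = max(0.0, seg["start"] - pad)
        end = seg["end"] + pad
        if total is not None:
            end = min(total, end)
        if padded and start <= padded[-1]["end"]:
            # Padding made this segment touch the previous one: merge them.
            padded[-1]["end"] = max(padded[-1]["end"], end)
        else:
            padded.append({"start": start, "end": end})
    return padded


segments = [
    {"start": 1.0, "end": 2.0},
    {"start": 2.25, "end": 3.0},
    {"start": 5.0, "end": 6.0},
]
padded = pad_segments(segments, pad=0.25, total=6.0)
```

Here the first two segments merge into one because their padded ranges overlap, and the last segment is clipped at the end of the audio.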

Also some unrelated notes:

The dtw-python library is GPL and cannot be used with Whisper's current license.

This code assumes CUDA. It has no fallback for CPU.

Hi, by any chance do you know how to do the preprocessing to get back the timestamps of my original audio? This would solve a major issue I am facing.
Thanks in advance.

mu4farooqi Feb 21, 2023
Author

To reconstruct the timestamps, here is an idea of how to do it. First, while you are doing VAD, you need to construct a map from new timestamps to old timestamps, something like the following. In this code, tss is the list of speech segments you get from VAD; as the code indexes it, each entry is a dict with 'start' and 'end' keys in seconds, e.g. [{'start': s1, 'end': e1}, {'start': s2, 'end': e2}, ...].

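SAMPLE_RATE = 16000  # assumption: Whisper's 16 kHz input rate; not defined in the original snippet

def get_ts_mapping(tss, wav):
  """
  tss: List of dicts with 'start' and 'end' timestamps (seconds) from the original audio.
  wav: Original audio loaded with any audio loading lib, e.g. torchaudio.
       len(wav) must be the number of samples, so a 1-D array is assumed here;
       for a (channels, samples) tensor use len(wav[0]) instead.
  Returns a list of (original_end, compacted_end) pairs in seconds.
  """
  # To reconstruct original word-level timestamps we need this mapping.
  tss_mapping, chunks_len = [], 0

  for i in tss:
    # Convert seconds to sample indices and clamp them to the audio length.
    s, e = int(i['start'] * SAMPLE_RATE), int(i['end'] * SAMPLE_RATE)
    s, e = min(len(wav), s), min(len(wav), e)
    # Running length of the speech kept so far, i.e. the silence-removed timeline.
    chunks_len += e - s
    tss_mapping.append((i['end'], round(chunks_len / SAMPLE_RATE, 1)))

  return tss_mapping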

Once you have word-level timestamps, you can use the above mapping to reconstruct the original word-level timestamps using binary search.

def get_corrected_word_dict(word, start, end, vad_tss_map):
  """
  Map a word's start and end (seconds on the silence-removed audio) back to the original audio.
  vad_tss_map: list of (original_end, compacted_end) pairs, as returned by get_ts_mapping.
  """

  # Degenerate word (start after end): collapse it to a zero-length word at `end`.
  if start > end:
    return dict(text=word, start=round(end, 2), end=round(end, 2))

  def bs(value):
    # Binary search for the first chunk whose compacted end is >= value.
    s, e = 0, len(vad_tss_map) - 1
    while s < e:
      m = s + (e - s) // 2
      if value <= vad_tss_map[m][1]:
        e = m
      else:
        s = m + 1

    return s

  def correct(value):
    # With an empty map there is nothing to shift.
    if len(vad_tss_map) == 0:
      return value
    idx = bs(value)
    # Offset = amount of silence removed up to the end of this chunk.
    return value + (vad_tss_map[idx][0] - vad_tss_map[idx][1])

  return dict(text=word, start=round(correct(start), 2), end=round(correct(end), 2))

It may be hard to understand, but it works according to my tests. Let me know if you need more information.
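A small worked example may make the offset idea easier to follow. This is a self-contained sketch of the same technique, not the code above: build_mapping and to_original_time are hypothetical names, segments are plain (start, end) tuples in seconds, and the standard library's bisect_left stands in for the hand-written binary search.

```python
from bisect import bisect_left


def build_mapping(segments):
    """segments: sorted (start, end) pairs in seconds on the original timeline.

    Returns (original_end, compacted_end) pairs, where compacted_end is the
    end of the same chunk after all silences before it were removed.
    """
    mapping, kept = [], 0.0
    for start, end in segments:
        kept += end - start
        mapping.append((end, kept))
    return mapping


def to_original_time(t, mapping):
    """Shift a time on the compacted timeline back to the original timeline."""
    if not mapping:
        return t
    compacted_ends = [c for _, c in mapping]
    # First chunk whose compacted end is >= t, clamped to the last chunk.
    idx = min(bisect_left(compacted_ends, t), len(mapping) - 1)
    original_end, compacted_end = mapping[idx]
    # The difference is the total silence removed up to this chunk's end.
    return t + (original_end - compacted_end)


# Speech at 1-3 s and 5-6 s; compacted audio is 3 s long.
mapping = build_mapping([(1.0, 3.0), (5.0, 6.0)])
first = to_original_time(1.0, mapping)   # falls in the first chunk
second = to_original_time(2.5, mapping)  # falls in the second chunk
```

In this example 1 s of silence precedes the first chunk and 3 s in total precede the second, so a word at 1.0 s in the compacted audio maps to 2.0 s, and one at 2.5 s maps to 5.5 s.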

skanda1005 Feb 21, 2023

Great! Thank you so much! Will give it a try!

skanda1005 Feb 23, 2023

There seems to be a small bug in your code: in the get_ts_mapping function, it needs to be len(wav[0]), not len(wav).

Also, I am having difficulty running the second function. Could you help me with how to run the binary search function?

edit: I got it all working! Thanks for your help!

mayeaux Mar 26, 2023

I have a JavaScript implementation of adjusting the timestamps as well. I'll probably post it as a Gist once it's cleaned up.

mu4farooqi · 2023-02-21T16:37:14Z

Author

It's the original audio.

On Tuesday, 21 February 2023 at 11:25:33 am GMT-5, Skanda Subramanyan wrote: Just a clarification: in the first function, will the audio input be the original audio or the modified audio after VAD?

0 replies

rodrigoGA · 2023-11-25T05:18:14Z

Excuse my ignorance, but looking at @jongwook's notebook and your repository, I am not clear on the necessary changes.
As I understand from the comment, the notebook recommends removing the fragments with repeated timestamps that are found at the end of the audio.
Is this alone sufficient to eliminate the hallucinations (once VAD is applied)?
Is there no correlation between temperature, avg_logprob, compression_ratio, and no_speech_prob that should also be taken into account to remove the hallucinations?
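For readers wondering what taking those fields into account might look like, here is a minimal sketch. It is not an answer from the thread or the notebook: looks_unreliable is a hypothetical helper, the segment dicts are made-up data using the field names from the question, and the thresholds mirror the defaults of Whisper's transcribe() (0.6, -1.0, 2.4). Applying them as a post-filter is an assumption, and whether it catches the same hallucinations as the timestamp check is untested here.

```python
def looks_unreliable(segment,
                     no_speech_threshold=0.6,
                     logprob_threshold=-1.0,
                     compression_ratio_threshold=2.4):
    """Hypothetical post-filter on a segment dict.

    Flags a segment when it is probably silence decoded with low confidence,
    or when its text is unusually repetitive (high compression ratio).
    """
    silent = (segment["no_speech_prob"] > no_speech_threshold
              and segment["avg_logprob"] < logprob_threshold)
    repetitive = segment["compression_ratio"] > compression_ratio_threshold
    return silent or repetitive


# Made-up segments for illustration only.
segments = [
    {"text": "ok", "no_speech_prob": 0.02, "avg_logprob": -0.25, "compression_ratio": 1.5},
    {"text": "silence", "no_speech_prob": 0.9, "avg_logprob": -1.4, "compression_ratio": 1.1},
    {"text": "loop loop loop", "no_speech_prob": 0.1, "avg_logprob": -0.3, "compression_ratio": 3.2},
]
kept = [s for s in segments if not looks_unreliable(s)]
```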

0 replies
