Word level timestamps and hallucinations #813
Replies: 4 comments 21 replies
-
When I try the ASR notebook, I get an error at the last step. The returned tensor has only two dimensions where four are expected:
torch.Size([1152, 1024])
IndexError Traceback (most recent call last)
IndexError: too many indices for tensor of dimension 2
-
Hi @mu4farooqi, I understand you are doing work to prevent hallucinations there, but it may be causing an edge case with the first segment. Any ideas?
-
It's the original audio.
On Tuesday, 21 February 2023 at 11:25:33 am GMT-5, Skanda Subramanyan ***@***.***> wrote:
Just a clarification, in the first function, will the audio input be the original audio or will it be the modified audio after VAD?
-
Excuse my ignorance, but looking at @jongwook's notebook and your repository, I am not clear on the necessary changes.
-
I tried to integrate the amazing work that @jongwook did with the multilingual ASR notebook into the `transcribe` function of my local repository. You can clone it and try it.

I always remove silences using silero-vad before feeding the input to Whisper. This helps avoid hallucinations in every chunk except the last one. The word-level timestamps algorithm also removes hallucinations in the last chunk. For example, if the last chunk is 6s long, hallucinated words would have the same `begin` and `end` timestamps.
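To illustrate the last point, here is a minimal sketch of how zero-duration words could be filtered out after word-level alignment. It is not the code from the repository: the word dictionary layout (`word`, `begin`, `end` keys), the function name `drop_zero_duration_words`, and the `min_duration` parameter are all assumptions made for this example.

```python
from typing import Dict, List, Union

# Hypothetical word record: {"word": str, "begin": float, "end": float}
Word = Dict[str, Union[str, float]]


def drop_zero_duration_words(words: List[Word], min_duration: float = 0.0) -> List[Word]:
    """Keep only words whose duration is strictly greater than min_duration seconds.

    Assumption: hallucinated words come out of the alignment with identical
    begin and end timestamps, so their duration is 0.
    """
    return [w for w in words if (w["end"] - w["begin"]) > min_duration]


# Made-up example: the last chunk ends at 6.0s and the trailing words
# were assigned begin == end by the alignment.
example_words: List[Word] = [
    {"word": "thank", "begin": 5.2, "end": 5.6},
    {"word": "you", "begin": 5.6, "end": 6.0},
    {"word": "for", "begin": 6.0, "end": 6.0},
    {"word": "watching", "begin": 6.0, "end": 6.0},
]

kept = drop_zero_duration_words(example_words)
print([w["word"] for w in kept])
```

Raising `min_duration` slightly above zero would also drop words that are implausibly short, at the risk of removing genuine short words, so the threshold is a judgment call.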