Shape mismatch in certain batches #9

Open
sbuser opened this issue Jan 21, 2023 · 3 comments
sbuser commented Jan 21, 2023

I've tried for a while to figure out what is causing this, without much success. Batch processing runs fine for a variety of files, but I've hit a group that throws an IndexError on the 2nd segment of the batch:

File "/app/.venv/lib/python3.9/site-packages/whisper/decoding.py", line 694, in _main_loop
    probs_at_sot.append(logits[:, self.sot_index[i]].float().softmax(dim=-1))
IndexError: index 8 is out of bounds for dimension 1 with size 3

In a normal loop, self.sot_index is the same at all indices:
[8, 8, 8, 8, 8, 8] or [11, 11, 11, 11, 11, 11]

In the batch and segment number that fails it looks like this:

[0, 0, 0, 0, 8, 0]   <-- self.sot_index
0 0 torch.Size([6, 3, 51865])  <-- i, self.sot_index[i], logits.shape
1 0 torch.Size([6, 3, 51865])
2 0 torch.Size([6, 3, 51865])
3 0 torch.Size([6, 3, 51865])
4 8 torch.Size([6, 3, 51865])
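The debug output above can be reduced to a plain-Python shape check (no torch needed); the batch and context sizes are taken directly from the printed shapes, and it makes explicit which item trips the error:

```python
# Plain-Python stand-in for the failing slice in _main_loop. Per the debug
# output, logits has shape (batch=6, n_ctx=3, vocab=51865), so indexing
# dimension 1 with anything >= 3 is out of bounds -- exactly what the
# IndexError reports for the stray value 8.
n_ctx = 3
sot_index = [0, 0, 0, 0, 8, 0]

out_of_range = [(i, idx) for i, idx in enumerate(sot_index) if idx >= n_ctx]
print(out_of_range)  # the single offender: item 4 with index 8
```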

I can't work out how this is happening. I'm not providing different languages or an initial prompt, so I don't understand the mismatch in sot_index here.
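For context, in the upstream whisper codebase sot_index is the position of the `<|startoftranscript|>` token within the decoder's initial token sequence, so it shifts whenever prompt/previous-text tokens are prepended (e.g. via condition_on_previous_text). A guess at the mechanism, sketched below: one batch item carrying leftover prompt tokens would explain a lone 8 among 0s. The token IDs are the multilingual Whisper specials; treat the exact values and the helper as illustrative assumptions:

```python
# Sketch of how whisper derives sot_index: it is the position of the
# <|startoftranscript|> token inside the initial token sequence, so any
# item that drags in condition_on_previous_text prompt tokens gets a
# shifted index. Token IDs are assumed multilingual Whisper specials.
SOT = 50258       # <|startoftranscript|>
SOT_PREV = 50361  # <|startofprev|>

def sot_position(initial_tokens):
    # Mirrors DecodingTask: initial_tokens.index(tokenizer.sot)
    return initial_tokens.index(SOT)

# No prompt: SOT leads, index 0 -- like most items in the failing batch.
assert sot_position([SOT, 50259, 50359]) == 0

# One item carrying 7 previous-text tokens after <|startofprev|>:
# SOT lands at index 8, matching the stray 8 in self.sot_index.
prompt = [SOT_PREV, 11, 12, 13, 14, 15, 16, 17]
assert sot_position(prompt + [SOT, 50259, 50359]) == 8
```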

I do see in the output that it hasn't properly transcribed portions of that file from the first segment onward. I don't see where it would be holding onto that state to cause this problem, but something is broken.

Sorry I'm not of more help on this. I'll keep digging.

Blair-Johnson (Owner) commented

Interesting, I'm happy to help you with debugging this. Do the clips transcribe properly on the official whisper implementation?

sbuser (Author) commented Jan 21, 2023

It does, yes. Interestingly, not only does the official implementation not generate the IndexError, it also does a better job on the transcription itself. Perhaps related to the temperature cascading discussed in the other issue? I'm not sure.

Without changing anything except adding print statements to diagnose this issue, on maybe the 10th run it actually got past the step that had previously failed (no IndexError) and emitted a bunch of garbage in that segment's transcription. I suppose nothing guarantees the outputs here are deterministic, but that surprised me.

While digging into this, I also found that the fix for no_speech_prob returning an array of all the probabilities breaks running whisper against a single audio file (i.e. when it bypasses the batch code entirely).

Edit to clarify the non-deterministic behavior: that was probably related to the other files in the batch changing between runs. I'm batching by file size, and there were quite a few files with exactly the same size, so that likely accounts for the differences between runs rather than the model itself. If so, then it's pretty clear the temperature linking can have a negative effect. Batching certainly has some effect on the output, because the file is fully and properly transcribed when run by itself.
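As a side note on the size-based batching: sorting by size alone leaves files of identical size in an unstable order, so batch membership can change between runs. A hypothetical helper (not from this repo) that tie-breaks on the path makes batch composition deterministic:

```python
def batches_by_size(files, batch_size):
    """Group (path, size_bytes) pairs into batches of similar size.

    Sorting on (size, path) rather than size alone gives files with
    identical sizes a stable order, so batches are reproducible across
    runs. Illustrative sketch, not code from this repository.
    """
    ordered = sorted(files, key=lambda f: (f[1], f[0]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

files = [("b.wav", 100), ("a.wav", 100), ("c.wav", 50)]
print(batches_by_size(files, 2))
# [[('c.wav', 50), ('a.wav', 100)], [('b.wav', 100)]]
```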

@JunZhan2000

I've run into this too.
