
Word level timestamps #2

Closed
dennislysenko opened this issue Jan 31, 2024 · 7 comments · Fixed by #38
Labels
feature New feature or request triaged This issue has been looked at and prioritized by a maintainer

Comments

@dennislysenko

Segment level timestamps look good, great work guys.

Are token level timestamps currently supported somehow, or on the roadmap?

@ZachNagengast
Contributor

Thanks @dennislysenko! Yes, parity with OpenAI's Python features is definitely at the top of the list. Here's our short list of missing topics that will be coming in future releases (before v1.0.0): https://github.com/argmaxinc/WhisperKit/wiki/Roadmap-&-Contribution-Guide.

@atiorh
Contributor

atiorh commented Jan 31, 2024

Adding to Zach's comment, we think word-level timestamps are a great opportunity for a community contribution, but if no one picks it up in a few weeks, we will do it :)

@fakerybakery

+1, would love to have this feature!

@ZachNagengast ZachNagengast added triaged This issue has been looked at and prioritized by a maintainer feature New feature or request labels Feb 16, 2024
@ZachNagengast ZachNagengast self-assigned this Feb 16, 2024
@ZachNagengast
Contributor

FYI this is in progress 🎉

Adding more details here:

Word-level timestamps provide precise start and end times for each word in a transcription (note: not each token). There are many use cases for this, but in general they give much more flexibility when choosing how to display transcriptions for videos, and they will assist significantly with alignment for streaming. Here's an example output from the OpenAI Whisper API for the jfk.wav file:

[{'word': 'And', 'start': 0, 'end': 0.64},
 {'word': 'so', 'start': 0.64, 'end': 0.98},
 {'word': 'my', 'start': 0.98, 'end': 1.32},
 {'word': 'fellow', 'start': 1.32, 'end': 1.68},
 {'word': 'Americans', 'start': 1.68, 'end': 2.28},
 {'word': 'ask', 'start': 3.8, 'end': 3.8},
 {'word': 'not', 'start': 3.8, 'end': 4.38},
 {'word': 'what', 'start': 4.38, 'end': 5.62},
 {'word': 'your', 'start': 5.62, 'end': 5.96},
 {'word': 'country', 'start': 5.96, 'end': 6.32},
 {'word': 'can', 'start': 6.32, 'end': 6.72},
 {'word': 'do', 'start': 6.72, 'end': 6.88},
 {'word': 'for', 'start': 6.88, 'end': 7.16},
 {'word': 'you', 'start': 7.16, 'end': 7.64},
 {'word': 'ask', 'start': 8.5, 'end': 8.56},
 {'word': 'what', 'start': 8.56, 'end': 8.84},
 {'word': 'you', 'start': 8.84, 'end': 9.16},
 {'word': 'can', 'start': 9.16, 'end': 9.44},
 {'word': 'do', 'start': 9.44, 'end': 9.62},
 {'word': 'for', 'start': 9.62, 'end': 9.84},
 {'word': 'your', 'start': 9.84, 'end': 10.22},
 {'word': 'country', 'start': 10.22, 'end': 10.38}]

For the implementation here, we will also want average log probs and other token-level context for each word.
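
As a rough sketch of what that per-word output could look like on the Swift side (the type and property names here are assumptions for illustration, not the actual WhisperKit API), each word would carry its timing plus the token-level context:

// Hypothetical types for illustration only; not the actual WhisperKit API.
struct WordTimingSketch: Codable {
    let word: String       // surface text, e.g. "Americans"
    let start: Float       // start time in seconds
    let end: Float         // end time in seconds
    let tokens: [Int]      // token ids that make up this word
    let avgLogProb: Float  // average log probability over those tokens
}

struct SegmentWordsSketch: Codable {
    let text: String                // full segment text
    let words: [WordTimingSketch]   // word-level breakdown with timings
}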

References:

Original PR in the openai implementation: openai/whisper#869
Current code (there have been several improvements since the original PR): https://github.com/openai/whisper/blob/main/whisper/timing.py
MLX implementation: ml-explore/mlx-examples#201

@ldenoue

ldenoue commented Mar 6, 2024

@ZachNagengast is it possible to also get the words in the decodingCallback's TranscriptionProgress?
Currently we only get the full text, which isn't enough to let users e.g. play the audio at a specific time until all the words are available (which could take a few minutes on longer audio files).

@ZachNagengast
Contributor

ZachNagengast commented Mar 7, 2024

@ldenoue Can you elaborate a bit on this? By words, do you mean the word timestamps? If so, they're not currently available until the segment is completed, but we're investigating how to provide this in each decoding loop. If you could provide a simple json of what you'd like to see returned in a particular TranscriptionProgress, that would be helpful too.

@ldenoue

ldenoue commented Mar 8, 2024

@ZachNagengast Yes, basically I would like to know the words that make up the progress.text, so something like:

{ text: "Hello world", words: [{text: "Hello", "start": 0, "end": 0.1}, {text: " world", "start": 0.1, "end": 0.2}]}

Projects
Status: Done

5 participants