Word level timestamps #2
Thanks @dennislysenko! Yes, parity with openai's python features is definitely at the top of the list. Here's our short list of missing topics that will be coming in future releases (before v1.0.0): https://github.com/argmaxinc/WhisperKit/wiki/Roadmap-&-Contribution-Guide
Adding to Zach's comment: we think word-level timestamps are a great opportunity for a community contribution, but if no one does it in a few weeks, we will do it :)
+1, would love to have this feature!
FYI, this is in progress 🎉 Adding more details here: word-level timestamps provide precise start and end times for each word in a transcription (note: each word, not each token). There are many use cases for this, but in general they give much more flexibility when choosing how to display transcriptions for videos, and will assist significantly with alignment for streaming. Here's an example output from the openai whisper API:

```python
[{'word': 'And', 'start': 0, 'end': 0.64},
 {'word': 'so', 'start': 0.64, 'end': 0.98},
 {'word': 'my', 'start': 0.98, 'end': 1.32},
 {'word': 'fellow', 'start': 1.32, 'end': 1.68},
 {'word': 'Americans', 'start': 1.68, 'end': 2.28},
 {'word': 'ask', 'start': 3.8, 'end': 3.8},
 {'word': 'not', 'start': 3.8, 'end': 4.38},
 {'word': 'what', 'start': 4.38, 'end': 5.62},
 {'word': 'your', 'start': 5.62, 'end': 5.96},
 {'word': 'country', 'start': 5.96, 'end': 6.32},
 {'word': 'can', 'start': 6.32, 'end': 6.72},
 {'word': 'do', 'start': 6.72, 'end': 6.88},
 {'word': 'for', 'start': 6.88, 'end': 7.16},
 {'word': 'you', 'start': 7.16, 'end': 7.64},
 {'word': 'ask', 'start': 8.5, 'end': 8.56},
 {'word': 'what', 'start': 8.56, 'end': 8.84},
 {'word': 'you', 'start': 8.84, 'end': 9.16},
 {'word': 'can', 'start': 9.16, 'end': 9.44},
 {'word': 'do', 'start': 9.44, 'end': 9.62},
 {'word': 'for', 'start': 9.62, 'end': 9.84},
 {'word': 'your', 'start': 9.84, 'end': 10.22},
 {'word': 'country', 'start': 10.22, 'end': 10.38}]
```

For the implementation here, we will also want avg log probs and other such token-level context for each word.

References: original PR in the openai implementation: openai/whisper#869
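As a rough illustration of why per-word timings are useful (a hypothetical post-processing sketch in Python, not part of WhisperKit's API), the output above can be split into caption lines wherever there is a pause between words:

```python
# Hypothetical sketch: group word timings into caption lines at pauses.
# `words` mirrors the first few entries of the openai whisper API output above.
words = [
    {'word': 'And', 'start': 0, 'end': 0.64},
    {'word': 'so', 'start': 0.64, 'end': 0.98},
    {'word': 'my', 'start': 0.98, 'end': 1.32},
    {'word': 'fellow', 'start': 1.32, 'end': 1.68},
    {'word': 'Americans', 'start': 1.68, 'end': 2.28},
    {'word': 'ask', 'start': 3.8, 'end': 3.8},
    {'word': 'not', 'start': 3.8, 'end': 4.38},
]

def split_lines(words, gap=0.5):
    """Start a new caption line when the silence before a word exceeds `gap` seconds."""
    lines = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cur['start'] - prev['end'] > gap:
            lines.append([])
        lines[-1].append(cur)
    return [' '.join(w['word'] for w in line) for line in lines]

print(split_lines(words))  # ['And so my fellow Americans', 'ask not']
```

Segment-level timestamps alone can't do this, since the pause boundary falls in the middle of a segment.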
@ZachNagengast is it possible to also get the words in the decodingCallback's TranscriptionProgress?
@ldenoue Can you elaborate a bit on this? By words, do you mean the word timestamps? If so, they are not currently available until the segment is completed, but we're investigating how to do this in each decoding loop. If you could provide a simple json of what you'd like to see returned in a particular TranscriptionProgress, that would be helpful too.
@ZachNagengast Yes, basically I would like to know the words that make up the progress.text, so something like:
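For illustration only (a hypothetical shape, not an actual WhisperKit type or committed format), a `words` array alongside the existing `text` field:

```json
{
  "text": "And so my fellow",
  "words": [
    {"word": "And", "start": 0.0, "end": 0.64},
    {"word": "so", "start": 0.64, "end": 0.98},
    {"word": "my", "start": 0.98, "end": 1.32},
    {"word": "fellow", "start": 1.32, "end": 1.68}
  ]
}
```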
Segment-level timestamps look good, great work guys.
Are token-level timestamps currently supported somehow, or on the roadmap?