Add CSV formatted output in transcript, using integer start/end times in milliseconds. #228
…ormatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>
column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''".
The CSV seems to just multiply the range by 1000 to get ms resolution. Could we please have actual ms resolution for the speakers? When doing speech-to-text on a natural conversation and trying to combine it with speaker recognition (PyTorch, for example), two people may speak within the same second, so we don't know who said what. With ms resolution out of Whisper, we could easily tell who said which sentence in a natural conversation.
To use Whisper for subtitles, millisecond resolution would be a big plus.
@ksn-systems that is precisely why I made this change. See #233 for details.
@jongwook would it be possible to "approve the workflow"? I see this PR is stuck at "1 workflow awaiting approval". Is there anything I should change or clarify in order to get this PR merged or to satisfy the "workflow awaiting approval" check? Please note that a similar PR was merged for whisper.cpp ( ggerganov/whisper.cpp#340 ). Having this functionality in place for whisper makes it easier to compare results between implementations, since the output can be imported into spreadsheets and databases that support the CSV format.
@NielsMayer Thanks for the PR! Would it work for you if I make this TSV instead of CSV? I should probably merge #333 first with some modifications, as the number of output files is becoming unwieldy.
Yes, merging #333 first makes sense. If you do that, I will update this PR to conform to the changes made, e.g. by adding an additional csv (or tsv) output format keyword. W/r/t changing from CSV to TSV, that is fine by me, as it would require only a trivial change on my end. I chose CSV because it seems more "standard", although most programs that automatically import CSV handle TSV just as easily. I'm not familiar with the comma issues you mention in csv.reader or pandas.read_csv; however, do note that I updated the code for compatibility with importing into, e.g., OpenOffice, where the string type is automatically recognized if delimited by the '"' character. To prevent issues, internal '"' characters are converted to "''". Chances are, such formatting and the lack of a special escape character mean the existing solution would work with the readers you mention, @jongwook. PS: I updated my repo for this PR to the latest head of the whisper repository, so once again, there is a "workflow awaiting approval" message...
FYI, here's a Whisper CSV file read into LibreOffice on a Linux desktop. Note the " replaced by ''.
@jongwook how on earth did whisper figure out that Sometimes however it gets the theme music wrong on the same show, but different episode: ( https://rumble.com/v24mywg-what-really-happened-in-brazil-yesterday-system-update-18.html ) Likewise how is whisper figuring out where quotes start and end? It's kind of spooky actually! :-)
Thanks for accommodating the TSV suggestion! I merged a refactored version of #333 and edited this PR accordingly. The issue about CSV is that, although CSV is more widely used and well known, even the simplest possible case like:
results in a very inconsistent user experience depending on the program, because of the lack of standardization around the quotation marks: The latter two are the most common ways to read CSV files in Python. There is some combination of options that reads the file as intended, but it's inconvenient and impractical to expect users to use the "correct" configuration for every reader implementation. Meanwhile, TSV doesn't need to deal with quoting because the field values are not allowed to contain tab characters. Re: the second comment, because of the way that Whisper was trained, the model must have encountered the exact music and the text
… in milliseconds. (openai#228) * Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas> * for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''". * fix syntax error * docstring edit Co-authored-by: Jong Wook Kim <[email protected]> Co-authored-by: Jong Wook Kim <[email protected]>
This PR adds CSV output to Whisper transcription similar to the way #102 added SRT subtitle formatted output.
Each line of the resulting CSV file is formatted like:
<startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, "<transcript-including-commas>"
One of the reasons for using integer millisecond timings is to avoid regional incompatibilities when writing and reading floating-point timings across locales that use different characters, either "." or ",", as the decimal separator (cf. #197). The CSV format with integer millisecond timings also allows for more efficient parsing and storage of Whisper results when they are read into other applications, including ones written in other languages like C++.
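As a rough illustration of the line format described above, here is a minimal Python sketch. It is not the code from this PR: the helper name `to_csv_line` and the sample `segments` list (dicts with `start`, `end` in seconds and `text`) are assumptions made for the example. It converts times to integer milliseconds, replaces any `"` in the transcript with `''`, and then reads the result back with Python's standard `csv` module.

```python
import csv
import io


def to_csv_line(start_s: float, end_s: float, text: str) -> str:
    """Hypothetical helper: format one segment as <start-ms>, <end-ms>, "<text>"."""
    start_ms = round(start_s * 1000)  # integer milliseconds, no decimal separator
    end_ms = round(end_s * 1000)
    # Replace double quotes so they cannot terminate the quoted third column.
    safe_text = text.strip().replace('"', "''")
    return f'{start_ms}, {end_ms}, "{safe_text}"'


# Sample data (assumed shape: times in seconds plus the transcript text).
segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello, world."},
    {"start": 2.5, "end": 4.25, "text": ' She said "hi".'},
]

lines = [to_csv_line(s["start"], s["end"], s["text"]) for s in segments]
csv_text = "\n".join(lines) + "\n"

# The fields are separated by ", " (comma plus space), so csv.reader needs
# skipinitialspace=True to recognize the opening quote of the third column.
reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
rows = [(int(start), int(end), text) for start, end, text in reader]
```

Note that with the reader's default settings (`skipinitialspace=False`), the space after each comma makes the quote character part of the field, so a transcript containing commas is split into extra columns. That is one concrete instance of the reader-dependent behavior discussed in the thread.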