Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CSV formatted output in transcript, using integer start/end times in milliseconds. #228

Merged
merged 11 commits into from
Jan 22, 2023

Conversation

NielsMayer
Copy link
Contributor

@NielsMayer NielsMayer commented Oct 2, 2022

This PR adds CSV output to Whisper transcription similar to the way #102 added SRT subtitle formatted output.

Each line of the resulting CSV file is formatted like:
<startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, "<transcript-including-commas>"

One of the reasons for using integer millisecond timings is to avoid regional incompatibilities with writing and reading floating point timings across language regions which use different characters - either "." or "," - as the decimal separator (c.f. #197). The CSV format with integer millisecond timings also allows for more efficient parsing and storage of Whisper results when read into other applications, in other languages like C++.

…ormatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>
column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".
@Yoyoma22
Copy link

Yoyoma22 commented Oct 27, 2022

The CSV seems to just multiply the range by 1000 to get ms resolution. Could we please have actual ms resolution for the speakers? If doing text-to-speech of a natural conversation, and trying to combine it with speaker recognition (pytorch for example), two persons may speak in the same second. We don't know who said what. Therefore, if we had ms resolution out of whisper, we could easily know who said what sentence in a natural conversation.

whisper/utils.py Outdated Show resolved Hide resolved
@ksn-systems
Copy link

To use Whisper for subtitles, Millisecond resolution would be a big plus.

@NielsMayer
Copy link
Contributor Author

@ksn-systems that is precisely why I made this change. See #233 for details.

@NielsMayer
Copy link
Contributor Author

@jongwook would it be possible to "approve the workflow" as I see this PR is stuck at "1 workflow awaiting approval".

Is there anything I should change or clarify in order to get this PR merged or get the "workflow awaiting approval" to be satisfied?

Please note a similar PR was merged for whisper.cpp ( ggerganov/whisper.cpp#340 ). Having this functionality in place for whisper allows for easier comparison of results between implementations, via easier importing into spreadsheets and databases supporting the CSV format.

@jongwook
Copy link
Collaborator

jongwook commented Jan 21, 2023

@NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized and csv.reader and pandas.read_csv often create headache parsing quotes.

I should probably merge #333 first with some modifications as the number of output files is becoming unwieldly.

@NielsMayer
Copy link
Contributor Author

@NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized and csv.reader and pandas.read_csv often create headache parsing quotes.

I should probably merge #333 first with some modifications as the number of output files is becoming unwieldly.

Yes, merging #333 makes sense. If you do that first, I will update this PR to conform to the changes made, e.g. add additional csv (or tsv) output format keyword.

W/r/t changing from CSV to TSV, that is fine by me as it would require a trivial change on my end. The reason why I chose CSV is that it seems more "standard"; although most of the programs automatically importing from CSV just as easily handle TSV.

I'm not familiar with the comma issues you mention in csv.reader or pandas.read_csv, however, do note that I updated the code for compatibility with importing into, e.g. openoffice, where string type is automatically recognized if delimited by '"' character. To prevent issues, Internal " in each CSV text line is replaced by two consecutive single quotes '' ... Note: print('"' + segment['text'].strip().replace('"', "''") + '"'

Chances are, such formatting and lack of special escape character means the existing solution would work with the readers you mention, @jongwook .

PS: I updated my repo for this PR to the latest head of repository from whisper, so once again, there is a "workflow awaiting approval" message...

@NielsMayer
Copy link
Contributor Author

FYI here's a whisper CSV file read into libreoffice on a linux desktop. Note the " replaced by ''
(orig source: https://rumble.com/v2619vq-bills-proliferate-to-criminalize-speech-darren-beattie-on-lex-and-brazil-fa.html )

3171020 | 3172820 | and he is up right now.
3172820 | 3176820 | [''System Updates,'' main theme music playing.]
3182620 | 3183760 | Great to be here.

@jongwook how on earth did whisper figure out that [''System Updates,'' main theme music playing.] -- I mean how did it figure out the name of the show ("remembered" the announcement of the show name at the beginning, sometimes an hour earlier?)

Sometimes however it gets the theme music wrong on the same show, but different episode: ( https://rumble.com/v24mywg-what-really-happened-in-brazil-yesterday-system-update-18.html )
`"[''The Daily Show Theme'']"

Likewise how is whisper figuring out where quotes start and end? It's kind of spooky actually! :-)

@jongwook jongwook merged commit f5bfe00 into openai:main Jan 22, 2023
@jongwook
Copy link
Collaborator

Thanks for accommodating the TSV suggestion! I merged a refactored version of #333 and edited this PR accordingly.

The issue about CSV is that, although CSV is more widely used and well known, even the simplest possible case like:

start, end, text
1234, 12345, "hello, world!"

results in very inconsistent user experience according to the program because of the lack of standardization around the quotation marks:

Apple Quick Look:
image

Apple Numbers:
image

csv.reader():
image

pandas.read_csv()
image

The latter two are the most common way to read CSV files in Python -- there are some combination of options to read the file as intended, but it's inconvenient and not practical to expect the users to use the "correct" configuration for all reader implementations.

Meanwhile, TSV doesn't need to deal with quoting because the field values are not allowed to contain tab characters.


Re: the second comment, because of the way that Whisper was trained, the model must have encountered the exact music and the text ["System Updates" main theme music playing.] multiple times during training. It's usually an undesired behavior, and we tried to mitigate this (rather hackily) by suppressing the [ character by default.

zackees pushed a commit to zackees/whisper that referenced this pull request May 5, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <[email protected]>
Co-authored-by: Jong Wook Kim <[email protected]>
ilanit1997 pushed a commit to ilanit1997/whisper that referenced this pull request May 16, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <[email protected]>
Co-authored-by: Jong Wook Kim <[email protected]>
abyesilyurt pushed a commit to abyesilyurt/whisper that referenced this pull request Nov 13, 2023
… in milliseconds. (openai#228)

* Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

* for easier reading by spreadsheets importing CSV, the third

column of the CSV file is delimited by quotes, and any quote
characters that might be in the transcript (which would interfere with
parsing the third column as a string) are converted to "''".

* fix syntax error

* docstring edit

Co-authored-by: Jong Wook Kim <[email protected]>
Co-authored-by: Jong Wook Kim <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants