Add CSV formatted output in transcript, using integer start/end times in milliseconds. #228

NielsMayer · 2022-10-02T21:45:50Z

This PR adds CSV output to Whisper transcription similar to the way #102 added SRT subtitle formatted output.

Each line of the resulting CSV file is formatted like:
<startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, "<transcript-including-commas>"

One of the reasons for using integer millisecond timings is to avoid regional incompatibilities when writing and reading floating-point timings across locales that use different characters, either "." or ",", as the decimal separator (cf. #197). The CSV format with integer millisecond timings also allows for more efficient parsing and storage of Whisper results when they are read into other applications written in other languages, such as C++.

For easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''".
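
As a rough illustration of the format described above, here is a minimal sketch. It is not the PR's actual code: the helper name `write_csv_sketch` and the sample segment times are assumptions made for the example, and only the `segment['text'].strip().replace('"', "''")` step is taken from this thread.

```python
import io


def write_csv_sketch(segments, file):
    """Hypothetical helper: one CSV line per segment, times as integer ms."""
    for segment in segments:
        # Whisper segment times are floats in seconds; round to whole milliseconds.
        start_ms = round(1000 * segment["start"])
        end_ms = round(1000 * segment["end"])
        # Replace double quotes with two single quotes so the quoted
        # third column never contains the delimiter character itself.
        text = segment["text"].strip().replace('"', "''")
        print(f'{start_ms}, {end_ms}, "{text}"', file=file)


# Illustrative segments (assumed values, not real Whisper output).
segments = [
    {"start": 3171.02, "end": 3172.82, "text": " and he is up right now."},
    {"start": 3172.82, "end": 3176.82,
     "text": ' ["System Updates," main theme music playing.]'},
]

buffer = io.StringIO()
write_csv_sketch(segments, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Using `round` rather than `int` avoids truncating values such as 3171019.9999999995 down to the previous millisecond.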

Yoyoma22 · 2022-10-27T00:09:13Z

The CSV seems to just multiply the range by 1000 to get ms resolution. Could we please have actual ms resolution for the speakers? If doing speech-to-text of a natural conversation, and trying to combine it with speaker recognition (pytorch for example), two persons may speak in the same second, so we don't know who said what. If we had ms resolution out of whisper, we could easily know who said which sentence in a natural conversation.

whisper/utils.py

ksn-systems · 2022-10-29T05:01:03Z

To use Whisper for subtitles, millisecond resolution would be a big plus.

NielsMayer · 2022-11-30T22:05:49Z

@ksn-systems that is precisely why I made this change. See #233 for details.

NielsMayer · 2023-01-20T22:11:19Z

@jongwook would it be possible to "approve the workflow"? I see this PR is stuck at "1 workflow awaiting approval".

Is there anything I should change or clarify in order to get this PR merged or get the "workflow awaiting approval" to be satisfied?

Please note a similar PR was merged for whisper.cpp ( ggerganov/whisper.cpp#340 ). Having this functionality in place for whisper allows for easier comparison of results between implementations, via easier importing into spreadsheets and databases supporting the CSV format.

jongwook · 2023-01-21T06:07:33Z

@NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized, and csv.reader and pandas.read_csv often cause headaches when parsing quotes.

I should probably merge #333 first with some modifications, as the number of output files is becoming unwieldy.

NielsMayer · 2023-01-21T22:27:36Z

> @NielsMayer Thanks for the PR! Would it work for you if I make this write_tsv instead? CSV format is not standardized, and csv.reader and pandas.read_csv often cause headaches when parsing quotes.
>
> I should probably merge #333 first with some modifications, as the number of output files is becoming unwieldy.

Yes, merging #333 makes sense. If you do that first, I will update this PR to conform to the changes made, e.g. add additional csv (or tsv) output format keyword.

W/r/t changing from CSV to TSV, that is fine by me as it would require a trivial change on my end. The reason why I chose CSV is that it seems more "standard"; although most of the programs automatically importing from CSV just as easily handle TSV.

I'm not familiar with the comma issues you mention in csv.reader or pandas.read_csv; however, do note that I updated the code for compatibility with importing into, e.g., OpenOffice, where the string type is automatically recognized if the field is delimited by the '"' character. To prevent issues, any internal " in each CSV text line is replaced by two consecutive single quotes '' ... Note: print('"' + segment['text'].strip().replace('"', "''") + '"')

Chances are, such formatting and lack of special escape character means the existing solution would work with the readers you mention, @jongwook .

PS: I updated my repo for this PR to the latest head of repository from whisper, so once again, there is a "workflow awaiting approval" message...

NielsMayer · 2023-01-21T23:02:52Z

FYI here's a whisper CSV file read into libreoffice on a linux desktop. Note the " replaced by ''
(orig source: https://rumble.com/v2619vq-bills-proliferate-to-criminalize-speech-darren-beattie-on-lex-and-brazil-fa.html )

3171020 | 3172820 | and he is up right now.
3172820 | 3176820 | [''System Updates,'' main theme music playing.]
3182620 | 3183760 | Great to be here.

@jongwook how on earth did whisper figure out that [''System Updates,'' main theme music playing.] -- I mean how did it figure out the name of the show ("remembered" the announcement of the show name at the beginning, sometimes an hour earlier?)

Sometimes however it gets the theme music wrong on the same show, but different episode: ( https://rumble.com/v24mywg-what-really-happened-in-brazil-yesterday-system-update-18.html )
`"[''The Daily Show Theme'']"`

Likewise how is whisper figuring out where quotes start and end? It's kind of spooky actually! :-)

jongwook · 2023-01-22T08:52:46Z

Thanks for accommodating the TSV suggestion! I merged a refactored version of #333 and edited this PR accordingly.

The issue about CSV is that, although CSV is more widely used and well known, even the simplest possible case like:

start, end, text
1234, 12345, "hello, world!"

results in a very inconsistent user experience depending on the program, because of the lack of standardization around quotation marks:

Apple Quick Look:

Apple Numbers:

csv.reader():

pandas.read_csv()

The latter two are the most common ways to read CSV files in Python. There are some combinations of options that read the file as intended, but it's inconvenient and not practical to expect users to apply the "correct" configuration for every reader implementation.
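
The `csv.reader()` case can be reproduced with the standard library alone. The sketch below feeds the two-line sample from this comment to `csv.reader` twice: once with default settings, and once with `skipinitialspace=True`, which is one of the option combinations that reads the file as intended. The variable names are only for illustration.

```python
import csv
import io

# The sample from the comment above, with a space after each comma.
sample = 'start, end, text\n1234, 12345, "hello, world!"\n'

# Default dialect: the quote is not the first character of the field
# (a space precedes it), so it is kept as a literal character and the
# comma inside the text splits the field.
default_rows = list(csv.reader(io.StringIO(sample)))

# skipinitialspace=True drops the space after the delimiter, so the
# quote starts the field and the text is parsed as one quoted value.
lenient_rows = list(csv.reader(io.StringIO(sample), skipinitialspace=True))

print(default_rows[1])  # ['1234', ' 12345', ' "hello', ' world!"']
print(lenient_rows[1])  # ['1234', '12345', 'hello, world!']
```

With the defaults the data row comes back as four fields instead of three, which is the kind of surprise a reader would have to configure away.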

Meanwhile, TSV doesn't need to deal with quoting because the field values are not allowed to contain tab characters.
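
To make the contrast concrete, here is a minimal sketch of a tab-separated writer and a round trip through `csv.reader`. It is an illustration under assumptions, not necessarily the merged implementation: the helper name `write_tsv_sketch`, the header row, and the choice to replace tabs in the text with spaces are all assumed for the example.

```python
import csv
import io


def write_tsv_sketch(segments, file):
    """Hypothetical helper: tab-separated rows with integer-millisecond times."""
    # Header row (an assumption for this sketch).
    print("start", "end", "text", sep="\t", file=file)
    for segment in segments:
        start_ms = round(1000 * segment["start"])
        end_ms = round(1000 * segment["end"])
        # Field values may not contain tabs, so replace any with a space.
        text = segment["text"].strip().replace("\t", " ")
        print(start_ms, end_ms, text, sep="\t", file=file)


# Illustrative segment (assumed values); note the quotes and comma in the text.
segments = [{"start": 1.234, "end": 12.345, "text": ' "hello, world!"'}]

buffer = io.StringIO()
write_tsv_sketch(segments, buffer)

# Read back with quoting disabled: quotes and commas are ordinary characters.
rows = list(
    csv.reader(io.StringIO(buffer.getvalue()), delimiter="\t", quoting=csv.QUOTE_NONE)
)
print(rows)
```

Because no quoting or escaping is involved, the text column survives the round trip unchanged, quotes and commas included.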

Re: the second comment, because of the way that Whisper was trained, the model must have encountered the exact music and the text ["System Updates" main theme music playing.] multiple times during training. It's usually an undesired behavior, and we tried to mitigate this (rather hackily) by suppressing the [ character by default.


NielsMayer added 3 commits October 1, 2022 22:15

Add CSV format output in transcript, containing lines of characters formatted like: <startTime-in-integer-milliseconds>, <endTime-in-integer-milliseconds>, <transcript-including-commas>

7c5dfb4

for easier reading by spreadsheets importing CSV, the third column of the CSV file is delimited by quotes, and any quote characters that might be in the transcript (which would interfere with parsing the third column as a string) are converted to "''".

9249e42

Merge branch 'openai:main' into main

c3c190b

Yoyoma22 reviewed Oct 27, 2022


Merge branch 'openai:main' into main

3ce1e31

Merge branch 'openai:main' into main

c3ffedd

NielsMayer mentioned this pull request Dec 27, 2022

Similar to Whisper PR#228, this adds -ocsv, aka --output-csv, writing CSV file containing millisecond timestamps ggerganov/whisper.cpp#340

Merged

NielsMayer added 2 commits January 6, 2023 19:44

Merge branch 'openai:main' into main

8712bf1

Merge branch 'openai:main' into main

21569c3

Merge branch 'openai:main' into main

8f878fa

jongwook and others added 3 commits January 22, 2023 00:05

Merge branch 'main' into main

b161f34

fix syntax error

f7e96de

docstring edit

691ffc2

jongwook merged commit f5bfe00 into openai:main Jan 22, 2023

NielsMayer mentioned this pull request May 29, 2023

support comma separated output_format #1039

Open
