Parse closed caption files for Oakland better #203

evamaxfield · 2022-08-18T01:25:41Z

I fixed the cookiecutter / infra build and deployed Oakland!

There are still minor problems with permissions / CORS settings so ignore the broken video (I will fix that tomorrow)

That said, Oakland has closed caption files, which we parse and convert into our transcript format like we do for Seattle and Boston. Once again, there are some minor differences from Boston and Seattle that I would love to be able to fix.

example of oakland event: https://councildataproject.org/oakland/#/events/e0912a619d69

the transcript includes a bunch of ">> " markings where there are new speakers. To me these should be filtered out properly.

Similarly there are also symbols I have never seen in a closed caption file like: "->> " which I assume also denote a new speaker.

cc @isaacna @Shak2000 @dphoria any of you free to take this one?

dphoria · 2022-08-18T01:47:06Z

I can work on this on weekends. Having said that, between work and school I have little free time, unfortunately. So @Shak2000 if you do have time and are willing, that would be better. I will wait until this weekend and then go from there.

Thank you Eva!

Shak2000 · 2022-08-18T02:00:14Z

Let me take a look

Shak2000 · 2022-08-18T06:38:42Z

I am not very familiar with transcripts, so I want to first write a unit test for this case. To make this possible, could you please send me a link to example pages on the Oakland website as well as an example transcript or audio file?

evamaxfield · 2022-08-18T19:57:45Z

@Shak2000 example webpage with transcript: https://councildataproject.org/oakland/#/events/e0912a619d69

script for downloading oakland resources:

from cdp_backend.database import models as db_models
from cdp_backend.utils import file_utils
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client

# Connect to the database
fireo.connection(client=Client(
    project="cdp-oakland-ba81c097",
    credentials=AnonymousCredentials()
))

# Get event
event = db_models.Event.collection.get("event/e0912a619d69")

# Get session
session = list(
    db_models.Session.collection.filter("event_ref", "==", event.key).fetch()
)[0]

# Get transcript
transcript = list(
    db_models.Transcript.collection.filter("session_ref", "==", session.key).fetch()
)[0]
transcript_file = transcript.file_ref.get()

# Connect to filestore
fs = GCSFileSystem(project="cdp-oakland-ba81c097", token="anon")

# Download transcript
fs.get(transcript_file.uri, "oakland-transcript.json")

# Download captions
file_utils.resource_copy(
    "https://oakland.granicus.com//videos/5042/captions.vtt",
    "oakland-captions.vtt",
    overwrite=True,
)

# If you want to read the transcript as a Python object
with open("oakland-transcript.json", "r") as open_f:
    read_transcript = Transcript.from_json(open_f.read())

print(read_transcript)

Copy and paste that into a file and run it with Python; it will download two files, oakland-captions.vtt and oakland-transcript.json.

Shak2000 · 2022-08-22T05:56:52Z

I managed to add the unit tests to test_webvtt_sr_model.py, and they ran successfully.

I can now debug the transcribe function in webvtt_sr_model.py. Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read it.

evamaxfield · 2022-08-22T19:10:14Z

> Before I debug, I would like to better understand why we implemented a different model. If there is any documentation, I would be happy to read

I don't follow?

We haven't implemented a different model? / what model are you talking about?

Shak2000 · 2022-08-23T01:43:46Z

One format/model is that we get vtt from the municipality and we convert it into our own model. What is the logic behind this?

evamaxfield · 2022-08-23T02:14:42Z

For multiple reasons:

1. VTT files are for closed caption services, so they are sometimes chunked up in really odd ways. Because we want full sentences, we store the data in sentence format.
2. VTT files rarely have good "casing", i.e. they are usually ALL CAPS. We want them to look more "transcript-y".
3. We can write a single function to convert from VTT to our format here instead of:
   a. writing a function to render both VTT and transcript on the frontend
   b. writing a function to process both VTT and transcript in any of our analysis and processing functions

short answer: "do the conversion during the pipeline so we don't have to have MORE code and duplicate processing downstream"

our format is much more extensive, we can add annotations and more analysis driven stuff to it. VTT is really just a text + timestamp format.
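
To make the first two reasons concrete, here is a minimal sketch of merging caption chunks into sentences and fixing the casing. It is an illustration only, not the cdp-backend implementation: the `Caption` and `Sentence` classes and the `captions_to_sentences` function are hypothetical names, and it assumes a sentence only ever ends at the end of a caption chunk.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Caption:
    """One closed-caption cue (hypothetical stand-in for a parsed VTT cue)."""

    start: float
    end: float
    text: str


@dataclass
class Sentence:
    """One transcript sentence (hypothetical, far simpler than the CDP model)."""

    start: float
    end: float
    text: str


def captions_to_sentences(captions: List[Caption]) -> List[Sentence]:
    """Join caption chunks until a chunk ends with sentence-ending punctuation."""
    sentences: List[Sentence] = []
    buffer: List[str] = []
    start: Optional[float] = None
    for cap in captions:
        chunk = cap.text.strip()
        if start is None:
            # The sentence starts when its first chunk starts.
            start = cap.start
        buffer.append(chunk)
        if re.search(r"[.?!]$", chunk):
            # str.capitalize lowercases everything after the first character,
            # so proper nouns lose their capitals; real casing needs more care.
            sentences.append(Sentence(start, cap.end, " ".join(buffer).capitalize()))
            buffer, start = [], None
    if buffer and start is not None:
        # Flush trailing text that never reached sentence-ending punctuation.
        sentences.append(
            Sentence(start, captions[-1].end, " ".join(buffer).capitalize())
        )
    return sentences
```

With this shape, downstream code only ever sees sentences with a start and end time, regardless of how the caption provider chunked the text.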

Shak2000 · 2022-08-23T05:52:06Z

Analysis:

1. >> does indeed represent a change in speaker. I listened to the video and compared it to the .vtt file.
2. In the .vtt file, there is no hint as to who the speakers are. Thus, the transcript lists every participant in the conversation with:
   "speaker_index":0,
   "speaker_name":null,
3. There are a few places where the .vtt file has a -. It seems that this occurs whenever there is a disconnect in the audio or the sound is unclear. The transcriber adds the - as is.
4. There are 4 places (indices 33, 47, 49 and 97) where the .vtt file has a - followed by a change in speaker (the next word starts with >>). In these cases, the transcriber represents the change in speaker using ->>.

Solution: I wrote a very simple solution for this problem in the '_normalize_text' function, which replaces >> with a blank. However:

1. It is too specific to the Oakland instance. I want to generalize it or enable each CDP instance to specify unwanted words.
2. I won't rush to push the specific solution because I do not have a way to test that it did not break any other CDP instance (or are we alright with removing >> for every instance?).

WDYT?
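
For discussion, the "each instance specifies unwanted words" idea could look roughly like the sketch below. This is not the actual `_normalize_text` from cdp-backend: the `normalize_text` function, its `unwanted_markers` parameter, and the default marker list are all hypothetical. Note that stripping "->>" as a unit also drops the "-" that marked the cut-off audio.

```python
import re
from typing import Iterable

# Assumption: the markers an instance wants removed. Oakland's captions
# use ">>" and "->>" for speaker changes.
DEFAULT_UNWANTED_MARKERS = ("->>", ">>")


def normalize_text(
    text: str,
    unwanted_markers: Iterable[str] = DEFAULT_UNWANTED_MARKERS,
) -> str:
    """Remove speaker-change markers and collapse leftover whitespace.

    Hypothetical helper for illustration only.
    """
    # Remove longer markers first so "->>" goes away whole instead of
    # leaving a stray "-" behind after ">>" is stripped.
    for marker in sorted(unwanted_markers, key=len, reverse=True):
        text = text.replace(marker, " ")
    return re.sub(r"\s+", " ", text).strip()
```

An instance with no such markers would pass an empty tuple and get its text back with only the whitespace normalized.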

dphoria · 2022-08-25T03:40:37Z

First, thank you Shak. Looks like you have figured out what those >> and - (thus ->>) are.

I can see pros and cons for your idea of allowing an instance to specify unwanted characters in transcripts. Do we do something like that already? i.e. Do we do some instance-specific "thing" during data cleanup/import into our models? If the answer is no (and it's OK if you don't know; I don't 😅 ), then I think the proposal may be overengineering. i.e. Just clean those out of every transcript for any instance, like you suggested in _normalize_text() or wherever appropriate.

That's my 2 cents for now...

tohuynh · 2022-09-11T04:16:58Z

> Similarly there are also symbols I have never seen in a closed caption file like: "->> " which I assume also denote a new speaker.

If this is true, you'd just need to add an optional - to the regex here: https://github.com/CouncilDataProject/cdp-backend/blob/main/cdp_backend/sr_models/webvtt_sr_model.py#L59?

Btw, in webvtt_sr_model.py, there is a default new turn pattern &gt; (which is >).
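
As a rough illustration of the "optional -" idea, the sketch below splits caption text into speaker turns with a pattern that accepts both >> and ->>. The pattern and the `split_into_turns` function are assumptions for this example, not the regex at the linked line, and the sketch assumes the caption text has already been HTML-unescaped.

```python
import re
from typing import List

# Assumption: a new-turn pattern that matches ">>" with an optional
# leading "-", plus any whitespace that follows the marker.
NEW_TURN_PATTERN = re.compile(r"-?>>\s*")


def split_into_turns(caption_text: str) -> List[str]:
    """Split caption text into one string per speaker turn."""
    parts = NEW_TURN_PATTERN.split(caption_text)
    # Drop empty pieces, e.g. the one before a marker at the very start.
    return [part.strip() for part in parts if part.strip()]
```

Because the "-" is part of the match, a cut-off word directly before a speaker change ends its turn without the trailing dash.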

Shak2000 · 2022-09-25T02:46:56Z

I want to summarize the current status: I found out that we don't need to change anything in the CDP program. There is already a config option in the CDP instance, caption_new_speaker_turn_pattern, that sets the pattern for a change of speaker. As a result, I canceled the old pull request and created a new one in the Oakland instance.
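
For readers following along, here is a minimal sketch of how an instance-level option can override a library default. Only the option name caption_new_speaker_turn_pattern comes from this thread; the `InstanceConfig` class, the `resolve_turn_pattern` function, and the pattern values in the usage note are hypothetical and may not match what the Oakland instance actually sets.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceConfig:
    """Hypothetical per-instance settings; only the field name is from the thread."""

    caption_new_speaker_turn_pattern: Optional[str] = None


def resolve_turn_pattern(config: InstanceConfig, default: str) -> re.Pattern:
    """Use the instance's pattern when it is set, else fall back to the default."""
    pattern = config.caption_new_speaker_turn_pattern or default
    return re.compile(pattern)
```

An instance that sets nothing keeps the default, while an instance such as Oakland could supply something like `InstanceConfig(caption_new_speaker_turn_pattern=r"-?>>")` without any change to the shared code.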

evamaxfield · 2022-09-26T16:54:34Z

Handled on the oakland instance side. Merged and released. Will check all is well after new event is published.

evamaxfield added the bug label Aug 18, 2022

Shak2000 mentioned this issue Aug 27, 2022

bugfix/webvtt-parsing-for-oakland #206

Closed

evamaxfield closed this as completed Sep 26, 2022
