Parse closed caption files for Oakland better #203
Comments
I can work on this on weekends. Having said that, between work and school I have little free time, unfortunately. So @Shak2000 if you do have time and are willing, that would be better. I will wait until this weekend and then go from there. Thank you Eva! |
Let me take a look |
I am not very familiar with transcripts, so I want to write a unit test for this case first. To make this possible, could you please send me a link to example pages on the Oakland website, as well as an example transcript or audio file? |
@Shak2000 example webpage with transcript: https://councildataproject.org/oakland/#/events/e0912a619d69
Script for downloading Oakland resources:
from cdp_backend.database import models as db_models
from cdp_backend.utils import file_utils
from cdp_backend.pipeline.transcript_model import Transcript
import fireo
from gcsfs import GCSFileSystem
from google.auth.credentials import AnonymousCredentials
from google.cloud.firestore import Client
# Connect to the database
fireo.connection(client=Client(
    project="cdp-oakland-ba81c097",
    credentials=AnonymousCredentials()
))
# Get event
event = db_models.Event.collection.get("event/e0912a619d69")
# Get session
session = list(
    db_models.Session.collection.filter("event_ref", "==", event.key).fetch()
)[0]
# Get transcript
transcript = list(
    db_models.Transcript.collection.filter("session_ref", "==", session.key).fetch()
)[0]
transcript_file = transcript.file_ref.get()
# Connect to filestore
fs = GCSFileSystem(project="cdp-oakland-ba81c097", token="anon")
# Download transcript
fs.get(transcript_file.uri, "oakland-transcript.json")
# Download captions
file_utils.resource_copy(
    "https://oakland.granicus.com//videos/5042/captions.vtt",
    "oakland-captions.vtt",
    overwrite=True,
)
# If you want to read the transcript as a Python object
with open("oakland-transcript.json", "r") as open_f:
    read_transcript = Transcript.from_json(open_f.read())
print(read_transcript)
Copy and paste that into a file and call it with python and it will download two files, |
I managed to add the unit tests to the I can now debug the |
I don't follow. We haven't implemented a different model. What model are you talking about? |
One format/model is that we get |
For multiple reasons:
Short answer: "do the conversion during the pipeline so we don't have to have MORE code and duplicate processing downstream". Our format is much more extensive: we can add annotations and more analysis-driven stuff to it. VTT is really just a text + timestamp format. |
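To make the "text + timestamp" point concrete, here is a minimal sketch of reading VTT cues with only the standard library. It is an illustration under stated assumptions, not the parser used in cdp_backend: the `Cue` class and `parse_vtt` function are hypothetical names, and the sketch assumes well-formed cue blocks separated by blank lines.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """Hypothetical container: a VTT cue is only timestamps plus text."""
    start: str
    end: str
    text: str


def parse_vtt(content: str) -> List[Cue]:
    """Parse well-formed VTT content into cues (illustrative, not exhaustive)."""
    cues: List[Cue] = []
    # Blocks (header, cues, notes) are separated by blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                start, _, rest = line.partition("-->")
                # Anything after the end timestamp is cue settings; drop it.
                end = rest.split()[0]
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append(Cue(start.strip(), end, text))
                break
    return cues


sample = (
    "WEBVTT\n\n"
    "00:00:01.000 --> 00:00:03.500\n>> Good evening.\n\n"
    "00:00:03.500 --> 00:00:05.000 align:start\nThank you.\n"
)
cues = parse_vtt(sample)
```

Everything a cue carries is visible in the three fields above, which is why annotations and analysis output need a richer format than VTT.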
Analysis:
Solution: I wrote a very simple solution for this problem in the '_normalize_text' function, which replaces
WDYT? |
First, thank you Shak. Looks like you have figured out what those I can see pros and cons for your idea of allowing an instance to specify unwanted characters in transcripts. Do we do something like that already? i.e. Do we do some instance-specific "thing" during data cleanup/import into our models? If the answer is no (and it's OK if you don't know; I don't 😅 ), then I think the proposal may be overengineering. i.e. Just clean those out of every transcript for any instance, like you suggested in That's my 2 cents for now... |
If this is true, you'd just need to add an optional Btw, in the |
I want to summarize the current status: I found out that we don't need to change anything in the CDP program. There is already a config in the CDP instance |
Handled on the oakland instance side. Merged and released. Will check all is well after new event is published. |
I fixed the cookiecutter / infra build and deployed Oakland!
There are still minor problems with permissions / CORS settings so ignore the broken video (I will fix that tomorrow)
That all said, Oakland has closed caption files, which we parse and convert into our transcript format like we do for Seattle and Boston. Once again, there are some minor differences from Boston and Seattle that I would love to be able to fix.
example of oakland event: https://councildataproject.org/oakland/#/events/e0912a619d69
the transcript includes a bunch of
">> "
markings where there are new speakers. To me these should be filtered out properly. Similarly, there are also symbols I have never seen in a closed caption file, like:
"->> "
which I assume also denote a new speaker.
cc @isaacna @Shak2000 @dphoria any of you free to take this one?
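For illustration, here is a rough sketch of what filtering these markers could look like. It is not the cdp_backend implementation: `SPEAKER_MARKER`, `strip_speaker_markers`, and `split_on_speaker_markers` are hypothetical names, and the pattern assumes that ">>" (optionally preceded by "-") never appears as legitimate caption text.

```python
import re
from typing import List

# Assumption: ">> " and "->> " are the only speaker-change markers.
SPEAKER_MARKER = re.compile(r"-?>>\s*")


def strip_speaker_markers(text: str) -> str:
    """Remove the markers and collapse any leftover whitespace."""
    cleaned = SPEAKER_MARKER.sub(" ", text)
    return " ".join(cleaned.split())


def split_on_speaker_markers(text: str) -> List[str]:
    """Split text into one chunk per speaker turn, dropping empty chunks."""
    return [part.strip() for part in SPEAKER_MARKER.split(text) if part.strip()]


caption = ">> Good evening. ->> Thank you."
cleaned = strip_speaker_markers(caption)
turns = split_on_speaker_markers(caption)
```

Splitting rather than stripping keeps the speaker-turn boundaries, which may be worth preserving if the markers really do denote a new speaker.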