Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature/process-local-file #89

Merged

Conversation

isaacna
Copy link
Collaborator

@isaacna isaacna commented Aug 17, 2021

Link to Relevant Issue

This pull request resolves #57

Description of Changes

Adds a script for processing a local video file. event_details_file is just an EventIngestionModel in JSON format.

Example run:

process_local_file --local_video_file wombo_combo.mp4 --event_details_file event-details.json --event_gather_config_file event-gather-config.json

I also made the resource_copy function create the file in a temporary directory if dst is not provided. I did this because otherwise the resource_copy will copy into the repo directory. This usually isn't a problem, but for example if the video file provided in the script lives in cdp_backend, then we'll get a FileExistsError during resource copy and the pipeline will error out.

@codecov
Copy link

codecov bot commented Aug 17, 2021

Codecov Report

Merging #89 (aeb813a) into main (c98f609) will decrease coverage by 0.16%.
The diff coverage is 63.63%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main      #89      +/-   ##
==========================================
- Coverage   95.88%   95.72%   -0.17%     
==========================================
  Files          48       48              
  Lines        2238     2247       +9     
==========================================
+ Hits         2146     2151       +5     
- Misses         92       96       +4     
Impacted Files Coverage Δ
cdp_backend/pipeline/event_gather_pipeline.py 87.28% <42.85%> (-0.94%) ⬇️
cdp_backend/database/validators.py 97.43% <100.00%> (ø)
cdp_backend/utils/file_utils.py 94.79% <100.00%> (+0.16%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c98f609...aeb813a. Read the comment docs.

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. it's pretty neat that for this many lines of code we are adding a new feature, like go team on making it robust like this.
  2. one comment on general design of how to accept this file and pass to pipeline
  3. one nitpick comment on tempdir management, can ignore if you want.

Other than that, looks pretty sane to me.

cdp_backend/bin/process_local_file.py Outdated Show resolved Hide resolved
cdp_backend/utils/file_utils.py Show resolved Hide resolved
@evamaxfield
Copy link
Member

OH! And when attaching the URI back to the session model, add the ?alt=media to the end. Unless that breaks the pipeline?

@isaacna
Copy link
Collaborator Author

isaacna commented Aug 18, 2021

OH! And when attaching the URI back to the session model, add the ?alt=media to the end. Unless that breaks the pipeline?

Unfortunately this does break the pipeline when we try to do resource_copy on the file in GCS, we get a FileNotFoundError. Just curious, but what exactly was the reason for appending that to the URI?

@evamaxfield
Copy link
Member

evamaxfield commented Aug 18, 2021

OH! And when attaching the URI back to the session model, add the ?alt=media to the end. Unless that breaks the pipeline?

Unfortunately this does break the pipeline when we try to do resource_copy on the file in GCS, we get a FileNotFoundError. Just curious, but what exactly was the reason for appending that to the URI?

Oh whoops. I event added it in the wrong place myself. It is needed for google file storage to recognize that you aren't trying to download or copy the file but rather stream the file.

But what I meant by "I added it in the wrong place" is that you can keep it how it is for all the file storage and processing and such (exactly how it is now) but then during the database upload task, can you pass a parameter that says like "from_local" or something, and if it is local the session_video_uris need to be converted to their https:// equivalents. https://storage.googleapis.com.... Which is fortunately really easy to get:

In [1]: from gcsfs import GCSFileSystem

In [2]: fs = GCSFileSystem(token="../cdp-backend/.keys/cdp-jackson-dev-002-sa-dev.json")

In [3]: fs.url("cdp-jackson-dev-002.appspot.com/9af4adb4f28b34c0142a2ab28fa5603dd5fdb347c074d957d13636dae554ed34-audio.wav")
Out[3]: 'https://storage.googleapis.com/download/storage/v1/b/cdp-jackson-dev-002.appspot.com/o/9af4adb4f28b34c0142a2ab28fa5603dd5fdb347c074d957d13636dae554ed34-audio.wav?alt=media'

It even adds that ?alt=media to the URL!

TL;DR: it's need for the front-end streaming of the file and not the backend processing. So only change it right before database storage of the URI.

@isaacna
Copy link
Collaborator Author

isaacna commented Aug 19, 2021

OH! And when attaching the URI back to the session model, add the ?alt=media to the end. Unless that breaks the pipeline?

Unfortunately this does break the pipeline when we try to do resource_copy on the file in GCS, we get a FileNotFoundError. Just curious, but what exactly was the reason for appending that to the URI?

Oh whoops. I event added it in the wrong place myself. It is needed for google file storage to recognize that you aren't trying to download or copy the file but rather stream the file.

But what I meant by "I added it in the wrong place" is that you can keep it how it is for all the file storage and processing and such (exactly how it is now) but then during the database upload task, can you pass a parameter that says like "from_local" or something, and if it is local the session_video_uris need to be converted to their https:// equivalents. https://storage.googleapis.com.... Which is fortunately really easy to get:

In [1]: from gcsfs import GCSFileSystem

In [2]: fs = GCSFileSystem(token="../cdp-backend/.keys/cdp-jackson-dev-002-sa-dev.json")

In [3]: fs.url("cdp-jackson-dev-002.appspot.com/9af4adb4f28b34c0142a2ab28fa5603dd5fdb347c074d957d13636dae554ed34-audio.wav")
Out[3]: 'https://storage.googleapis.com/download/storage/v1/b/cdp-jackson-dev-002.appspot.com/o/9af4adb4f28b34c0142a2ab28fa5603dd5fdb347c074d957d13636dae554ed34-audio.wav?alt=media'

It even adds that ?alt=media to the URL!

TL;DR: it's need for the front-end streaming of the file and not the backend processing. So only change it right before database storage of the URI.

Ah gotcha that makes sense, I didn't know it had that use case for the front end. Will do

@isaacna
Copy link
Collaborator Author

isaacna commented Aug 19, 2021

Just curious but does the front end have a way of streaming the video of something stored in GCS using the HTTP url?

@isaacna
Copy link
Collaborator Author

isaacna commented Aug 19, 2021

Not sure why the graphviz fails to install on ubuntu. I added some changes to the github workflow, but I think the workflow on this PR runs off of the main cdp_backend and not my changes.

@evamaxfield
Copy link
Member

Just curious but does the front end have a way of streaming the video of something stored in GCS using the HTTP url?

Yep! See:
Screen Shot 2021-08-19 at 07 47 11

From: https://councildataproject.org/seattle/#/events/2ee8dfd2-a4a5-416f-aded-f48b37e6e521

@evamaxfield
Copy link
Member

Not sure why the graphviz fails to install on ubuntu. I added some changes to the github workflow, but I think the workflow on this PR runs off of the main cdp_backend and not my changes.

They may have released something and something somewhere is broken in the dep chain, I will wait a bit and retrigger GitHub Actions to run.

Copy link
Member

@evamaxfield evamaxfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to keep this open a second and test the ubuntu graphviz stuff tomorrow morning. But all looks good.

@evamaxfield
Copy link
Member

This actually looks like a github actions release failure.... GitHub released a new build of their ubuntu runner 6 days ago which aligns with when the ubuntu tests started failing.... https://github.com/actions/virtual-environments/releases/tag/ubuntu20%2F20210816.1

@evamaxfield
Copy link
Member

It's actually not graphviz its ffmpeg it looks like.... The docs test works just fine. Investigating.

@evamaxfield
Copy link
Member

Got it figured out. Going to merge this then merge my PR to fix the builds without merge conflicts.

@evamaxfield evamaxfield merged commit 540392e into CouncilDataProject:main Aug 21, 2021
@isaacna isaacna deleted the feature/process_local_file branch September 1, 2021 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add script for processing local file
2 participants