feature/hash-file-contents #60

Merged
evamaxfield merged 5 commits into main from feature/hash-file-contents
Jun 6, 2021

Conversation

evamaxfield
Member

Pull request recommendations:

  • Name your pull request your-development-type/short-description. Ex: feature/read-tiff-files
  • Link to any relevant issue in the PR description. Ex: Resolves [Add builds for all platforms #12], adds tiff file format support

Resolves #51

  • Provide context of changes.

Hashes the file content in buffered chunk reads, so that massive files are never loaded into memory all at once. Additionally updates the pipeline slightly to handle this change.
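
As a rough illustration of the chunked hashing described above, here is a minimal sketch. The function name `hash_file_in_chunks`, the default buffer size, and the choice of SHA-256 are assumptions made for this example (the 64-character hex name in the log below is consistent with SHA-256, but this PR description does not confirm the algorithm); it is not the code added in this PR.

```python
import hashlib


def hash_file_in_chunks(path: str, buffer_size: int = 2**16) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Hypothetical helper for illustration only.
    """
    hasher = hashlib.sha256()  # assumption: SHA-256
    with open(path, "rb") as open_resource:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF),
        # so at most `buffer_size` bytes are held in memory at a time.
        for chunk in iter(lambda: open_resource.read(buffer_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Because the digest is updated incrementally, the result is the same as hashing the whole file in one read, while memory use stays bounded by the buffer size.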

We should download the file once, instead of hashing the file remotely and then downloading it again for audio splitting.

So now it is:

  1. download
  2. hash
  3. split audio
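
The ordering above could be sketched roughly as follows. All function names here are hypothetical stand-ins, not the pipeline's actual tasks: the "download" is a local file copy, the hash is assumed to be SHA-256, and the audio step only derives the output name in the `<hash>_audio.wav` pattern seen in the log below rather than running any audio extraction.

```python
import hashlib
import shutil
from pathlib import Path


def copy_resource(uri: str, dst_dir: Path) -> Path:
    """Hypothetical stand-in for the download step: copies a local file."""
    dst = dst_dir / Path(uri).name
    shutil.copyfile(uri, dst)
    return dst


def hash_contents(path: Path, buffer_size: int = 2**16) -> str:
    """Hash the local copy in chunks (assumption: SHA-256)."""
    hasher = hashlib.sha256()
    with open(path, "rb") as open_resource:
        for chunk in iter(lambda: open_resource.read(buffer_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audio_name_for(content_hash: str) -> str:
    """Hypothetical stand-in for audio splitting: returns only the output name."""
    return f"{content_hash}_audio.wav"


def process_video(uri: str, work_dir: Path) -> str:
    local_copy = copy_resource(uri, work_dir)  # 1. download (once)
    content_hash = hash_contents(local_copy)   # 2. hash the local copy
    return audio_name_for(content_hash)        # 3. split audio (name only here)
```

Since the name depends only on the file's bytes, the same video reached through two different URIs would map to the same audio file name in this sketch.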

I also updated the deps and fixed the one test that broke because of dep changes.

  • Provide relevant tests for your feature or bug fix.

I also ran make run-min-event-pipeline as a test, which I am somewhat tempted to add as CI here. 🤷 We will see. Log below:

Log of Min Event Pipeline
(cdp-backend) maxfield@maxfield-ubuntu:~/active/cdp/cdp-backend$ make run-min-event-pipeline key=.keys/cdp-jackson-dev-001-sa-dev.json 
run_cdp_event_gather \
	-g .keys/cdp-jackson-dev-001-sa-dev.json \
	-e cdp_backend.tests.pipeline.test_event_gather_pipeline.min_get_events
[INFO: run_cdp_event_gather:  87 2021-06-05 16:51:39,830] Defaulting to bucket: cdp-jackson-dev-001.appspot.com
[2021-06-05 16:51:39-0700] INFO - prefect.FlowRunner | Beginning Flow run for 'CDP Event Gather Pipeline'
[INFO: flow_runner: 245 2021-06-05 16:51:39,836] Beginning Flow run for 'CDP Event Gather Pipeline'
[2021-06-05 16:51:39-0700] INFO - prefect.TaskRunner | Task 'create_body_from_ingestion_model': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:39,861] Task 'create_body_from_ingestion_model': Starting task run...
[2021-06-05 16:51:39-0700] INFO - prefect.TaskRunner | Task 'create_body_from_ingestion_model': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:39,869] Task 'create_body_from_ingestion_model': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:39-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:39,885] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:40,299] Saved new Body with document id=XyoFvuqxeoM2eQzNwyqe.
[2021-06-05 16:51:40-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:40,312] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:40-0700] INFO - prefect.TaskRunner | Task 'create_event_from_ingestion_model': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:40,329] Task 'create_event_from_ingestion_model': Starting task run...
[2021-06-05 16:51:40-0700] INFO - prefect.TaskRunner | Task 'create_event_from_ingestion_model': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:40,337] Task 'create_event_from_ingestion_model': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:40-0700] INFO - prefect.TaskRunner | Task 'external_resource_copy_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:40,352] Task 'external_resource_copy_task': Starting task run...
[INFO: file_utils: 101 2021-06-05 16:51:51,969] Stored external resource copy: /home/maxfield/active/cdp/cdp-backend/brief_072219_2011957V.mp4
[2021-06-05 16:51:51-0700] INFO - prefect.TaskRunner | Task 'external_resource_copy_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:51,981] Task 'external_resource_copy_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:51-0700] INFO - prefect.TaskRunner | Task 'hash_file_contents_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:51,998] Task 'hash_file_contents_task': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'hash_file_contents_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,081] Task 'hash_file_contents_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'List': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,098] Task 'List': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'List': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,107] Task 'List': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'join_strs_and_extension': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,124] Task 'join_strs_and_extension': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'join_strs_and_extension': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,132] Task 'join_strs_and_extension': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,150] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:52,476] Saved new Event with document id=Iko1YoRG9wUNp1ZtivfQ.
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,490] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'get_file_uri_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,508] Task 'get_file_uri_task': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'get_file_uri_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,809] Task 'get_file_uri_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'case(None)': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,826] Task 'case(None)': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'case(None)': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,834] Task 'case(None)': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'create_session_from_ingestion_model': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,850] Task 'create_session_from_ingestion_model': Starting task run...
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'create_session_from_ingestion_model': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:52,858] Task 'create_session_from_ingestion_model': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:52-0700] INFO - prefect.TaskRunner | Task 'split_audio_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:52,873] Task 'split_audio_task': Starting task run...
[INFO: file_utils: 154 2021-06-05 16:51:54,675] Stored audio: 5c707a68c02956529f3d15ae2934041b50447911eb5843ce2e907b46e440aa48_audio.wav
[2021-06-05 16:51:54-0700] INFO - prefect.TaskRunner | Task 'split_audio_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:54,688] Task 'split_audio_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:54-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:54,705] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:55,146] Saved new Session with document id=mLV8TVYQ5spVCNVJeBQt.
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:55,157] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[0]': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:55,172] Task 'split_audio_task[0]': Starting task run...
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[0]': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:55,181] Task 'split_audio_task[0]': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[1]': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:55,197] Task 'split_audio_task[1]': Starting task run...
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[1]': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:55,205] Task 'split_audio_task[1]': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[2]': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:55,221] Task 'split_audio_task[2]': Starting task run...
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'split_audio_task[2]': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:55,229] Task 'split_audio_task[2]': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:55-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:55,246] Task 'upload_file_task': Starting task run...
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:57,509] Task 'upload_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:57,525] Task 'remove_local_file_task': Starting task run...
[INFO: functions: 145 2021-06-05 16:51:57,535] Removed /home/maxfield/active/cdp/cdp-backend/brief_072219_2011957V.mp4 from local file system.
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:57,543] Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:57,559] Task 'upload_file_task': Starting task run...
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:57,854] Task 'upload_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:57-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:57,870] Task 'upload_file_task': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,172] Task 'upload_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,189] Task 'create_filename_from_filepath': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,197] Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,214] Task 'create_file': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,223] Task 'create_file': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,239] Task 'create_filename_from_filepath': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,248] Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,266] Task 'create_file': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,274] Task 'create_file': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,292] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:58,587] Saved new File with document id=8K0LOfBL80gvBSlaXk2l.
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,601] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,617] Task 'remove_local_file_task': Starting task run...
[INFO: functions: 145 2021-06-05 16:51:58,617] Removed /home/maxfield/active/cdp/cdp-backend/5c707a68c02956529f3d15ae2934041b50447911eb5843ce2e907b46e440aa48_audio.err from local file system.
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,625] Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,641] Task 'create_filename_from_filepath': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,649] Task 'create_filename_from_filepath': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,665] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:58,899] Saved new File with document id=uFu3NDzpQkaFVOtMBUr3.
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,912] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,930] Task 'remove_local_file_task': Starting task run...
[INFO: functions: 145 2021-06-05 16:51:58,936] Removed /home/maxfield/active/cdp/cdp-backend/5c707a68c02956529f3d15ae2934041b50447911eb5843ce2e907b46e440aa48_audio.wav from local file system.
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,945] Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,966] Task 'create_file': Starting task run...
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'create_file': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:58,975] Task 'create_file': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:58-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:58,991] Task 'upload_db_model_task': Starting task run...
[INFO: functions: 117 2021-06-05 16:51:59,241] Saved new File with document id=Zf49sD1PKkFA3K16EgnP.
[2021-06-05 16:51:59-0700] INFO - prefect.TaskRunner | Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:59,255] Task 'upload_db_model_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:59-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Starting task run...
[INFO: task_runner: 242 2021-06-05 16:51:59,270] Task 'remove_local_file_task': Starting task run...
[INFO: functions: 145 2021-06-05 16:51:59,270] Removed /home/maxfield/active/cdp/cdp-backend/5c707a68c02956529f3d15ae2934041b50447911eb5843ce2e907b46e440aa48_audio.out from local file system.
[2021-06-05 16:51:59-0700] INFO - prefect.TaskRunner | Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[INFO: task_runner: 336 2021-06-05 16:51:59,278] Task 'remove_local_file_task': Finished task run for task with final state: 'Success'
[2021-06-05 16:51:59-0700] INFO - prefect.FlowRunner | Flow run SUCCESS: all reference tasks succeeded
[INFO: flow_runner: 710 2021-06-05 16:51:59,281] Flow run SUCCESS: all reference tasks succeeded
  • Provide or update documentation for any feature added by your pull request.

Here is the new pipeline DAG:

[Image: new pipeline DAG]

Thanks for contributing!

@evamaxfield evamaxfield added the enhancement New feature or request label Jun 5, 2021
@evamaxfield evamaxfield requested a review from isaacna June 5, 2021 23:58
@evamaxfield evamaxfield self-assigned this Jun 5, 2021

codecov bot commented Jun 6, 2021

Codecov Report

Merging #60 (9141ba1) into main (5f09efc) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head 9141ba1 differs from pull request most recent head be3d74e. Consider uploading reports for the commit be3d74e to get more accurate results

@@            Coverage Diff             @@
##             main      #60      +/-   ##
==========================================
+ Coverage   98.60%   98.63%   +0.03%     
==========================================
  Files          42       42              
  Lines        1291     1321      +30     
==========================================
+ Hits         1273     1303      +30     
  Misses         18       18              
| Impacted Files | Coverage Δ |
|---|---|
| cdp_backend/pipeline/event_gather_pipeline.py | 100.00% <100.00%> (ø) |
| cdp_backend/tests/file_store/test_functions.py | 100.00% <100.00%> (ø) |
| ...ckend/tests/pipeline/test_event_gather_pipeline.py | 100.00% <100.00%> (ø) |
| cdp_backend/tests/utils/test_file_utils.py | 100.00% <100.00%> (ø) |
| cdp_backend/utils/file_utils.py | 93.42% <100.00%> (+1.75%) ⬆️ |
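
The headline percentages in the coverage diff can be checked from the reported hit and line counts. The short sketch below does that arithmetic; the helper name is made up for this example, and the use of truncation (rather than rounding) to two decimals is only an observation that fits these particular numbers, not a statement about how Codecov computes them.

```python
def truncated_coverage_hundredths(hits: int, lines: int) -> int:
    """Coverage in hundredths of a percent, truncated rather than rounded.

    Hypothetical helper; integer arithmetic avoids floating-point surprises.
    """
    return hits * 10000 // lines


# Hits and Lines for main and for #60, taken from the coverage diff.
base = truncated_coverage_hundredths(1273, 1291)
head = truncated_coverage_hundredths(1303, 1321)

print(f"main: {base / 100:.2f}%")
print(f"#60:  {head / 100:.2f}%")
print(f"diff: {(head - base) / 100:+.2f}%")
```

With these inputs the values come out to 98.60%, 98.63%, and +0.03%, matching the report. Plain rounding would instead give 98.61% and 98.64%.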

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Comment on lines +93 to +94
video_uri: str
The uri to the video file to split audio from.
Collaborator

Thanks for adding the documentation here, I must've forgotten this earlier.

Collaborator

@isaacna isaacna left a comment

Looks good, thanks for adding this improvement!

@evamaxfield evamaxfield merged commit 47d9fc5 into main Jun 6, 2021
@evamaxfield evamaxfield deleted the feature/hash-file-contents branch June 6, 2021 06:32
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Event gather pipeline: Hash videos by content rather than URL
2 participants