Create an video operator using CLIP #357

Open
Tracked by #354
dennyabrain opened this issue Jun 14, 2024 · 3 comments · Fixed by #369
@dennyabrain
Contributor

No description provided.

@aatmanvaidya changed the title from "Create an operator for video clustering and benchmark it" to "Create an video operator using CLIP" on Aug 5, 2024
@aatmanvaidya
Collaborator

@Snehil-Shah let's use this issue to track the work on just a simple video CLIP operator that takes an image as input and gives an embedding as output. (Do add the I-frame approach here.)

Can you reply to this issue so that I can assign it to you?

@Snehil-Shah
Contributor

Snehil-Shah commented Aug 8, 2024

CLIP-ViT-base-patch32 newer pipeline:

| Video Length (file size) | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 1.15 | 40.6 MiB |
| 1m (8.86 MB) | 2.16 | 147.4 MiB |
| 5m (42.8 MB) | 10.89 | 724.9 MiB |
| 10m (85.83 MB) | 28.67 | 1.4 GiB |
| 15m (128.65 MB) | 38.26 | 2.1 GiB |
| 20m (171.65 MB) | 58.89 | 2.9 GiB |
| 25m (214.41 MB) | 57.3 | 3.5 GiB |
| 30m (257.29 MB) | 77 | 4.2 GiB |
| 45m (385.94 MB) | killed | killed |
| 1h (421.13 MB) | killed | killed |

The results are surprising.

Some pointers regarding the behavior of the operator:

  • Profiles are inconsistent. For instance, I measured the 20m video twice (as I noticed some discrepancies in the data): once it took 58.89s (as written above) and another time it took 67s.
  • Maybe the video encoding really affects the behavior? I used ffmpeg to loop a base video to the desired lengths to create the benchmarking data. I also once used an online video looper service to generate a 30m video. With the ffmpeg video, the RAM usage is 4.2 GiB (as stated above), and with the online service export it was merely 1.1 GiB. This behavior is definitely strange.
  • The difference in RAM usage between the older CLIP pipeline and the newer pipeline is also strange. But to be fair, we were previously using a different API (sentence-transformers) for quick testing, and now we are using raw transformers with batch processing. So, the internal implementations might be different.
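
Since single runs vary, one way to make the profiles easier to compare is to repeat each measurement and report the spread. This is a minimal sketch using only the standard library; `profile_cpu_time` and `summarize` are hypothetical helpers (not part of the project), and the callable passed in stands for whatever invokes the operator on a video.

```python
import statistics
import time


def profile_cpu_time(fn, repeats=3):
    """Call fn() `repeats` times and return the CPU time of each run in seconds.

    time.process_time() counts user + system CPU time of the current
    process only, so work done in child processes is not included.
    """
    timings = []
    for _ in range(repeats):
        start = time.process_time()
        fn()
        timings.append(time.process_time() - start)
    return timings


def summarize(timings):
    """Return mean, min, max and sample standard deviation of the timings."""
    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
        # stdev needs at least two samples
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }
```

For the two 20m measurements above (58.89s and 67s), `summarize([58.89, 67.0])` gives a mean of about 62.9s, which makes the run-to-run variation visible instead of hiding it behind a single number.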

Here are the previous profiles with the older pipeline:

ResNet18:

| Video Length (file size) | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 3.34 | 106.7 MiB |
| 1m (8.86 MB) | 8.87 | 107.8 MiB |
| 5m (42.8 MB) | 58.37 | 110.3 MiB |
| 10m (85.83 MB) | 79 | 116.6 MiB |

CLIP-ViT-base-patch32:

| Video Length (file size) | CPU Time (s) | RAM Usage |
|---|---|---|
| 30s (1.92 MB) | 9.87 | 1.1 GiB |
| 1m (8.86 MB) | 17.53 | 1.1 GiB |
| 5m (42.8 MB) | 78 | 1.1 GiB |
| 10m (85.83 MB) | 175 | 1.1 GiB |

Comparisons:

  • The compute time is much lower, possibly because of more selective frame sampling (around 0.2 times the number of frames processed in the older pipeline). However, there are inconsistencies, as mentioned above, and the RAM usage is demanding.

  • The operator also failed to process videos longer than 30 minutes (the processes were killed). On the other hand, continuing the discussion above, I took the 30m video from the online service (the one with much lower RAM usage), looped it using ffmpeg, and the operator was able to process the resulting 1h and even 2h videos successfully.
    I think it's safe to conclude that video length is not the right threshold for deciding when to stop processing a video. I think it's safe to rely on the SIGKILL signal (sent on memory exhaustion) as the show-stopper instead.
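
A minimal sketch of treating SIGKILL as the stop condition, assuming the operator runs in a child process on a POSIX system. SIGKILL cannot be caught by the process that receives it, so a parent has to observe it through the child's exit status. `run_guarded` and `python_cmd` are hypothetical helper names, not part of the project, and a SIGKILL can also come from sources other than memory exhaustion.

```python
import signal
import subprocess
import sys


def python_cmd(*args):
    """Build a command line that runs the current Python interpreter."""
    return [sys.executable, *args]


def run_guarded(cmd):
    """Run cmd in a child process and report how it ended.

    Returns (ok, reason). On POSIX, subprocess reports a child that was
    terminated by signal N with a return code of -N.
    """
    proc = subprocess.run(cmd)
    if proc.returncode == 0:
        return True, "ok"
    if proc.returncode == -signal.SIGKILL:
        return False, "killed by SIGKILL (possibly out of memory)"
    return False, f"exited with code {proc.returncode}"
```

With this shape, the caller can skip or flag a video when `run_guarded` reports a SIGKILL, without the worker that loads the model being taken down along with it.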
