Implementation of Google Gemini captions #8959

Closed · wants to merge 6 commits

Conversation

hunterjm (Contributor)

Google recently released Gemini, which includes multi-modal support, and is available at a free tier of 60 queries/minute. That is more than any single individual would use for this particular implementation of event captioning.

My initial implementation adds a sub_label, populated with a caption from Gemini, to detections that do not currently have one (the default). I've done some initial prompt engineering for person and vehicle, and it works pretty well at detecting deliveries and characteristics.
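For illustration, a minimal sketch of what the captioning call could look like using the `google-generativeai` client and the `gemini-pro-vision` model; the prompt text and the `caption_thumbnail` helper are assumptions for this sketch, not the PR's actual code:

```python
# A minimal sketch, not the PR's actual code: caption an event thumbnail with
# Gemini so the result can be used as the sub_label. The prompt text and the
# caption_thumbnail helper are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # free tier: 60 queries/minute
model = genai.GenerativeModel("gemini-pro-vision")

def caption_thumbnail(thumbnail_path: str, label: str) -> str:
    """Ask Gemini for a short caption of an event thumbnail."""
    image = Image.open(thumbnail_path)
    prompt = (
        f"Briefly describe the {label} in this image, noting clothing, "
        "activity, and any delivery company logos."
    )
    response = model.generate_content([prompt, image])
    return response.text.strip()
```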

[image attachment]

Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results.

I'm opening this as a draft currently because of a few things:

  1. We may want to implement a "description" field for events instead of re-using "sub_label"
  2. Frigate is built on the premise of local AI. This does send the thumbnails to an external Google API to be captioned.
  3. Google will be adding the ability to upload video via the API soon. It could be interesting to add a configuration to send thumbnail or the whole clip for captioning in the future.

LLMs aren't at a point where they can run well enough on the hardware Frigate normally runs on. I'd like to open the door for a conversation on whether something like this should even be included. I personally think enabling it for external cameras and implementing a robust embeddings search would be worth it.
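As a rough sketch of the embeddings-search idea mentioned above, captions could be embedded with an off-the-shelf text encoder and ranked by cosine similarity against a free-text query; the `sentence-transformers` model and the function names here are assumptions, not part of this PR:

```python
# A rough sketch of the proposed caption search, using sentence-transformers
# embeddings and cosine similarity; model choice and function names are
# assumptions, not part of this PR.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(captions: list[str]) -> np.ndarray:
    # Embed every event caption once; normalized so dot product == cosine similarity.
    return encoder.encode(captions, normalize_embeddings=True)

def search(query: str, index: np.ndarray, captions: list[str], top_k: int = 5):
    # Embed the free-text query (e.g. "bearded man") and rank captions by similarity.
    query_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ query_vec
    best = np.argsort(-scores)[:top_k]
    return [(captions[i], float(scores[i])) for i in best]
```

A query like `search("bearded man", index, captions)` would then surface the closest event captions.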

netlify bot commented Dec 14, 2023

Deploy Preview for frigate-docs canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | 0fc2047 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/frigate-docs/deploys/657baf04d545d900081763ee |

blakeblackshear (Owner)

I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.

> Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results

This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.

I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

hunterjm (Contributor, Author) commented Dec 14, 2023

> I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.
>
> > Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results
>
> This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.
>
> I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

Yes, the draft as it stands doesn't add much utility beyond saving you from having to actually look at each image to infer those things, which you may not really do. I initially wrote a prompt to have it extract structured metadata, and it did really well with that too:

Respond with only a valid JSON object in the following format:
[{
"gender": "",
"age": "",
"hair": "",
"shirt": "",
"activity": "",
"delivery": ""
}]

Where `gender` is the estimated gender, `age` is the approximate age (infant, child, teen, adult, senior), `hair` is the hair color (blonde, brown, red, black, bald, hat, hoodie, etc), `shirt` is the color of the shirt, `activity` is the single word activity being performed (walking, running, soccer, carrying, etc), and `delivery` is the delivery company the individual works for (amazon, ups, fedex, usps, etc) or "none". For any attributes you are not certain of, respond with "unknown".

If there are multiple people, repeat as an array.

Describe the people in this image:

When I was thinking about use cases, though, natural language search was the most prominent one that came to mind, and unstructured text is actually better than structured data for that use case.
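As a hedged sketch of how the structured response from the prompt above might be consumed, the returned JSON (which Gemini sometimes wraps in a markdown code fence) could be parsed into per-person attribute dicts; the fence handling and the "unknown" fallback are assumptions, not part of this PR:

```python
# A small sketch of post-processing the structured prompt above. Gemini may
# wrap the JSON in a markdown code fence, so backticks are stripped before
# parsing; the fence handling and the "unknown" fallback are assumptions.
import json

FIELDS = ("gender", "age", "hair", "shirt", "activity", "delivery")

def parse_person_attributes(response_text: str) -> list[dict]:
    # Remove any surrounding code fence / language tag the model may have added.
    text = response_text.strip().strip("`")
    if text.startswith("json"):
        text = text[len("json"):]
    try:
        people = json.loads(text)
    except json.JSONDecodeError:
        return []
    # Keep only the fields requested in the prompt, defaulting to "unknown".
    return [{field: person.get(field, "unknown") for field in FIELDS} for person in people]
```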

hunterjm mentioned this pull request on Dec 15, 2023
hunterjm (Contributor, Author)

New direction outlined in #8980. When I get far enough I will open a new draft PR.

hunterjm closed this on Dec 15, 2023
reza8iucs

> I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.
>
> > Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results
>
> This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.
>
> I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

What if you want to build some automation based on the LLM's interpretation of the image? For example: if the description contains "group of men" and the event is in your backyard, trigger an action.
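A sketch of that kind of automation, assuming the caption ends up in the event's `sub_label` as in this draft: a small script could watch Frigate's `frigate/events` MQTT topic and react when the description matches a phrase. The broker address and the action taken are placeholders:

```python
# A sketch of the automation idea above, not an official Frigate feature:
# subscribe to Frigate's MQTT event stream and react when the LLM-generated
# description (assumed here to live in sub_label, per this draft) matches a
# phrase. Broker address and the action taken are placeholders.
import json

import paho.mqtt.client as mqtt

WATCH_PHRASE = "group of men"
WATCH_CAMERA = "backyard"

def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    after = event.get("after", {})
    description = (after.get("sub_label") or "").lower()
    if after.get("camera") == WATCH_CAMERA and WATCH_PHRASE in description:
        print(f"Triggering automation for event {after.get('id')}")
        # e.g. turn on lights, send a push notification, etc.

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt.local", 1883)  # placeholder broker address
client.subscribe("frigate/events")
client.loop_forever()
```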
