Implementation of Google Gemini captions #8959

Closed · wants to merge 6 commits

Conversation

hunterjm (Contributor)

Google recently released Gemini, which includes multi-modal support, and is available at a free tier of 60 queries/minute. That is more than any single individual would use for this particular implementation of event captioning.

My initial implementation adds a sub_label, populated with a caption from Gemini, to detections that do not currently have one (the default). I've done some initial prompt engineering for person and vehicle, and it works pretty well at detecting deliveries and characteristics.
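For illustration, a minimal sketch of what the captioning call could look like using the `google-generativeai` client and the `gemini-pro-vision` model; the prompt text and the `caption_thumbnail` helper are assumptions for this sketch, not the PR's actual code:

```python
# A minimal sketch, not the PR's actual code: caption an event thumbnail with
# Gemini so the result can be used as the sub_label. The prompt text and the
# caption_thumbnail helper are illustrative assumptions.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # free tier: 60 queries/minute
model = genai.GenerativeModel("gemini-pro-vision")

def caption_thumbnail(thumbnail_path: str, label: str) -> str:
    """Ask Gemini for a short caption of an event thumbnail."""
    image = Image.open(thumbnail_path)
    prompt = (
        f"Briefly describe the {label} in this image, noting clothing, "
        "activity, and any delivery company logos."
    )
    response = model.generate_content([prompt, image])
    return response.text.strip()
```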

[image attachment]

Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results.

I'm opening this as a draft currently because of a few things:

  1. We may want to implement a "description" field for events instead of re-using "sub_label"
  2. Frigate is built on the premise of local AI. This does send the thumbnails to an external Google API to be captioned.
  3. Google will be adding the ability to upload video via the API soon. It could be interesting to add a configuration to send thumbnail or the whole clip for captioning in the future.

LLMs aren't at a point where they can run well enough on the hardware Frigate normally runs on. I'd like to open the door for a conversation on whether something like this should even be included. I personally think enabling it for external cameras and implementing a robust embeddings search would be worth it.
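As a rough sketch of the embeddings-search idea mentioned above, captions could be embedded with an off-the-shelf text encoder and ranked by cosine similarity against a free-text query; the `sentence-transformers` model and the function names here are assumptions, not part of this PR:

```python
# A rough sketch of the proposed caption search, using sentence-transformers
# embeddings and cosine similarity; model choice and function names are
# assumptions, not part of this PR.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(captions: list[str]) -> np.ndarray:
    # Embed every event caption once; normalized so dot product == cosine similarity.
    return encoder.encode(captions, normalize_embeddings=True)

def search(query: str, index: np.ndarray, captions: list[str], top_k: int = 5):
    # Embed the free-text query (e.g. "bearded man") and rank captions by similarity.
    query_vec = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ query_vec
    best = np.argsort(-scores)[:top_k]
    return [(captions[i], float(scores[i])) for i in best]
```

A query like `search("bearded man", index, captions)` would then surface the closest event captions.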

netlify bot commented Dec 14, 2023

Deploy Preview for frigate-docs canceled.

| Name | Link |
|------|------|
| 🔨 Latest commit | 0fc2047 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/frigate-docs/deploys/657baf04d545d900081763ee |

blakeblackshear (Owner)

I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.

> Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results

This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.

I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

hunterjm (Contributor, Author) commented Dec 14, 2023

> I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.
>
> > Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results
>
> This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.
>
> I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

Yes, the draft as it stands doesn't add much utility beyond saving you from having to actually look at each image to infer those things, which you may not really do. I initially wrote a prompt to have it extract structured metadata, and it did really well with that too:

Respond with only a valid JSON object in the following format:
[{
"gender": "",
"age": "",
"hair": "",
"shirt": "",
"activity": "",
"delivery": ""
}]

Where `gender` is the estimated gender, `age` is the approximate age (infant, child, teen, adult, senior), `hair` is the hair color (blonde, brown, red, black, bald, hat, hoodie, etc), `shirt` is the color of the shirt, `activity` is the single word activity being performed (walking, running, soccer, carrying, etc), and `delivery` is the delivery company the individual works for (amazon, ups, fedex, usps, etc) or "none". For any attributes you are not certain of, respond with "unknown".

If there are multiple people, repeat as an array.

Describe the people in this image:

When I was thinking about use cases, though, natural language search was the most prominent one that came to mind, and unstructured text is actually better than structured data for that use case.
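As a hedged sketch of how the structured response from the prompt above might be consumed, the returned JSON (which Gemini sometimes wraps in a markdown code fence) could be parsed into per-person attribute dicts; the fence handling and the "unknown" fallback are assumptions, not part of this PR:

```python
# A small sketch of post-processing the structured prompt above. Gemini may
# wrap the JSON in a markdown code fence, so backticks are stripped before
# parsing; the fence handling and the "unknown" fallback are assumptions.
import json

FIELDS = ("gender", "age", "hair", "shirt", "activity", "delivery")

def parse_person_attributes(response_text: str) -> list[dict]:
    # Remove any surrounding code fence / language tag the model may have added.
    text = response_text.strip().strip("`")
    if text.startswith("json"):
        text = text[len("json"):]
    try:
        people = json.loads(text)
    except json.JSONDecodeError:
        return []
    # Keep only the fields requested in the prompt, defaulting to "unknown".
    return [{field: person.get(field, "unknown") for field in FIELDS} for person in people]
```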

hunterjm mentioned this pull request on Dec 15, 2023
hunterjm (Contributor, Author)

New direction outlined in #8980. When I get far enough I will open a new draft PR.

hunterjm closed this on Dec 15, 2023
reza8iucs

> I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.
>
> > Eventually, we can tokenize sub-labels and implement an embeddings search in the events page allowing you to type phrases like "bearded man" to show similar results
>
> This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard" which would be interesting.
>
> I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc

What if you want to build some automation based on the LLM's interpretation of the image? For example: if the description contains "group of men" and the event is in your backyard, trigger an action.
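A sketch of that kind of automation, assuming the caption ends up in the event's `sub_label` as in this draft: a small script could watch Frigate's `frigate/events` MQTT topic and react when the description matches a phrase. The broker address and the action taken are placeholders:

```python
# A sketch of the automation idea above, not an official Frigate feature:
# subscribe to Frigate's MQTT event stream and react when the LLM-generated
# description (assumed here to live in sub_label, per this draft) matches a
# phrase. Broker address and the action taken are placeholders.
import json

import paho.mqtt.client as mqtt

WATCH_PHRASE = "group of men"
WATCH_CAMERA = "backyard"

def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    after = event.get("after", {})
    description = (after.get("sub_label") or "").lower()
    if after.get("camera") == WATCH_CAMERA and WATCH_PHRASE in description:
        print(f"Triggering automation for event {after.get('id')}")
        # e.g. turn on lights, send a push notification, etc.

client = mqtt.Client()
client.on_message = on_message
client.connect("mqtt.local", 1883)  # placeholder broker address
client.subscribe("frigate/events")
client.loop_forever()
```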
