Implementation of Google Gemini captions #8959
Conversation
I am struggling to wrap my head around the utility of this right now. It takes me far longer to read the text description of the image than it does to just look at the image. My brain is much faster at inferring all of that information than any LLM.
This makes it more interesting. Something like searching in Google Photos. You have to know what to search for ahead of time, though. You could search for "people walking dogs" or "yellow cars" or "a person wearing a hoodie" or "dogs pooping in the yard", which would be interesting. I think it would be more interesting to find a way to generate some structured metadata: aggressive/suspicious/etc.
Yes, the utility of the draft as it stands isn't much, except that you would otherwise have to actually look at each image to infer those things, which you may not really do. I initially wrote a prompt to have it extract structured metadata, and it did really well with that too.
When I was thinking about use cases though, natural language search was the prominent one that came to mind, and unstructured text is actually better than structured data for that use case.
New direction outlined in #8980. When I get far enough I will open a new draft PR.
What if you want to build some automation based on the LLM's interpretation of the image? For example, if the description contains "group of men" in your backyard, do this.
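The automation idea above could be as simple as phrase-matching against the generated caption. A minimal sketch (this is an illustration of the suggestion, not an existing Frigate feature; the function name and trigger logic are assumptions):

```python
def should_trigger(description: str, phrases: list[str]) -> bool:
    """Return True if any configured phrase appears in the LLM caption.

    Case-insensitive substring match; a real automation layer might
    instead use regex or semantic similarity.
    """
    text = description.lower()
    return any(phrase.lower() in text for phrase in phrases)


# Example: fire an automation when a caption mentions a group of men.
# should_trigger("A group of men standing in the backyard", ["group of men"])
# -> True
```

Something like this could run in a Home Assistant automation or an MQTT consumer that watches event updates and reacts when the caption matches.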
Google recently released Gemini, which includes multi-modal support, and is available at a free tier of 60 queries/minute. That is more than any single individual would use for this particular implementation of event captioning.
My initial implementation adds a sub_label, with a caption from Gemini, to detections that do not currently have one (by default). I've done some initial prompt engineering for `person` and `vehicle`, and it works pretty well at detecting deliveries and characteristics. Eventually, we can tokenize sub-labels and implement an embeddings search on the events page, allowing you to type phrases like "bearded man" to show similar results.
I'm opening this as a draft currently because of a few things:
LLMs aren't at a point where they can run on hardware Frigate normally runs on and be good enough. I'd like to open the door for a conversation on if something like this should even be included. I personally think enabling for external cameras and implementing a robust embeddings search would be worth it.
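To make the embeddings-search idea concrete, here is a toy sketch of ranking events by similarity between a query and their captions. A real implementation would use a proper sentence-embedding model; a bag-of-words cosine similarity stands in here so the ranking logic is self-contained. All names are illustrative, not Frigate's.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(captions: dict[str, str], query: str, top_k: int = 3) -> list[str]:
    """Return event ids ranked by caption similarity to the query."""
    q = embed(query)
    ranked = sorted(
        captions.items(),
        key=lambda kv: cosine(q, embed(kv[1])),
        reverse=True,
    )
    return [event_id for event_id, _ in ranked[:top_k]]
```

With real embeddings, typing "bearded man" into the events page would surface events whose captions are semantically close, even without an exact word match.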