Binary embeddings #254

Merged: 6 commits, Sep 12, 2023

@simonw simonw commented Sep 9, 2023

Refs:

Still needed:

  • Decide what to do about --store - should it store the binary content in a new content_blob column? See these notes.
  • The model.supports_binary boolean needs to be respected - it should raise errors if you attempt to embed binary content against a text-only model
  • Should there be a way to mark a model as ONLY accepting binary content? Probably yes - though the models I have played with so far like CLIP and ImageBind are happy to accept either kind of content.
  • Lots of tests, using a mock embedding model that can handle binary content.
  • Build a plugin that uses this (I have a draft llm-clip one already).
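
The `supports_binary` check and the mock model mentioned in the list above could be sketched roughly as follows. This is a self-contained illustration, not the code in this PR: `SketchEmbeddingModel`, `TextOnlyModel`, `MockBinaryModel` and `embed_multi` as written here are hypothetical stand-ins, and only the `supports_binary` attribute name comes from the discussion.

```python
class SketchEmbeddingModel:
    """Hypothetical stand-in for an embedding model base class."""

    model_id = "sketch"
    supports_binary = False

    def embed_batch(self, items):
        raise NotImplementedError

    def embed_multi(self, items):
        items = list(items)
        # Refuse binary content up front if the model is text-only
        if not self.supports_binary:
            for item in items:
                if isinstance(item, bytes):
                    raise ValueError(
                        f"Model {self.model_id} does not support binary content"
                    )
        return self.embed_batch(items)


class TextOnlyModel(SketchEmbeddingModel):
    model_id = "text-only"

    def embed_batch(self, items):
        # Fake one-dimensional vector: the length of the text
        return [[float(len(item))] for item in items]


class MockBinaryModel(SketchEmbeddingModel):
    model_id = "mock-binary"
    supports_binary = True

    def embed_batch(self, items):
        # Deterministic fake vectors: [length, 1.0 if bytes else 0.0]
        return [
            [float(len(item)), 1.0 if isinstance(item, bytes) else 0.0]
            for item in items
        ]
```

A mock like `MockBinaryModel` gives tests predictable vectors for both text and bytes, while `TextOnlyModel` exercises the error path.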

@simonw added the enhancement and embeddings labels Sep 9, 2023

simonw commented Sep 9, 2023

Here's my rough CLIP plugin:

import io

import llm
from PIL import Image
from sentence_transformers import SentenceTransformer


@llm.hookimpl
def register_embedding_models(register):
    register(ClipEmbeddingModel())


class ClipEmbeddingModel(llm.EmbeddingModel):
    model_id = "clip"
    supports_binary = True

    def __init__(self):
        # The model is loaded lazily, on the first call to embed_batch()
        self._model = None

    def embed_batch(self, items):
        # Embeds a mix of text strings and binary images
        if self._model is None:
            self._model = SentenceTransformer("clip-ViT-B-32")

        to_embed = []

        for item in items:
            if isinstance(item, bytes):
                # Treat byte strings as image data and convert to an Image object
                to_embed.append(Image.open(io.BytesIO(item)))
            elif isinstance(item, str):
                to_embed.append(item)
            else:
                # Skipping an item silently would misalign inputs and outputs
                raise TypeError(f"Cannot embed item of type {type(item).__name__}")

        embeddings = self._model.encode(to_embed)
        # Convert numpy floats to plain Python floats
        return [[float(num) for num in embedding] for embedding in embeddings]

Though looking at this, it's really just a sentence transformer that can accept both binary and text. It could go in llm-sentence-transformers.

I could still have llm-clip be a plugin that just depends on this and then registers the right model.


simonw commented Sep 9, 2023

I'm going to try the content_blob column and see how it feels.
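
One way the `content_blob` idea could look, as a minimal sketch using the standard library `sqlite3` module. The table layout and the `store_item` helper are assumptions made for illustration; only the `content_blob` column name comes from this thread, and this is not necessarily the schema the PR ended up with.

```python
import sqlite3


def store_item(db, item_id, embedding, content):
    """Store text in `content` and bytes in `content_blob` (hypothetical helper)."""
    is_binary = isinstance(content, bytes)
    db.execute(
        "insert or replace into embeddings "
        "(id, embedding, content, content_blob) values (?, ?, ?, ?)",
        (
            item_id,
            embedding,
            None if is_binary else content,
            content if is_binary else None,
        ),
    )


# Hypothetical schema: one text column and one blob column for stored content
db = sqlite3.connect(":memory:")
db.execute(
    "create table embeddings ("
    "id text primary key, embedding blob, content text, content_blob blob)"
)
```

With this layout exactly one of `content` and `content_blob` is populated per row, so existing text-only queries against `content` keep working.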


simonw commented Sep 10, 2023

If a model only supports binary and does not support text, maybe we can have it treat all input as --binary even if you forget to use that flag?
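
A rough sketch of that fallback, assuming a hypothetical `supports_text` attribute (nothing in this thread defines one) and a hypothetical `prepare_input` helper that receives the raw bytes read from a file or stdin:

```python
def prepare_input(raw, model, binary=False):
    """Return bytes for binary input, otherwise decoded text.

    `supports_text` is an assumed attribute; models without it are
    treated as accepting text.
    """
    binary_only = model.supports_binary and not getattr(model, "supports_text", True)
    if binary or binary_only:
        # Pass the bytes through untouched, even if --binary was forgotten
        return raw
    return raw.decode("utf-8")
```

Here a binary-only model always receives bytes, while models that accept both kinds of content still honor the flag.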


simonw commented Sep 12, 2023

I'm landing this. Future tests will happen as I write the plugins.

@simonw simonw marked this pull request as ready for review September 12, 2023 01:57
@simonw simonw linked an issue Sep 12, 2023 that may be closed by this pull request
@simonw simonw merged commit 52cec13 into main Sep 12, 2023
21 checks passed
@simonw simonw deleted the binary-embeddings branch September 12, 2023 01:58
@simonw simonw mentioned this pull request Sep 12, 2023
3 tasks
simonw added a commit that referenced this pull request Sep 12, 2023
simonw added a commit that referenced this pull request Sep 12, 2023
Successfully merging this pull request may close these issues.

Support for embedding binary files