How to handle embedding model download yourself with ChromaDB client #222

andreped · 2024-01-31T16:02:03Z

This line of code is causing some headache:

vanna/src/vanna/chromadb/chromadb_vector.py

Line 13 in 8c5e05a

default_ef = embedding_functions.DefaultEmbeddingFunction()

The problem is that if you run ChromaDB in a Docker container, you most likely want to handle all downloads yourself and give ChromaDB and Vanna very limited rights over what they may write to disk and where. I tried downloading the model myself into the appropriate location (I found where ChromaDB tries to save the model by reading its source code), but this raises an error stating that the model is already on disk. I expected a pre-downloaded model to be accepted, not to cause an error.

Another problem is that Vanna's ChromaDB integration currently uses PersistentClient, meaning that an .sqlite3 file will be saved on disk.

Hence, I would like the following features:

(1) Allow using an in-memory vector database with ChromaDB, so that nothing is saved to disk.
(2) Allow pre-downloading the embedding model, so that ChromaDB does not need internet access to work in Vanna.

Issue was observed on: Ubuntu 22.04 using vanna==0.0.36 and Python 3.10.8 (after also fixing the sqlite3 issue Chroma has for Python 3.10).

andreped · 2024-02-01T09:11:02Z

@zainhoda I could attempt to make a PR to address both of these features. As they both affect how ChromaDB is initialized and set up, it makes sense to do this in the same PR. Thoughts?

Not sure what the best way to do this is though.

For (1), it should be possible to enable or disable PersistentClient in order to use in-memory storage. This can be achieved by adding an argument and handling that logic in the __init__() of the ChromaDB class that subclasses the Vanna base class.
For (2), it seems that ChromaDB downloads the model by default, or at least attempts to. Maybe there is an API to prevent this when the model is already on disk. In the __init__(), we could check for this and only attempt the download if the file is not found. Perhaps a solution is to just use a try/except, but there is likely a cleaner way of doing it?
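
A minimal sketch of what (1) could look like, assuming a hypothetical `in_memory` config key (not an existing Vanna option) and hypothetical helper names `choose_client_kind` and `make_chroma_client`. It relies only on `chromadb.EphemeralClient` and `chromadb.PersistentClient`, and imports `chromadb` lazily so the selection logic can be exercised without it:

```python
from typing import Any, Optional, Tuple


def choose_client_kind(config: Optional[dict]) -> Tuple[str, Optional[str]]:
    """Return ("in-memory", None) or ("persistent", path) for a config dict."""
    config = config or {}
    # "in_memory" is a hypothetical key used only for this sketch.
    if config.get("in_memory", False):
        return ("in-memory", None)
    return ("persistent", config.get("path", "."))


def make_chroma_client(config: Optional[dict] = None) -> Any:
    """Build the Chroma client that matches the config."""
    import chromadb  # lazy import: only needed when a client is actually built

    kind, path = choose_client_kind(config)
    if kind == "in-memory":
        # Keeps the vector store in memory for the lifetime of the process.
        return chromadb.EphemeralClient()
    # Stores the database on disk under `path`.
    return chromadb.PersistentClient(path=path)
```

Keeping the decision in a small pure function means the default (persistent storage in the current directory) stays unchanged for existing users, while an opt-in flag switches to in-memory storage.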

I will hold off on making the PR until we have agreed on a solution :]

relic-yuexi · 2024-02-02T07:16:23Z

Just use this:

from chromadb.utils import embedding_functions
# device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-downloaded embedding model from a local path.
bge_embeddingFunction = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/AI-ModelScope/bge-large-zh-v1.5", device="cuda", normalize_embeddings=True
)

config = {"embedding_function": bge_embeddingFunction, "path": "/root/chatdb/chromadb"}
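
For a locked-down container it may help to fail early when the pre-downloaded model is missing, instead of letting a download attempt start. The sketch below is one way to do that; the helper name `load_local_embedding_function` is hypothetical, and setting `HF_HUB_OFFLINE` is an extra precaution that is assumed to take effect only if it is set before the Hugging Face libraries are imported:

```python
import os
from pathlib import Path
from typing import Any


def load_local_embedding_function(model_dir: str, device: str = "cpu") -> Any:
    """Build an embedding function from a model directory that is already on disk."""
    model_path = Path(model_dir)
    if not model_path.is_dir():
        # Fail before any library gets a chance to attempt a download.
        raise FileNotFoundError(f"Embedding model directory not found: {model_path}")

    # Ask the Hugging Face libraries to stay offline (assumption: they have
    # not been imported yet, otherwise this setting may be ignored).
    os.environ.setdefault("HF_HUB_OFFLINE", "1")

    # Lazy import so the directory check above works without chromadb installed.
    from chromadb.utils import embedding_functions

    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=str(model_path),
        device=device,
        normalize_embeddings=True,
    )
```

The returned object can then be placed under the "embedding_function" key of the config shown above.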

relic-yuexi · 2024-02-02T07:17:39Z

in /src/vanna/chromadb/chromadb_vector.py

        if config is not None:
            path = config.get("path", ".")
            self.embedding_function = config.get("embedding_function", default_ef)
        else:
            path = "."
            self.embedding_function = default_ef
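
The fallback behaviour of that snippet can be checked in isolation. This is a sketch only: `resolve_path_and_ef` is a hypothetical stand-alone function that mirrors the logic above, and plain objects stand in for real embedding functions:

```python
from typing import Any, Optional, Tuple


def resolve_path_and_ef(config: Optional[dict], default_ef: Any) -> Tuple[str, Any]:
    """Mirror the config handling: fall back to "." and the default embedding function."""
    if config is None:
        return ".", default_ef
    return config.get("path", "."), config.get("embedding_function", default_ef)


# Stand-ins for real embedding functions, used only to show which one is picked.
default_ef = object()
custom_ef = object()

no_config = resolve_path_and_ef(None, default_ef)
with_custom = resolve_path_and_ef(
    {"embedding_function": custom_ef, "path": "/root/chatdb/chromadb"}, default_ef
)
path_only = resolve_path_and_ef({"path": "/data"}, default_ef)
```

With no config, or a config that lacks the "embedding_function" key, the default is used; a supplied embedding function always takes precedence.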

andreped · 2024-02-02T08:01:57Z

Sure, I can give it a go and see if it resolves the issue I was observing. But this needs to be added to Vanna itself, as I do not control this logic when using the library. Thanks for the suggestion, @relic-yuexi :]

relic-yuexi · 2024-02-04T09:51:45Z

Actually, I have already made a PR for it.

andreped · 2024-02-04T12:42:59Z

@relic-yuexi I do not see a PR for this yet. Were you going to make one? If not, I can give it a go, as I have a specific use case I can test it for directly.

relic-yuexi · 2024-02-05T02:30:07Z

8047a38

There it is. You can use this code to load an embedding model that you have already downloaded:

from chromadb.utils import embedding_functions

# Load the pre-downloaded embedding model from a local path.
bge_embeddingFunction = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/AI-ModelScope/bge-large-zh-v1.5", device="cuda", normalize_embeddings=True
)

Then pass the config to ChromaDB_VectorStore:

from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

config = {"embedding_function": bge_embeddingFunction, "path": "/root/chatdb/chromadb"}
vs = ChromaDB_VectorStore(config)

After that, everything will use the BGE embedding model, or whichever model you have downloaded.

andreped · 2024-02-05T08:18:31Z

@relic-yuexi Sorry, I misunderstood your first reply. I can run a test to see if this addresses issue (2), which is the most critical one to fix. Cheers! :]

roshanr10 · 2024-02-05T20:44:45Z

Is there an expected configuration pattern for using an in-memory vector database with ChromaDB, so that nothing is saved to disk?

I'd love to use an in-memory version, as I intend to primarily load DDL from the database anyway and would reload from scratch every time.

relic-yuexi · 2024-02-06T06:37:02Z

Maybe you can use FAISS. It is just a vector search tool that works in memory, as you want.

https://github.com/facebookresearch/faiss

andreped · 2024-02-21T07:59:34Z

@roshanr10 ChromaDB supports in-memory storage, but the Chroma instance set up by Vanna does not. There is no API in Vanna to use in-memory storage instead of the persistent store for Chroma:
https://github.com/vanna-ai/vanna/blob/b3d46dc5c64d48bd7b8255ba557d1b12ec651903/src/vanna/chromadb/chromadb_vector.py#L27C30-L27C55

@zainhoda I can make a PR now to address this as I need this myself.

EDIT: I made a separate issue about this, @roshanr10. Then I can tag a PR to it if I make one.

andreped · 2024-02-21T17:11:01Z

@roshanr10 I have now made a PR to add in-memory support. PR: #250

andreped · 2024-02-23T14:19:15Z

Merged in #250.

Your feature request will be part of the upcoming release, @roshanr10 :]

As my original question has been resolved, I am closing this issue.

If you have issues or requests related to this one, feel free to open a new issue and reference this one.

andreped closed this as completed Feb 23, 2024
