How to handle embedding model download yourself with ChromaDB client #222

Closed
andreped opened this issue Jan 31, 2024 · 13 comments

Comments

@andreped
Contributor

This line of code is causing some headache:

default_ef = embedding_functions.DefaultEmbeddingFunction()

The problem is that if you want to run ChromaDB in a Docker container, you most likely want to handle all downloads yourself and give ChromaDB and Vanna very limited rights regarding what they may write to disk and where. If you download the model yourself to the appropriate location (after reading some ChromaDB source code, I found where it tries to save the model), you get an error stating that the model is already on disk. I expected this case to be fine, but it yields an error.

Another problem is that Vanna's ChromaDB setup currently uses PersistentClient, meaning that an .sqlite3 file is saved to disk.

Hence, I would like the following features:

  • Allow using an in-memory vector database with ChromaDB, so that nothing is saved to disk
  • Allow pre-downloading the embedding model, so that ChromaDB does not need internet access to work in Vanna

Issue was observed on: Ubuntu 22.04 using vanna==0.0.36 and Python 3.10.8 (after also fixing the sqlite3 issue Chroma has for Python 3.10).

@andreped
Contributor Author

andreped commented Feb 1, 2024

@zainhoda I could attempt to make a PR to address both of these features. As they both affect how ChromaDB is initialized and set up, it makes sense to do this in the same PR. Thoughts?

I am not sure what the best way to do this is, though.

  • For (1), it should be possible to enable or disable PersistentClient in order to use in-memory storage. This can be achieved by adding an argument and handling that logic in the __init__() of the ChromaDB class that subclasses the Vanna Base class.
  • For (2), it seems that ChromaDB downloads the model by default, or at least attempts to. Maybe there is an API to prevent this when the model is already on disk. In the __init__(), we could check for this and only attempt the download if the file is not found. Perhaps a try/except would do, but there is likely a cleaner way of doing it?

I will wait with making the PR, until we have agreed on a solution :]
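
The check proposed in (2) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Vanna's or ChromaDB's implementation: `pick_model_source`, `build_embedding_function`, and `FALLBACK_MODEL_NAME` are hypothetical names, and "the directory exists and is not empty" is a deliberately simple stand-in for "the model is already on disk". The ChromaDB import is deferred so that the check itself can run without ChromaDB installed.

```python
from pathlib import Path

# Model name to fall back to; loading it by name may trigger a download.
FALLBACK_MODEL_NAME = "all-MiniLM-L6-v2"


def pick_model_source(model_dir, fallback=FALLBACK_MODEL_NAME):
    """Return the local directory if it looks populated, else a model name."""
    if model_dir is not None:
        path = Path(model_dir)
        # Weak heuristic: any non-empty directory counts as "already downloaded".
        if path.is_dir() and any(path.iterdir()):
            return str(path)
    return fallback


def build_embedding_function(model_dir=None, device="cpu"):
    """Create an embedding function that prefers a pre-downloaded model."""
    # Imported lazily so pick_model_source() works without ChromaDB installed.
    from chromadb.utils import embedding_functions

    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=pick_model_source(model_dir),
        device=device,
    )
```

With a helper like this, the resulting embedding function could be handed to the vector store through its config, so no try/except around the download would be needed.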

@relic-yuexi
Contributor

Just use this:

from chromadb.utils import embedding_functions
# device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the embedding model from the path of an already downloaded model.
bge_embeddingFunction = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/AI-ModelScope/bge-large-zh-v1.5", device="cuda", normalize_embeddings=True
)

# Pass the embedding function and the ChromaDB storage path via the config.
config = {"embedding_function": bge_embeddingFunction, "path": "/root/chatdb/chromadb"}

@relic-yuexi
Contributor

In /src/vanna/chromadb/chromadb_vector.py:

        if config is not None:
            path = config.get("path", ".")
            self.embedding_function = config.get("embedding_function", default_ef)
        else:
            path = "."
            self.embedding_function = default_ef
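
The fallback logic in the snippet above can be exercised in isolation. The following is a minimal sketch, not Vanna's actual code: `resolve_settings`, `stub_default_ef`, and `local_ef` are hypothetical names, and the two stub functions only imitate the callable shape of an embedding function (a list of texts in, a list of vectors out).

```python
def resolve_settings(config, default_ef):
    """Mirror the config fallback shown above as a plain function."""
    if config is not None:
        path = config.get("path", ".")
        embedding_function = config.get("embedding_function", default_ef)
    else:
        path = "."
        embedding_function = default_ef
    return path, embedding_function


def stub_default_ef(texts):
    # Stand-in for the default embedding function (which would need a download).
    return [[0.0] for _ in texts]


def local_ef(texts):
    # Stand-in for an embedding function backed by a pre-downloaded model.
    return [[float(len(text))] for text in texts]


# A user-supplied embedding function takes precedence over the default.
path, ef = resolve_settings({"embedding_function": local_ef}, stub_default_ef)
```

Passing no config at all, or a config without an `embedding_function` key, falls back to the default, which is the case that requires a download.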

@andreped
Contributor Author

andreped commented Feb 2, 2024

Sure, I can give it a go and see if it resolves the issue I was observing. But this needs to be added to Vanna itself, as I do not control this logic when using the library. Thanks for the suggestion, @relic-yuexi :]

@relic-yuexi
Contributor

Actually, I have already made a PR for it.

@andreped
Contributor Author

andreped commented Feb 4, 2024

@relic-yuexi I do not see a PR for this yet. Were you going to make one? If not, I can give it a go, as I have a specific use case I can test it for directly.

@relic-yuexi
Contributor

8047a38

There. You can use this code to load the embedding model you have downloaded:

from chromadb.utils import embedding_functions

bge_embeddingFunction = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/AI-ModelScope/bge-large-zh-v1.5", device="cuda", normalize_embeddings=True
)

Then pass the config to ChromaDB_VectorStore:

from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

config = {"embedding_function": bge_embeddingFunction, "path": "/root/chatdb/chromadb"}
vs = ChromaDB_VectorStore(config)

Then everything will use the bge embedding model, or whichever model you have downloaded.

@andreped
Contributor Author

andreped commented Feb 5, 2024

@relic-yuexi Sorry, I misunderstood your first reply. I can run a test to see if this addresses issue (2), which is the most critical one to fix. Cheers! :]

@roshanr10

Is there an expected configuration pattern for using an in-memory vector database with ChromaDB, so that nothing is saved to disk?

I'd love to use an in-memory version, as I intend to primarily load DDL from the database anyway and would reload from scratch every time.

@relic-yuexi
Contributor

Maybe you can use Faiss. It is just a vector search tool that works in memory, as you want.

https://github.com/facebookresearch/faiss

@andreped
Contributor Author

andreped commented Feb 21, 2024

@roshanr10 ChromaDB supports in-memory storage, but the Chroma instance set up by Vanna does not. There is no API in Vanna to use an in-memory store instead of a persistent one for Chroma:
https://github.com/vanna-ai/vanna/blob/b3d46dc5c64d48bd7b8255ba557d1b12ec651903/src/vanna/chromadb/chromadb_vector.py#L27C30-L27C55

@zainhoda I can make a PR now to address this as I need this myself.


EDIT: I made a separate issue about this, @roshanr10. Then I can tag a PR to it if I make one.
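
For readers unfamiliar with the distinction, choosing between the two kinds of Chroma client could look roughly like this. It is a sketch under assumptions, not the code that ended up in Vanna: the `client` config key and the helper names `resolve_client_settings` and `make_chroma_client` are hypothetical, while `chromadb.EphemeralClient` and `chromadb.PersistentClient` are ChromaDB's own constructors. ChromaDB is imported lazily so the selection logic can be checked without it installed.

```python
def resolve_client_settings(config=None):
    """Decide which kind of Chroma client to create from a config dict."""
    config = config or {}
    kind = config.get("client", "persistent")  # hypothetical config key
    if kind == "in-memory":
        # No storage path is needed for an in-memory client.
        return {"kind": "in-memory", "path": None}
    if kind == "persistent":
        return {"kind": "persistent", "path": config.get("path", ".")}
    raise ValueError(f"Unsupported client kind: {kind!r}")


def make_chroma_client(config=None):
    """Create the Chroma client selected by resolve_client_settings()."""
    # Imported lazily so resolve_client_settings() works without ChromaDB.
    import chromadb

    settings = resolve_client_settings(config)
    if settings["kind"] == "in-memory":
        # Nothing is written to disk; data is gone when the process exits.
        return chromadb.EphemeralClient()
    # Persists its data (including an .sqlite3 file) under the given path.
    return chromadb.PersistentClient(path=settings["path"])
```

Keeping "persistent" as the default in such a helper would preserve the existing behaviour for users who pass no config.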

@andreped
Contributor Author

@roshanr10 I have now made a PR to add in-memory support. PR: #250

@andreped
Contributor Author

andreped commented Feb 23, 2024

Merged in #250.

Your feature request will be part of the upcoming release, @roshanr10 :]

As my original question has been resolved, I am closing this issue.

If you have related issues or requests, feel free to open a new issue and reference this one.
