Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to retrieve fewer question-answer pairs from chromaDB #218

Closed
philippschw opened this issue Jan 30, 2024 · 7 comments · Fixed by #268
Closed

How to retrieve fewer question-answer pairs from chromaDB #218

philippschw opened this issue Jan 30, 2024 · 7 comments · Fixed by #268

Comments

@philippschw
Copy link

I am running into token limits, is it possible to limit the number of question-answer pairs that are being retrieved when running the function get_similar_question_sql ?

I see chromadb has the argument: n_results=10. Just struggling to actually make use of it in the vanna.ai framework.

Thanks!

@Nuclear6
Copy link

I also found this problem. It seems that irrelevant items can be retrieved. Is there any improvement method here?

@andreped
Copy link
Contributor

andreped commented Feb 27, 2024

@philippschw Hmm, I tried looking for this, and I also struggle to see where in the code this is actually set. Maybe setting n_results is a deprecated feature, or just really hidden in the VannaBase class? Any comments, @zainhoda?

I know this does not answer the original question, but a PR to add support for setting max_tokens was just merged yesterday.
See bb0b4cc. Previously this was hardcoded to 500 for the OpenAI client, which would explain why you are running out of tokens.

To test if the latest changes resolve your issue, install Vanna from source:

pip install git+https://github.com/vanna-ai/vanna.git

and then you can set max_tokens through the config argument when initializing the client.

@zainhoda
Copy link
Contributor

zainhoda commented Feb 27, 2024

Chroma has an option for n_results with the default being 10
https://docs.trychroma.com/reference/Collection#query

Since we don't specify the n_results, it uses the default:
https://github.com/vanna-ai/vanna/blob/main/src/vanna/chromadb/chromadb_vector.py#L230-L232

Ideally we should have some config params to set n_ddl, n_documentation, and n_question_sql to retrieve a specific number of items.

In the meantime, as an end user you can do this without forking the package by simply overriding the get_similar_question_sql function when you set up the class

from vanna.openai.openai_chat import OpenAI_Chat
from vanna.chromadb.chromadb_vector import ChromaDB_VectorStore

class MyVanna(ChromaDB_VectorStore, OpenAI_Chat):
    def __init__(self, config=None):
        ChromaDB_VectorStore.__init__(self, config=config)
        OpenAI_Chat.__init__(self, config=config)

    def get_similar_question_sql(self, question: str, **kwargs) -> list:
        return ChromaDB_VectorStore._extract_documents(
            self.sql_collection.query(
                query_texts=[question],
                n_results=5, # Or whatever number you want
            )
        )

vn = MyVanna(config={'api_key': 'sk-...', 'model': 'gpt-3.5-turbo'})

@andreped
Copy link
Contributor

andreped commented Feb 27, 2024

Ideally we should have some config params to set n_ddl, n_documentation, and n_question_sql to retrieve a specific number of items.

@zainhoda I can make a PR to add support for this to set the n_results through the config argument, if you want? :]

@zainhoda
Copy link
Contributor

@andreped you are the best contributor an open-source project could ever hope for 🚀

if you have time and inclination, feel free -- I'll batch all of your PRs into the next release

@andreped
Copy link
Contributor

andreped commented Feb 27, 2024

Ideally we should have some config params to set n_ddl, n_documentation, and n_question_sql to retrieve a specific number of items.

@zainhoda I just saw this. Do we only want to set the n_results for the self.sql_collection.query() only, or did we want to be able to set this for the get_related_ddl() and get_related_documentation() methods as well?

If we want to set n_results for all three, should it be three different n_results values that can be set, or should it just be the same for all?

@andreped
Copy link
Contributor

I made a PR, feel free to review directly there, @zainhoda :]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants