-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Colbert local mode support both as retriever and reranker. #797
Conversation
Fixes and Features in the PR The notebook to follow with all the changes is here
[Prediction(
pid=[6, 48, 74, 47, 33],
rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
),
Prediction(
pid=[33, 0, 47, 74, 16],
rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
)]
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'
colbert_reranker = dspy.ColBERTv2RerankerLocal(
checkpoint='colbert-ir/colbertv2.0',colbert_config=colbert_config)
dspy.settings.configure(rm=colbert_retriever,reranker=colbert_reranker)
retrieve_rerank = dspy.RetrieveThenRerank(k=5)
pred = retrieve_rerank(
["What is the meaning of life?","Meaning of pain?"]
) The @okhat, @arnavsinghvi11 @CShorten. Please review the PR and suggest feedbacks on improving it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice work 👏🏻👏🏻
Thanks @Josephrp |
dspy/retrieve/retrieve.py
Outdated
# print(queries) | ||
# TODO: Consider removing any quote-like markers that surround the query too. | ||
k = k if k is not None else self.k | ||
passages = dsp.retrieveRerankEnsemble(queries, k=k,**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we have maintain the forward pass call from before and abstract the repetitive code from below within the forward pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I was not able to understand this. @arnavsinghvi11
Do you want to have a common utility function for both Retrieve
and RetrieveTheRerank
to process the returned documents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Athe-kunal , yeah it seemed like there is some repetitive code in both forward passes that can be abstracted out for the different retriever types. let me know if this change makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @arnavsinghvi11
I have abstracted the repetitive code part. However there are some nuances in the multi-query retriever, hence I didn't make a helper function for it. But for a single query, I have added a helper function single_query_passage
. Please let me know if I need to make some other changes.
Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments for the PR to mainly address some code cleanup changes. It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks! |
Thanks a ton, @arnavsinghvi11 for this detailed feedback on my PR. It certainly helps me to be a better contributor. I have made the changes and also addressed some confusion, please o help me out there. I will work on the documentation changes for Colbert. Looking forward to more collaboration |
@arnavsinghvi11 |
@arnavsinghvi11 Can you please review it? |
Hi @arnavsinghvi11 |
dspy/retrieve/retrieve.py
Outdated
# print(queries) | ||
# TODO: Consider removing any quote-like markers that surround the query too. | ||
k = k if k is not None else self.k | ||
passages = dsp.retrieveRerankEnsemble(queries, k=k,**kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Athe-kunal , yeah it seemed like there is some repetitive code in both forward passes that can be abstracted out for the different retriever types. let me know if this change makes sense
dsp/primitives/search.py
Outdated
@@ -9,17 +10,21 @@ def retrieve(query: str, k: int, **kwargs) -> list[str]: | |||
"""Retrieves passages from the RM for the query and returns the top k passages.""" | |||
if not dsp.settings.rm: | |||
raise AssertionError("No RM is loaded.") | |||
if not dsp.settings.reranker: | |||
warnings.warn("If you want to use the Reranker, please use dspy.RetrieveThenRerank") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dspy/dspy/evaluate/evaluate.py
Line 56 in d09d984
"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error.", |
|
||
def retrieveEnsemble(queries: list[str], k: int, by_prob: bool = True,**kwargs) -> list[str]: | ||
"""Retrieves passages from the RM for each query in queries and returns the top k passages | ||
based on the probability or score. | ||
""" | ||
if not dsp.settings.rm: | ||
raise AssertionError("No RM is loaded.") | ||
if dsp.settings.reranker: | ||
return retrieveRerankEnsemble(queries, k) | ||
if not dsp.settings.reranker: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as with above -
dspy/dspy/evaluate/evaluate.py
Line 56 in d09d984
"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error.", |
Hi @Athe-kunal , just left some follow up comments. it seems that there are still some leftover comments to address. feel free to discuss them more directly here as needed. |
Hi @arnavsinghvi11 |
Great PR! Hope it gets merged soon, I find myself needing this. |
@arnavsinghvi11 |
Hi @Athe-kunal , thanks again. I'd love to merge this PR but unfortunately, it breaks some existing caches we have in the repository, particularly the intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieved outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits). If you could make changes to this PR that enable the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. lmk if this makes sense! |
Thanks for the suggestion @arnavsinghvi11 |
Hi @arnavsinghvi11 |
Hi @Athe-kunal , I would firstly recommend keeping the In more simple terms, the Some steps I suggest with this is reverting back to the original behavior of search and retrieve and then integrating From there, the next step would be to mirror the existing colbertv2 cache functions to align with the local mode being introduced. Let me know if this all makes sense! |
Hi @arnavsinghvi11 |
Not a direct answer, but HFClientTGI supports caching for models hosted locally (REST API). Maybe it would benefit here as well. But caching is not a requirement with this PR. Ensuring backward compatibility with the existing |
Got it @arnavsinghvi11, working on it Also, if you don't mind, can you share your discord handle on the DSPy discord channel, it would be helpful to interact there too. |
Hi @arnavsinghvi11 |
Hi @Athe-kunal , I don't see any changes to the PR. could you check if they are pushed? my username is bigman11 btw |
Hi @arnavsinghvi11
This PR has become way too much intractable, I will work on ColbertLocal caching in another PR. Can you review this one @arnavsinghvi11 ? |
Thanks @Athe-kunal for your patience with this PR! The changes no longer break the existing |
Hi @arnavsinghvi11 Thanks again for helping me in this PR. |
Merged. Thanks again @Athe-kunal ! |
Here are the changes proposed here