
ColBERT local mode support, both as retriever and reranker #797

Merged — 32 commits merged into stanfordnlp:main on Jun 15, 2024

Conversation

Athe-kunal
Contributor

Here are the changes proposed in this PR:

  1. ColBERT as a local retriever (see the sketch after this list)
  2. ColBERT as a local reranker
  3. Reranking is not treated as a first-class citizen; instead, a RetrieveThenRerank module is added alongside Retrieve for retrieving and then reranking
  4. A Jupyter notebook is attached to show the implementation details
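
For orientation, here is a minimal retriever-side sketch. The class name matches what this PR introduces, but the ColBERTConfig import path, the constructor arguments, and the toy passage list are assumptions, not code quoted from the PR:

import dspy
from colbert.infra import ColBERTConfig  # import path assumed from the colbert-ai package

# Hypothetical toy corpus to index locally.
passages = ["No pain, no gain.", "To be or not to be, that is the question."]

colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'

# Build (or load) a local ColBERT index over the passages.
colbert_retriever = dspy.ColBERTv2RetrieverLocal(
    passages, colbert_config=colbert_config, load_only=False)

dspy.settings.configure(rm=colbert_retriever)
retrieve = dspy.Retrieve(k=5)
pred = retrieve("What is the meaning of life?")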

@Athe-kunal
Contributor Author

Fixes and Features in the PR

The notebook to follow along with all the changes is here.

  1. Currently, the retriever used by retrieveEnsemble and retrieve combines the passages from all the queries into one list. This is erroneous: a user who passes multiple queries expects to get the relevant passages for each query in the queries list. In this PR, I have fixed it to return a Prediction object (or a list of Prediction objects), where each Prediction object contains the relevant passages for the corresponding query.
  2. Also, the retriever only returns the text without metadata. Metadata is helpful in downstream tasks such as source citation. Support for returning metadata was added, and the rest of the pipeline does not change. Example of the current output:
[Prediction(
     pid=[6, 48, 74, 47, 33],
     rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
     passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
 ),
 Prediction(
     pid=[33, 0, 47, 74, 16],
     rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
     passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
 )]
  3. Currently, the reranker does not work in retrieveEnsemble because it expects both the query and the passages, but there is no reranker support. In this PR, I have added support for ColBERT as a reranker. An example is below:
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'

colbert_reranker = dspy.ColBERTv2RerankerLocal(
    checkpoint='colbert-ir/colbertv2.0', colbert_config=colbert_config)

# colbert_retriever is a previously constructed local ColBERT retriever
# (see the retriever sketch earlier in this PR description).
dspy.settings.configure(rm=colbert_retriever, reranker=colbert_reranker)

retrieve_rerank = dspy.RetrieveThenRerank(k=5)

pred = retrieve_rerank(
    ["What is the meaning of life?", "Meaning of pain?"]
)

RetrieveThenRerank first retrieves passages and then reranks them using ColBERT's MaxSim operator. Other reranker integrations will follow. The idea of RetrieveThenRerank was mentioned as a TODO here.
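
For reference, ColBERT's MaxSim score sums, over query tokens, the maximum similarity to any document token. This is a generic illustration of the operator, not code from this PR, and it assumes L2-normalized token embeddings:

import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> float:
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim).
    sim = query_embs @ doc_embs.T              # token-level similarity matrix
    # For each query token, take its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()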

@okhat, @arnavsinghvi11, @CShorten, please review the PR and suggest feedback on improving it.

@Josephrp left a comment

very nice work 👏🏻👏🏻

@Athe-kunal
Contributor Author

very nice work 👏🏻👏🏻

Thanks @Josephrp

(Resolved review threads, now outdated: dsp/modules/colbertv2.py, dsp/primitives/search.py, dspy/retrieve/retrieve.py)
# print(queries)
# TODO: Consider removing any quote-like markers that surround the query too.
k = k if k is not None else self.k
passages = dsp.retrieveRerankEnsemble(queries, k=k,**kwargs)
Collaborator

Could we maintain the forward pass call from before and abstract the repetitive code below within the forward pass?

Contributor Author

Sorry, I was not able to understand this, @arnavsinghvi11.
Do you want a common utility function for both Retrieve and RetrieveThenRerank to process the returned documents?

Collaborator

Hi @Athe-kunal , yeah, it seems like there is some repetitive code in both forward passes that can be abstracted out for the different retriever types. Let me know if this change makes sense.

Contributor Author

Hi @arnavsinghvi11
I have abstracted out the repetitive code. However, there are some nuances in the multi-query retriever, so I didn't write a helper function for it. For the single-query case, I added a helper function, single_query_passage, sketched below. Please let me know if I need to make any other changes.
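
A rough sketch of what such a helper can look like. The function name matches the comment above, but the exact field handling, including the long_text-to-passages renaming, is an assumption:

from dspy import Prediction

def single_query_passage(passages):
    # Each passage is assumed to be a dict, e.g. {"long_text": ..., "score": ...}.
    # Collect every field across passages into parallel lists.
    passages_dict = {key: [] for key in passages[0]}
    for doc in passages:
        for key, value in doc.items():
            passages_dict[key].append(value)
    # Expose the retrieved texts under the conventional "passages" field.
    if "long_text" in passages_dict:
        passages_dict["passages"] = passages_dict.pop("long_text")
    return Prediction(**passages_dict)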

@arnavsinghvi11
Collaborator

Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments on the PR, mainly to address some code cleanup.

It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference, and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks!

@Athe-kunal
Contributor Author

Thanks a ton, @arnavsinghvi11, for this detailed feedback on my PR. It certainly helps me become a better contributor. I have made the changes and also flagged some points of confusion; please do help me out there. I will work on the documentation changes for ColBERT. Looking forward to more collaboration.

@Athe-kunal
Contributor Author

@arnavsinghvi11
I have added the documentation for ColBERT. Please review it and suggest edits.

@Athe-kunal
Contributor Author

@arnavsinghvi11 Can you please review it?

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
It has been a while; can you review the changes here?

@@ -9,17 +10,21 @@ def retrieve(query: str, k: int, **kwargs) -> list[str]:
    """Retrieves passages from the RM for the query and returns the top k passages."""
    if not dsp.settings.rm:
        raise AssertionError("No RM is loaded.")
    if not dsp.settings.reranker:
        warnings.warn("If you want to use the Reranker, please use dspy.RetrieveThenRerank")
Collaborator

"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error.",
- feel free to reference this
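
For illustration only, the reranker warning from the diff above could adopt that style; this exact wording is an assumption, not the reviewer's suggested text:

import warnings

warnings.warn(
    "DeprecationWarning: if you want to use the Reranker, please use "
    "dspy.RetrieveThenRerank. In the future this will raise an error.",
)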


def retrieveEnsemble(queries: list[str], k: int, by_prob: bool = True, **kwargs) -> list[str]:
    """Retrieves passages from the RM for each query in queries and returns the top k passages
    based on the probability or score.
    """
    if not dsp.settings.rm:
        raise AssertionError("No RM is loaded.")
    if dsp.settings.reranker:
        return retrieveRerankEnsemble(queries, k)
    if not dsp.settings.reranker:
Collaborator

As with the above, feel free to reference this warning style:

"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error."

(Resolved review thread, now outdated: dsp/primitives/search.py)
@arnavsinghvi11
Collaborator

Hi @Athe-kunal , just left some follow-up comments. It seems that there are still some leftover comments to address. Feel free to discuss them more directly here as needed.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I have made the required changes and resolved some issues. Please let me know if this is good to merge.

@ahmed-moubtahij

Great PR! Hope it gets merged soon, I find myself needing this.

@Athe-kunal
Contributor Author

@arnavsinghvi11
Can you review this again?

@arnavsinghvi11
Collaborator

Hi @Athe-kunal , thanks again. I'd love to merge this PR, but unfortunately it breaks some existing caches we have in the repository, particularly for intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieval outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits).

If you could make changes to this PR that preserve the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. Lmk if this makes sense!

@Athe-kunal
Contributor Author

Thanks for the suggestion, @arnavsinghvi11.
I have a query regarding the joblib memory cache. In the current implementation, a list of texts (only the relevant context) is being cached, but in my implementation I return a Prediction object. Can joblib cache arbitrary data types, or just text? (See the sketch below.)
Also, for the local ColBERTv2, should I integrate the caching mechanism? It is a local model, so it does not require sending requests to an API. Can you help me with this on the PR, @arnavsinghvi11?
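
For context on the joblib question: joblib's Memory.cache pickles return values, so it can cache arbitrary picklable objects, not just strings. A minimal sketch; the cached function and the cache directory here are hypothetical:

from joblib import Memory

memory = Memory(location="./cachedir", verbose=0)

@memory.cache
def cached_retrieve(query: str, k: int):
    # Any picklable return value can be cached, including dicts that could
    # later be wrapped into a dspy Prediction object.
    return {"passages": [f"passage {i} for {query}" for i in range(k)],
            "scores": [float(k - i) for i in range(k)]}

result = cached_retrieve("What is the meaning of life?", 3)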

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Can you help me with the caching functionality for this PR? I am unable to understand the caching mechanism here.

@arnavsinghvi11
Collaborator

Hi @Athe-kunal ,

I would first recommend keeping the search and retrieve abstractions with as few differences from the originals as possible. I think I mentioned this in an earlier review, but the changes to retrieve, retrieveRerankEnsemble, retrieveEnsemble, and the forward function in Retrieve directly impact existing caches for the intro.ipynb notebook. Any changes to those make it difficult to merge PRs without having to rerun all the requests with the new changes (which we highly prefer not to do, to maintain consistency).

In simpler terms, the ColBERTv2 pipeline with search and retrieve should work as-is, and you can test this with your PR changes by running the intro notebook and checking whether any of the changes cause the notebook to fail to execute.
Let me know if this is clear: it's not a caching problem to implement in this PR, but rather maintaining existing behavior so that existing caches keep working!

Some steps I suggest are reverting to the original behavior of search and retrieve and then integrating ColBERTv2RetrieverLocal and ColBERTv2RerankerLocal to work with those untouched functions. If you follow the existing setup of ColBERTv2 with its corresponding return types, I think this can work (a rough sketch of that return shape follows below). Once that's done, feel free to ping me on the PR so we can take a look and ensure no existing caches break.

From there, the next step would be to mirror the existing colbertv2 cache functions to align with the local mode being introduced.

Let me know if this all makes sense!
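
A rough illustration of the backward-compatible return shape described above: the existing dspy.ColBERTv2 client returns a list of dotdicts with a long_text field, so a local retriever can mirror that. The function below and its placeholder results are hypothetical:

from dsp.utils import dotdict

def local_colbert_search(query: str, k: int = 3):
    # Placeholder results; a real implementation would query the local ColBERT index.
    hits = [("No pain, no gain.", 19.8),
            ("Out of sight, out of mind.", 11.2),
            ("The best things in life are free.", 9.5)]
    # Mirror the dspy.ColBERTv2 return type: a list of dotdicts with `long_text`.
    return [dotdict({"long_text": passage, "score": score}) for passage, score in hits[:k]]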

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Thanks for your detailed description. I have one small query: the ColBERT retriever is a local model. Caching is essential when sending requests to a third-party API, but if it is a local model, does it require caching? If users expose the ColBERT model through a server, they can directly use the previous ColBERTv2 retriever, which has caching. Please let me know if I am missing something.

@arnavsinghvi11
Collaborator

But if it is a local model, then does it require caching?

Not a direct answer, but HFClientTGI supports caching for models hosted locally (REST API). Maybe it would benefit here as well. But caching is not a requirement with this PR. Ensuring backward compatibility with the existing dspy.ColBERTv2 in addition to supporting local ColBERT is!

@Athe-kunal
Contributor Author

Athe-kunal commented May 16, 2024

Got it, @arnavsinghvi11, working on it.
I thought that users could first create a local ColBERTv2 index or reranker, then expose it via a server, and then use the existing ColBERTv2 functionality (which supports caching). However, adding caching for local models can also be helpful. I provided a notebook for these ColBERTv2 retrievers and the reranker, and the search and retrieve functions were working properly.

Also, if you don't mind, can you share your Discord handle on the DSPy Discord channel? It would be helpful to interact there too.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I tested with the intro.ipynb file, and it worked without hiccups.
Can you respond to the above requests?

@arnavsinghvi11
Collaborator

Hi @Athe-kunal , I don't see any changes to the PR. Could you check if they are pushed? My username is bigman11, btw.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I have made the following changes:

  1. Added separate functions for retrieving with metadata so that the existing cache won't break. I pass a with_metadata parameter, which has a default value of False, so the current tutorials work fine with it.
  2. Also, I added a by_prob parameter with a default value of True to dspy.Retrieve. It turns out that dsp.retrieve needs this by_prob parameter, but it was being passed as a kwargs parameter. A rough usage sketch follows below.
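
A minimal usage sketch of the two parameters just described; the forward signature is assumed from this comment, not quoted from the code:

import dspy

# Assumes an RM has already been configured via dspy.settings.configure(rm=...).
retriever = dspy.Retrieve(k=3)

# Default call: behaves as before, so existing caches and tutorials are unaffected.
plain = retriever("What is the meaning of life?").passages

# Opt-in call: also request metadata alongside the passages.
rich = retriever("What is the meaning of life?", by_prob=True, with_metadata=True)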

This PR has grown quite large, so I will work on ColBERT local caching in another PR. Can you review this one, @arnavsinghvi11?

@arnavsinghvi11
Collaborator

Thanks @Athe-kunal for your patience with this PR! The changes no longer break the existing intro.ipynb caches, and this is good to merge. Did you want to add all these changes to the separate PR, or should I go ahead and squash and merge the changes?

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Thanks for your support and guidance throughout the PR process; it was a great learning experience. You can squash and merge the changes for this PR.
I will work on caching for the new ColBERT local models this weekend, but for now you can merge this one.

Thanks again for helping me with this PR.

@arnavsinghvi11 merged commit 37b3759 into stanfordnlp:main on Jun 15, 2024
3 of 4 checks passed
@arnavsinghvi11
Collaborator

Merged. Thanks again @Athe-kunal !
