
ColBERT local mode support, both as retriever and reranker #797

Merged — 32 commits merged into stanfordnlp:main on Jun 15, 2024

Conversation

Athe-kunal
Contributor

Here are the changes proposed in this PR:

  1. ColBERT as a local retriever (see the sketch after this list)
  2. ColBERT as a local reranker
  3. Reranking is not treated as a first-class citizen; instead, a RetrieveThenRerank module is added alongside Retrieve for retrieving and then reranking
  4. A Jupyter notebook is attached to show the implementation details
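
For orientation, here is a minimal retriever-side sketch. The class name matches what this PR introduces, but the ColBERTConfig import path, the constructor arguments, and the toy passage list are assumptions, not code quoted from the PR:

import dspy
from colbert.infra import ColBERTConfig  # import path assumed from the colbert-ai package

# Hypothetical toy corpus to index locally.
passages = ["No pain, no gain.", "To be or not to be, that is the question."]

colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'

# Build (or load) a local ColBERT index over the passages.
colbert_retriever = dspy.ColBERTv2RetrieverLocal(
    passages, colbert_config=colbert_config, load_only=False)

dspy.settings.configure(rm=colbert_retriever)
retrieve = dspy.Retrieve(k=5)
pred = retrieve("What is the meaning of life?")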

@Athe-kunal
Contributor Author

Fixes and Features in the PR

The notebook to follow along with all the changes is here.

  1. Currently, the retriever used by retrieveEnsemble and retrieve combines the passages from all the queries into one list. This is erroneous: a user who passes multiple queries expects to get the relevant passages for each query in the queries list. In this PR, I have fixed it to return a Prediction object (or a list of Prediction objects), where each Prediction object contains the relevant passages for the corresponding query.
  2. Also, the retriever only returns the text without metadata. Metadata is helpful in downstream tasks such as source citation. Support for returning metadata was added, and the rest of the pipeline does not change. Example of the current output:
[Prediction(
     pid=[6, 48, 74, 47, 33],
     rerank_score=[15.8359375, 14.2109375, 12.5703125, 11.7890625, 9.1796875],
     passages=['The best things in life are free.', 'Patience is a virtue.', 'To be or not to be, that is the question.', 'Out of sight, out of mind.', 'No pain, no gain.']
 ),
 Prediction(
     pid=[33, 0, 47, 74, 16],
     rerank_score=[19.828125, 12.2890625, 11.171875, 9.09375, 6.8984375],
     passages=['No pain, no gain.', "It's a piece of cake.", 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Keep your friends close and your enemies closer.']
 )]
  3. Currently, the reranker does not work in retrieveEnsemble because it expects both the query and the passages, but there is no reranker support. In this PR, I have added support for ColBERT as a reranker. An example is below:
colbert_config = ColBERTConfig()
colbert_config.index_name = 'colbert-ir-index'

colbert_reranker = dspy.ColBERTv2RerankerLocal(
    checkpoint='colbert-ir/colbertv2.0', colbert_config=colbert_config)

# colbert_retriever is a previously constructed local ColBERT retriever
# (see the retriever sketch earlier in this PR description).
dspy.settings.configure(rm=colbert_retriever, reranker=colbert_reranker)

retrieve_rerank = dspy.RetrieveThenRerank(k=5)

pred = retrieve_rerank(
    ["What is the meaning of life?", "Meaning of pain?"]
)

RetrieveThenRerank first retrieves passages and then reranks them using ColBERT's MaxSim operator. Other reranker integrations will follow. The idea of RetrieveThenRerank was mentioned as a TODO here.
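
For reference, ColBERT's MaxSim score sums, over query tokens, the maximum similarity to any document token. This is a generic illustration of the operator, not code from this PR, and it assumes L2-normalized token embeddings:

import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> float:
    # query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim).
    sim = query_embs @ doc_embs.T              # token-level similarity matrix
    # For each query token, take its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()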

@okhat, @arnavsinghvi11, @CShorten, please review the PR and suggest feedback on improving it.

@Josephrp left a comment

very nice work 👏🏻👏🏻

@Athe-kunal
Contributor Author

very nice work 👏🏻👏🏻

Thanks @Josephrp

(Resolved review threads, now outdated: dsp/modules/colbertv2.py, dsp/primitives/search.py, dspy/retrieve/retrieve.py)
# print(queries)
# TODO: Consider removing any quote-like markers that surround the query too.
k = k if k is not None else self.k
passages = dsp.retrieveRerankEnsemble(queries, k=k,**kwargs)
Collaborator

Could we maintain the forward pass call from before and abstract the repetitive code below within the forward pass?

Contributor Author

Sorry, I was not able to understand this, @arnavsinghvi11.
Do you want a common utility function for both Retrieve and RetrieveThenRerank to process the returned documents?

Collaborator

Hi @Athe-kunal , yeah, it seems like there is some repetitive code in both forward passes that can be abstracted out for the different retriever types. Let me know if this change makes sense.

Contributor Author

Hi @arnavsinghvi11
I have abstracted out the repetitive code. However, there are some nuances in the multi-query retriever, so I didn't write a helper function for it. For the single-query case, I added a helper function, single_query_passage, sketched below. Please let me know if I need to make any other changes.
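
A rough sketch of what such a helper can look like. The function name matches the comment above, but the exact field handling, including the long_text-to-passages renaming, is an assumption:

from dspy import Prediction

def single_query_passage(passages):
    # Each passage is assumed to be a dict, e.g. {"long_text": ..., "score": ...}.
    # Collect every field across passages into parallel lists.
    passages_dict = {key: [] for key in passages[0]}
    for doc in passages:
        for key, value in doc.items():
            passages_dict[key].append(value)
    # Expose the retrieved texts under the conventional "passages" field.
    if "long_text" in passages_dict:
        passages_dict["passages"] = passages_dict.pop("long_text")
    return Prediction(**passages_dict)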

@arnavsinghvi11
Collaborator

Thanks a lot for these additions @Athe-kunal ! Really appreciate the tutorial notebook! I've left some comments on the PR, mainly to address some code cleanup.

It would also be great if you could add some documentation for the local ColBERT models to the Retrieval documentation. You can follow the existing ColBERTv2 docs for reference, and it would be great to add more specifics on the user parameters for interacting with the RM and potentially some of the relevant implementation details. Thanks!

@Athe-kunal
Contributor Author

Thanks a ton, @arnavsinghvi11, for this detailed feedback on my PR. It certainly helps me become a better contributor. I have made the changes and also flagged some points of confusion; please do help me out there. I will work on the documentation changes for ColBERT. Looking forward to more collaboration.

@Athe-kunal
Contributor Author

@arnavsinghvi11
I have added the documentation for ColBERT. Please review it and suggest edits.

@Athe-kunal
Contributor Author

@arnavsinghvi11 Can you please review it?

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
It has been a while; can you review the changes here?

@@ -9,17 +10,21 @@ def retrieve(query: str, k: int, **kwargs) -> list[str]:
    """Retrieves passages from the RM for the query and returns the top k passages."""
    if not dsp.settings.rm:
        raise AssertionError("No RM is loaded.")
    if not dsp.settings.reranker:
        warnings.warn("If you want to use the Reranker, please use dspy.RetrieveThenRerank")
Collaborator

"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error.",
- feel free to reference this
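
For illustration only, the reranker warning from the diff above could adopt that style; this exact wording is an assumption, not the reviewer's suggested text:

import warnings

warnings.warn(
    "DeprecationWarning: if you want to use the Reranker, please use "
    "dspy.RetrieveThenRerank. In the future this will raise an error.",
)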


def retrieveEnsemble(queries: list[str], k: int, by_prob: bool = True, **kwargs) -> list[str]:
    """Retrieves passages from the RM for each query in queries and returns the top k passages
    based on the probability or score.
    """
    if not dsp.settings.rm:
        raise AssertionError("No RM is loaded.")
    if dsp.settings.reranker:
        return retrieveRerankEnsemble(queries, k)
    if not dsp.settings.reranker:
Collaborator

As with the above, feel free to reference this warning style:

"DeprecationWarning: 'display' has been deprecated. To see all information for debugging, use 'dspy.set_log_level('debug')'. In the future this will raise an error."

(Resolved review thread, now outdated: dsp/primitives/search.py)
@arnavsinghvi11
Collaborator

Hi @Athe-kunal , just left some follow-up comments. It seems that there are still some leftover comments to address. Feel free to discuss them more directly here as needed.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I have made the required changes and resolved some issues. Please let me know if this is good to merge.

@ahmed-moubtahij

Great PR! Hope it gets merged soon, I find myself needing this.

@Athe-kunal
Contributor Author

@arnavsinghvi11
Can you review this again?

@arnavsinghvi11
Collaborator

Hi @Athe-kunal , thanks again. I'd love to merge this PR, but unfortunately it breaks some existing caches we have in the repository, particularly for intro.ipynb. For more clarity, the changes to search.py and retrieve.py break the cached retrieval outputs in the intro notebook (which we keep fully cached so users can interact with the DSPy tutorial without having to expend their API key credits).

If you could make changes to this PR that preserve the existing search/retrieve functions as I mentioned earlier, while introducing the new behavior separately, I can merge it after that. Lmk if this makes sense!

@Athe-kunal
Contributor Author

Thanks for the suggestion, @arnavsinghvi11.
I have a query regarding the joblib memory cache. In the current implementation, a list of texts (only the relevant context) is being cached, but in my implementation I return a Prediction object. Can joblib cache arbitrary data types, or just text? (See the sketch below.)
Also, for the local ColBERTv2, should I integrate the caching mechanism? It is a local model, so it does not require sending requests to an API. Can you help me with this on the PR, @arnavsinghvi11?
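
For context on the joblib question: joblib's Memory.cache pickles return values, so it can cache arbitrary picklable objects, not just strings. A minimal sketch; the cached function and the cache directory here are hypothetical:

from joblib import Memory

memory = Memory(location="./cachedir", verbose=0)

@memory.cache
def cached_retrieve(query: str, k: int):
    # Any picklable return value can be cached, including dicts that could
    # later be wrapped into a dspy Prediction object.
    return {"passages": [f"passage {i} for {query}" for i in range(k)],
            "scores": [float(k - i) for i in range(k)]}

result = cached_retrieve("What is the meaning of life?", 3)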

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Can you help me with the caching functionality for this PR? I am unable to understand the caching mechanism here.

@arnavsinghvi11
Collaborator

Hi @Athe-kunal ,

I would first recommend keeping the search and retrieve abstractions with as few differences from the originals as possible. I think I mentioned this in an earlier review, but the changes to retrieve, retrieveRerankEnsemble, retrieveEnsemble, and the forward function in Retrieve directly impact existing caches for the intro.ipynb notebook. Any changes to those make it difficult to merge PRs without having to rerun all the requests with the new changes (which we highly prefer not to do, to maintain consistency).

In simpler terms, the ColBERTv2 pipeline with search and retrieve should work as-is, and you can test this with your PR changes by running the intro notebook and checking whether any of the changes cause the notebook to fail to execute.
Let me know if this is clear: it's not a caching problem to implement in this PR, but rather maintaining existing behavior so that existing caches keep working!

Some steps I suggest are reverting to the original behavior of search and retrieve and then integrating ColBERTv2RetrieverLocal and ColBERTv2RerankerLocal to work with those untouched functions. If you follow the existing setup of ColBERTv2 with its corresponding return types, I think this can work (a rough sketch of that return shape follows below). Once that's done, feel free to ping me on the PR so we can take a look and ensure no existing caches break.

From there, the next step would be to mirror the existing colbertv2 cache functions to align with the local mode being introduced.

Let me know if this all makes sense!
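
A rough illustration of the backward-compatible return shape described above: the existing dspy.ColBERTv2 client returns a list of dotdicts with a long_text field, so a local retriever can mirror that. The function below and its placeholder results are hypothetical:

from dsp.utils import dotdict

def local_colbert_search(query: str, k: int = 3):
    # Placeholder results; a real implementation would query the local ColBERT index.
    hits = [("No pain, no gain.", 19.8),
            ("Out of sight, out of mind.", 11.2),
            ("The best things in life are free.", 9.5)]
    # Mirror the dspy.ColBERTv2 return type: a list of dotdicts with `long_text`.
    return [dotdict({"long_text": passage, "score": score}) for passage, score in hits[:k]]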

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Thanks for your detailed description. I have one small query: the ColBERT retriever is a local model. Caching is essential when sending requests to a third-party API, but if it is a local model, does it require caching? If users expose the ColBERT model through a server, they can directly use the previous ColBERTv2 retriever, which has caching. Please let me know if I am missing something.

@arnavsinghvi11
Collaborator

But if it is a local model, then does it require caching?

Not a direct answer, but HFClientTGI supports caching for models hosted locally (REST API). Maybe it would benefit here as well. But caching is not a requirement with this PR. Ensuring backward compatibility with the existing dspy.ColBERTv2 in addition to supporting local ColBERT is!

@Athe-kunal
Contributor Author

Athe-kunal commented May 16, 2024

Got it, @arnavsinghvi11, working on it.
I thought that users could first create a local ColBERTv2 index or reranker, then expose it via a server, and then use the existing ColBERTv2 functionality (which supports caching). However, adding caching for local models can also be helpful. I provided a notebook for these ColBERTv2 retrievers and the reranker, and the search and retrieve functions were working properly.

Also, if you don't mind, can you share your Discord handle on the DSPy Discord channel? It would be helpful to interact there too.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I tested with the intro.ipynb file, and it worked without hiccups.
Can you respond to the above requests?

@arnavsinghvi11
Collaborator

Hi @Athe-kunal , I don't see any changes to the PR. Could you check if they are pushed? My username is bigman11, btw.

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
I have made the following changes:

  1. Added separate functions for retrieving with metadata so that the existing cache won't break. I pass a with_metadata parameter, which has a default value of False, so the current tutorials work fine with it.
  2. Also, I added a by_prob parameter with a default value of True to dspy.Retrieve. It turns out that dsp.retrieve needs this by_prob parameter, but it was being passed as a kwargs parameter. A rough usage sketch follows below.
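
A minimal usage sketch of the two parameters just described; the forward signature is assumed from this comment, not quoted from the code:

import dspy

# Assumes an RM has already been configured via dspy.settings.configure(rm=...).
retriever = dspy.Retrieve(k=3)

# Default call: behaves as before, so existing caches and tutorials are unaffected.
plain = retriever("What is the meaning of life?").passages

# Opt-in call: also request metadata alongside the passages.
rich = retriever("What is the meaning of life?", by_prob=True, with_metadata=True)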

This PR has grown quite large, so I will work on ColBERT local caching in another PR. Can you review this one, @arnavsinghvi11?

@arnavsinghvi11
Collaborator

Thanks @Athe-kunal for your patience with this PR! The changes no longer break the existing intro.ipynb caches, and this is good to merge. Did you want to add all these changes to the separate PR, or should I go ahead and squash and merge the changes?

@Athe-kunal
Contributor Author

Hi @arnavsinghvi11
Thanks for your support and guidance throughout the PR process; it was a great learning experience. You can squash and merge the changes for this PR.
I will work on caching for the new ColBERT local models this weekend, but for now you can merge this one.

Thanks again for helping me with this PR.

@arnavsinghvi11 merged commit 37b3759 into stanfordnlp:main on Jun 15, 2024
3 of 4 checks passed
@arnavsinghvi11
Collaborator

Merged. Thanks again @Athe-kunal !
