Colbert local mode support both as retriever and reranker. #797

Merged
merged 32 commits into from
Jun 15, 2024
9632e5e
return metadata changes
Athe-kunal Apr 4, 2024
e415f39
Merge branch 'main' of https://github.com/Athe-kunal/dspy
Athe-kunal Apr 4, 2024
a4b3844
add metadata changes
Athe-kunal Apr 4, 2024
321a768
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 5, 2024
6cd1d56
add support for returning metadata and reranking
Athe-kunal Apr 6, 2024
eeafacb
colbert integration
Athe-kunal Apr 8, 2024
1639bd2
colbert local modifications
Athe-kunal Apr 8, 2024
ec062b6
kwargs filtered ids
Athe-kunal Apr 8, 2024
987d923
colbert return
Athe-kunal Apr 8, 2024
9ff5b28
colbert retriever and reranker
Athe-kunal Apr 9, 2024
825a272
colbert retriever error fixes
Athe-kunal Apr 9, 2024
c25e9c4
colbert config changes in __init__
Athe-kunal Apr 10, 2024
ab5b12e
colbert notebook
Athe-kunal Apr 10, 2024
63dd534
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 10, 2024
f6a9293
import errors for colbert
Athe-kunal Apr 10, 2024
197a2c2
improt dspy fixes and linting fixes
Athe-kunal Apr 10, 2024
4698b00
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 13, 2024
81d142f
PR fixes for colbert
Athe-kunal Apr 13, 2024
b73753c
making the linting gods happy
Athe-kunal Apr 13, 2024
0ec1ded
remove unnecessary outputs
Athe-kunal Apr 14, 2024
567d5c4
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 17, 2024
685df2a
colbertv2 docs
Athe-kunal Apr 17, 2024
fa2bc20
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 19, 2024
509b36c
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 20, 2024
34328fd
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 22, 2024
146ec7b
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 26, 2024
f0437e3
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 29, 2024
9cb522b
Colbert PR fixes
Athe-kunal Apr 29, 2024
ec4b9b3
linting fixes
Athe-kunal Apr 29, 2024
326ce01
more linting fixes
Athe-kunal Apr 29, 2024
b5913fc
fixing previous cache breaks with separate funcs
Athe-kunal Jun 8, 2024
c60fadc
Merge branch 'main' into main
arnavsinghvi11 Jun 15, 2024
colbertv2 docs
Athe-kunal committed Apr 17, 2024
commit 685df2a24e1633992505c2e534630f7207931df6
78 changes: 78 additions & 0 deletions docs/api/retrieval_model_clients/ColBERTv2.md
Expand Up @@ -49,3 +49,81 @@ retrieval_response = colbertv2_wiki17_abstracts('When was the first FIFA World C
for result in retrieval_response:
print("Text:", result['text'], "\n")
```

# dspy.ColBERTv2RetrieverLocal

This section draws on the official documentation of [ColBERTv2](https://github.com/stanford-futuredata/ColBERT/tree/main), which follows the [paper](https://arxiv.org/abs/2112.01488).

You can install ColBERTv2 by following the instructions [here](https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#installation).

### Constructor
The constructor initializes ColBERTv2 as a local retriever object. To run a server from your local ColBERTv2 instance, you can use the code snippet [here](https://github.com/stanford-futuredata/ColBERT/blob/main/server.py).

```python
class ColBERTv2RetrieverLocal:
    def __init__(
        self,
        passages: List[str],
        colbert_config=None,
        load_only: bool = False):
```

**Parameters**
- `passages` (_List[str]_): List of passages to be indexed.
- `colbert_config` (_ColBERTConfig_, _Optional_): ColBERT config used for building and searching the index. Defaults to None.
- `load_only` (_Boolean_): If `True`, load an existing index; if `False`, build the index and then load it. Defaults to False.

Although `colbert_config` defaults to None, a config object is required for ColBERTv2. You can import the class with `from colbert.infra.config import ColBERTConfig`. The config attributes are described [here](https://github.com/stanford-futuredata/ColBERT/blob/main/colbert/infra/config/settings.py).

### Methods

#### `forward(self, query:str, k:int, **kwargs) -> Union[list[str], list[dotdict]]`

Retrieves the passages in the index that are most relevant to the query. If you already have a local index, pass `load_only=True` to the constructor and set the `index` attribute of ColBERTConfig to the local path. Also set the `checkpoint` attribute of ColBERTConfig to the embedding model that was used to build the index.

**Parameters:**
- `query` (_str_): Query string used for retrieval.
- `k` (_int_, _optional_): Number of passages to retrieve. Defaults to 7.

It returns a `Prediction` object for each query, for example:

```python
Prediction(
    pid=[33, 6, 47, 74, 48],
    passages=['No pain, no gain.', 'The best things in life are free.', 'Out of sight, out of mind.', 'To be or not to be, that is the question.', 'Patience is a virtue.']
)
```
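
The `pid` and `passages` lists in a result like the one above appear to be parallel, so they can be paired by position. The following is a minimal sketch under that assumption. It uses `types.SimpleNamespace` as a stand-in for the returned `Prediction` (an assumption made only so the snippet runs without DSPy or a built index); the variable names are illustrative.

```python
from types import SimpleNamespace

# Stand-in for the returned Prediction (assumption: attribute access works
# the same way). The values are copied from the example output above.
prediction = SimpleNamespace(
    pid=[33, 6, 47, 74, 48],
    passages=[
        "No pain, no gain.",
        "The best things in life are free.",
        "Out of sight, out of mind.",
        "To be or not to be, that is the question.",
        "Patience is a virtue.",
    ],
)

# Assumption: pid[i] is the index id of passages[i], so zip pairs them by position.
ranked = list(zip(prediction.pid, prediction.passages))

for rank, (pid, passage) in enumerate(ranked, start=1):
    print(f"{rank}. [pid={pid}] {passage}")
```

Keeping the passage id next to the text makes it easier to look the passage up again in the original `passages` list passed to the constructor.
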
# dspy.ColBERTv2RerankerLocal

You can also use ColBERTv2 as a reranker in DSPy.

### Constructor

```python
class ColBERTv2RerankerLocal:
    def __init__(
        self,
        colbert_config=None,
        checkpoint: str = 'bert-base-uncased'):
```

**Parameters**
- `colbert_config` (_ColBERTConfig_, _Optional_): ColBERT config used for building and searching. Defaults to None.
- `checkpoint` (_str_): Embedding model used to embed the documents and the query. Defaults to `'bert-base-uncased'`.

### Methods
#### `forward(self,query:str,passages:List[str])`

Given a query and a list of passages, it computes a similarity score for each passage, which lets you order the passages from most to least similar to the query.

**Parameters:**
- `query` (_str_): Query string used for reranking.
- `passages` (_List[str]_): List of passages to be reranked.

It returns an array of similarity scores, which you can map back to the passages as follows:

```python
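import numpy as np
# scores_arr is the array returned by forward(); passages is the list passed in.
for idx in np.argsort(scores_arr)[::-1]:
    print(f"Passage = {passages[idx]} --> Score = {scores_arr[idx]}")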
```
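
For intuition about where such scores come from, ColBERT compares a query and a passage by late interaction (MaxSim): each query token embedding is matched with its most similar passage token embedding, and those maxima are summed. The sketch below illustrates that idea in plain Python with made-up 2-dimensional token embeddings. It is not the code in `dsp/modules/colbertv2.py`; the helper names `dot`, `maxsim_score`, and `rerank` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token, take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rerank(
    query_tokens: List[Vector],
    passages: List[str],
    passage_tokens: List[List[Vector]],
) -> List[Tuple[str, float]]:
    """Score every passage and return (passage, score) pairs, best first."""
    scores = [maxsim_score(query_tokens, toks) for toks in passage_tokens]
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]


# Toy embeddings invented for illustration; a real model produces these.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
passages = ["passage A", "passage B", "passage C"]
passage_tokens = [
    [[0.5, 0.5]],              # one token that partly matches both query tokens
    [[1.0, 0.0], [0.0, 1.0]],  # one exact match for each query token
    [[0.0, 0.25]],             # weak match for the second query token only
]

ranked = rerank(query_tokens, passages, passage_tokens)
for passage, score in ranked:
    print(f"Passage = {passage} --> Score = {score}")
```

Sorting indices by score and reversing the order plays the same role here as `np.argsort(scores_arr)[::-1]` in the snippet above.
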
4 changes: 2 additions & 2 deletions dsp/modules/colbertv2.py
Expand Up @@ -164,7 +164,7 @@ def __init__(self,colbert_config=None,checkpoint:str='bert-base-uncased'):
checkpoint_name (str, optional): checkpoint for embeddings. Defaults to 'bert-base-uncased'.
"""
self.colbert_config = colbert_config
- self.checkpoint_name = checkpoint
+ self.checkpoint = checkpoint
self.colbert_config.checkpoint = checkpoint

def __call__(self, *args: Any, **kwargs: Any) -> Any:
Expand All @@ -184,7 +184,7 @@ def forward(self,query:str,passages:List[str]=[]):
query_ids,query_masks = query_tokenizer.tensorize([query])
doc_ids,doc_masks = doc_tokenizer.tensorize(passages)

- col = ColBERT(self.checkpoint_name,self.colbert_config)
+ col = ColBERT(self.checkpoint,self.colbert_config)
Q = col.query(query_ids,query_masks)
DOC_IDS,DOC_MASKS = col.doc(doc_ids,doc_masks,keep_dims='return_mask')
Q_duplicated = Q.repeat_interleave(len(passages), dim=0).contiguous()