Colbert local mode support both as retriever and reranker. #797

Merged
merged 32 commits into from
Jun 15, 2024
Commits
9632e5e
return metadata changes
Athe-kunal Apr 4, 2024
e415f39
Merge branch 'main' of https://github.com/Athe-kunal/dspy
Athe-kunal Apr 4, 2024
a4b3844
add metadata changes
Athe-kunal Apr 4, 2024
321a768
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 5, 2024
6cd1d56
add support for returning metadata and reranking
Athe-kunal Apr 6, 2024
eeafacb
colbert integration
Athe-kunal Apr 8, 2024
1639bd2
colbert local modifications
Athe-kunal Apr 8, 2024
ec062b6
kwargs filtered ids
Athe-kunal Apr 8, 2024
987d923
colbert return
Athe-kunal Apr 8, 2024
9ff5b28
colbert retriever and reranker
Athe-kunal Apr 9, 2024
825a272
colbert retriever error fixes
Athe-kunal Apr 9, 2024
c25e9c4
colbert config changes in __init__
Athe-kunal Apr 10, 2024
ab5b12e
colbert notebook
Athe-kunal Apr 10, 2024
63dd534
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 10, 2024
f6a9293
import errors for colbert
Athe-kunal Apr 10, 2024
197a2c2
improt dspy fixes and linting fixes
Athe-kunal Apr 10, 2024
4698b00
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 13, 2024
81d142f
PR fixes for colbert
Athe-kunal Apr 13, 2024
b73753c
making the linting gods happy
Athe-kunal Apr 13, 2024
0ec1ded
remove unnecessary outputs
Athe-kunal Apr 14, 2024
567d5c4
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 17, 2024
685df2a
colbertv2 docs
Athe-kunal Apr 17, 2024
fa2bc20
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 19, 2024
509b36c
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 20, 2024
34328fd
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 22, 2024
146ec7b
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 26, 2024
f0437e3
Merge branch 'stanfordnlp:main' into main
Athe-kunal Apr 29, 2024
9cb522b
Colbert PR fixes
Athe-kunal Apr 29, 2024
ec4b9b3
linting fixes
Athe-kunal Apr 29, 2024
326ce01
more linting fixes
Athe-kunal Apr 29, 2024
b5913fc
fixing previous cache breaks with separate funcs
Athe-kunal Jun 8, 2024
c60fadc
Merge branch 'main' into main
arnavsinghvi11 Jun 15, 2024
colbert config changes in __init__
Athe-kunal committed Apr 10, 2024
commit c25e9c44ed3202e6b770bc07225d5c5ed39fa5ef
42 changes: 22 additions & 20 deletions dsp/modules/colbertv2.py
@@ -77,39 +77,41 @@ def colbertv2_post_request_v2_wrapped(*args, **kwargs):
os.environ['COLBERT_LOAD_TORCH_EXTENSION_VERBOSE'] = "True"

class ColBERTv2RetrieverLocal:
def __init__(self,checkpoint:str='colbert-ir/colbertv2.0',passages:List[str]=[],index_name_or_path:str = "Colbert-RM",experiment_name:str="Colbert-Experiment",load_only:bool=False,nranks:int=1,nbits:int=2,DOC_MAXLEN:int=300,INDEX_BSIZE:int=256,KMEANS_ITER:int=8,**kwargs):


from colbert.infra import Run, RunConfig, ColBERTConfig
def __init__(self,passages:List[str],load_only:bool=False,checkpoint:str='colbert-ir/colbertv2.0',colbert_config:ColBERTConfig=ColBERTConfig()):
"""Colbertv2 retriever module

Args:
passages (List[str]): list of passages
load_only (bool, optional): whether to load an existing index instead of building a new one. Defaults to False.
checkpoint (str, optional): checkpoint for generating embeddings. Defaults to 'colbert-ir/colbertv2.0'.
colbert_config (ColBERTConfig, optional): colbert config for building and searching. Defaults to ColBERTConfig().
"""
self.checkpoint = checkpoint
self.index_name_or_path = index_name_or_path
self.experiment_name = experiment_name
self.nranks = nranks
self.nbits = nbits
self.DOC_MAXLEN = DOC_MAXLEN
self.INDEX_BSIZE = INDEX_BSIZE
self.KMEANS_ITER = KMEANS_ITER
self.colbert_config = colbert_config
self.checkpoint = checkpoint
self.colbert_config.checkpoint = checkpoint
self.passages = passages

if not load_only:
print(f"Building the index for experiment {self.experiment_name} with index name {self.index_name_or_path}")
self.build_index(**kwargs)
print(f"Building the index for experiment {self.colbert_config.experiment} with index name {self.colbert_config.index_name}")
self.build_index()

print(f"Loading the index for experiment {self.experiment_name} with index name {self.index_name_or_path}")
print(f"Loading the index for experiment {self.colbert_config.experiment} with index name {self.colbert_config.index_name}")
self.searcher = self.get_index()

def build_index(self,**kwargs):
def build_index(self):

try:
import colbert
except ImportError:
print("Colbert not found. Please check your installation or install the module using pip install colbert-ai[faiss-gpu,torch].")

from colbert import Indexer
from colbert.infra import Run, RunConfig, ColBERTConfig
with Run().context(RunConfig(nranks=self.nranks, experiment=self.experiment_name)):
config = ColBERTConfig(doc_maxlen=self.DOC_MAXLEN, nbits=self.nbits, kmeans_niters=self.KMEANS_ITER,index_bsize=self.INDEX_BSIZE,**kwargs)
indexer = Indexer(checkpoint=self.checkpoint, config=config)
indexer.index(name=self.index_name_or_path, collection=self.passages, overwrite=True)
from colbert.infra import Run, RunConfig
with Run().context(RunConfig(nranks=self.colbert_config.nranks, experiment=self.colbert_config.experiment)):
indexer = Indexer(checkpoint=self.checkpoint, config=self.colbert_config)
indexer.index(name=self.colbert_config.index_name, collection=self.passages, overwrite=True)

def get_index(self):
try:
@@ -152,7 +154,7 @@ class ColBERTv2RerankerLocal:
print("Colbert not found. Please check your installation or install the module using pip install colbert-ai[faiss-gpu,torch].")
from colbert.infra.config.config import ColBERTConfig

def __init__(self,checkpoint_name:str='bert-base-uncased',colbert_config:ColBERTConfig=None):
def __init__(self,checkpoint_name:str='bert-base-uncased',colbert_config:ColBERTConfig=ColBERTConfig()):
self.colbert_config = colbert_config
self.checkpoint_name = checkpoint_name
self.colbert_config.checkpoint = checkpoint_name
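
A note on the `colbert_config:ColBERTConfig=ColBERTConfig()` default used in both constructors: Python evaluates a default argument once, when the function is defined, so every instance created without an explicit config receives the same object, and the `self.colbert_config.checkpoint = ...` assignment then writes to that shared object. The sketch below illustrates the effect and the usual `None`-sentinel alternative. `StandInConfig`, `SharedDefaultReranker` and `FreshDefaultReranker` are hypothetical names used only for illustration; they are not part of DSPy or ColBERT, and whether the sharing matters in practice depends on how the classes are used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandInConfig:
    """Hypothetical stand-in for ColBERTConfig; only the field used here."""
    checkpoint: Optional[str] = None


class SharedDefaultReranker:
    # The default StandInConfig() is created once, at function definition time.
    def __init__(self, checkpoint_name: str = "bert-base-uncased",
                 config: StandInConfig = StandInConfig()):
        self.config = config
        self.config.checkpoint = checkpoint_name  # writes to the shared default


class FreshDefaultReranker:
    # None sentinel: a new config object is created for each instance.
    def __init__(self, checkpoint_name: str = "bert-base-uncased",
                 config: Optional[StandInConfig] = None):
        self.config = config if config is not None else StandInConfig()
        self.config.checkpoint = checkpoint_name


shared_a = SharedDefaultReranker("checkpoint-a")
shared_b = SharedDefaultReranker("checkpoint-b")
# Both instances hold the same config object, so the second call
# overwrote the checkpoint seen by the first instance.
print(shared_a.config is shared_b.config)

fresh_a = FreshDefaultReranker("checkpoint-a")
fresh_b = FreshDefaultReranker("checkpoint-b")
# Each instance keeps its own config and its own checkpoint.
print(fresh_a.config is fresh_b.config)
```

Passing an explicit config per instance avoids the sharing with either signature.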
127 changes: 127 additions & 0 deletions examples/integrations/colbert/colbert_local.ipynb
@@ -0,0 +1,127 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ColBERTConfig(query_token_id='[unused0]', doc_token_id='[unused1]', query_token='[Q]', doc_token='[D]', ncells=None, centroid_score_threshold=None, ndocs=None, load_index_with_mmap=False, index_path=None, index_bsize=64, nbits=1, kmeans_niters=4, resume=False, similarity='cosine', bsize=32, accumsteps=1, lr=3e-06, maxsteps=500000, save_every=None, warmup=None, warmup_bert=None, relu=False, nway=2, use_ib_negatives=False, reranker=False, distillation_alpha=1.0, ignore_scores=False, model_name=None, query_maxlen=32, attend_to_mask_tokens=False, interaction='colbert', dim=128, doc_maxlen=220, mask_punctuation=True, checkpoint=None, triples=None, collection=None, queries=None, index_name=None, overwrite=False, root='/home/athekunal/DSPy-contributions/dspy/examples/integrations/colbert/experiments', experiment='default', index_root=None, name='2024-04/09/19.53.42', rank=0, nranks=1, amp=True, gpus=1, avoid_fork_if_possible=False)\n"
]
}
],
"source": [
"from colbert.infra.config import ColBERTConfig\n",
"\n",
"print(ColBERTConfig())"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"query_token_id --> [unused0]\n",
"doc_token_id --> [unused1]\n",
"query_token --> [Q]\n",
"doc_token --> [D]\n",
"ncells --> None\n",
"centroid_score_threshold --> None\n",
"ndocs --> None\n",
"load_index_with_mmap --> False\n",
"index_path --> None\n",
"index_bsize --> 64\n",
"nbits --> 1\n",
"kmeans_niters --> 4\n",
"resume --> False\n",
"similarity --> cosine\n",
"bsize --> 32\n",
"accumsteps --> 1\n",
"lr --> 3e-06\n",
"maxsteps --> 500000\n",
"save_every --> None\n",
"warmup --> None\n",
"warmup_bert --> None\n",
"relu --> False\n",
"nway --> 2\n",
"use_ib_negatives --> False\n",
"reranker --> False\n",
"distillation_alpha --> 1.0\n",
"ignore_scores --> False\n",
"model_name --> None\n",
"query_maxlen --> 32\n",
"attend_to_mask_tokens --> False\n",
"interaction --> colbert\n",
"dim --> 128\n",
"doc_maxlen --> 220\n",
"mask_punctuation --> True\n",
"checkpoint --> None\n",
"triples --> None\n",
"collection --> None\n",
"queries --> None\n",
"index_name --> None\n",
"overwrite --> False\n",
"root --> /home/athekunal/DSPy-contributions/dspy/examples/integrations/colbert/experiments\n",
"experiment --> default\n",
"index_root --> None\n",
"name --> 2024-04/09/19.53.42\n",
"rank --> 0\n",
"nranks --> 1\n",
"amp --> True\n",
"gpus --> 1\n",
"avoid_fork_if_possible --> False\n",
"assigned --> {}\n"
]
}
],
"source": [
"for k,v in ColBERTConfig().__dict__.items():\n",
" print(f\"{k} --> {v}\")"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"passages = [\"It's a piece of cake.\", \"Don't put off until tomorrow what you can do today.\", 'To kill two birds with one stone.', 'Actions speak louder than words.', 'Honesty is the best policy.', 'If you want something done right, do it yourself.', 'The best things in life are free.', \"Don't count your chickens before they hatch.\", 'She sells seashells by the seashore.', 'Practice makes perfect.', \"Where there's a will, there's a way.\", 'Absence makes the heart grow fonder.', 'When the going gets tough, the tough get going.', 'A journey of a thousand miles begins with a single step.', \"You can't have your cake and eat it too.\", \"If you can't beat them, join them.\", 'Keep your friends close and your enemies closer.', \"Don't put all your eggs in one basket.\", \"All's fair in love and war.\", 'Every dog has its day.', 'All good things must come to an end.', 'Once bitten, twice shy.', \"The apple doesn't fall far from the tree.\", 'A penny saved is a penny earned.', \"Don't bite the hand that feeds you.\", 'You reap what you sow.', 'An apple a day keeps the doctor away.', \"One man's trash is another man's treasure.\", 'The squeaky wheel gets the grease.', 'A picture is worth a thousand words.', 'Fortune favors the bold.', 'Practice what you preach.', 'A watched pot never boils.', 'No pain, no gain.', \"You can't make an omelet without breaking eggs.\", \"There's no place like home.\", 'Ask and you shall receive.', 'Let sleeping dogs lie.', 'If the shoe fits, wear it.', 'Every cloud has a silver lining.', 'Look before you leap.', 'The more, the merrier.', 'The grass is always greener on the other side.', 'Beauty is only skin deep.', \"Two wrongs don't make a right.\", 'Beauty is in the eye of the beholder.', 'Necessity is the mother of invention.', 'Out of sight, out of mind.', 'Patience is a virtue.', 'Curiosity killed the cat.', \"If at first you don't succeed, try, try again.\", \"Beggars can't be choosers.\", 'Too many cooks spoil the broth.', 'Easy come, easy go.', \"Don't cry over spilled milk.\", \"There's no such thing as a free lunch.\", 'A bird in the hand is worth two in the bush.', 'Good things come to those who wait.', 'The quick brown fox jumps over the lazy dog.', 'It takes two to tango.', 'A friend in need is a friend indeed.', 'Like father, like son.', 'Let bygones be bygones.', 'Kill two birds with one stone.', 'A penny for your thoughts.', 'I am the master of my fate, I am the captain of my soul.', 'The pen is mightier than the sword.', 'When in Rome, do as the Romans do.', \"Rome wasn't built in a day.\", \"You can't judge a book by its cover.\", \"It's raining cats and dogs.\", 'Make hay while the sun shines.', \"It's better to be safe than sorry.\", 'The early bird catches the worm.', 'To be or not to be, that is the question.', 'Better late than never.']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
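
The notebook in this commit stops after defining `passages`. As a rough sketch of how the reworked constructor might be called, the snippet below moves the settings that used to be separate keyword arguments (`nbits`, `DOC_MAXLEN`, `INDEX_BSIZE`, `KMEANS_ITER`, and so on) into a single `ColBERTConfig`. The field names come from the `ColBERTConfig()` printout in the notebook and the values mirror the old constructor defaults. `CONFIG_OVERRIDES` and `build_local_retriever` are hypothetical names introduced here, and the sketch assumes `colbert-ai` and this branch of `dsp` are installed.

```python
from typing import Any, Dict, List

# Values mirror the defaults of the constructor signature this commit replaces.
CONFIG_OVERRIDES: Dict[str, Any] = {
    "index_name": "Colbert-RM",
    "experiment": "Colbert-Experiment",
    "nranks": 1,
    "nbits": 2,
    "doc_maxlen": 300,
    "index_bsize": 256,
    "kmeans_niters": 8,
}


def build_local_retriever(passages: List[str], load_only: bool = False):
    """Hypothetical helper: build (or only load) a local ColBERTv2 index."""
    # Imported lazily so this module can be imported without colbert-ai installed.
    from colbert.infra import ColBERTConfig
    from dsp.modules.colbertv2 import ColBERTv2RetrieverLocal

    # A fresh config per call, so no config object is shared between retrievers.
    config = ColBERTConfig(**CONFIG_OVERRIDES)
    return ColBERTv2RetrieverLocal(
        passages=passages,
        load_only=load_only,
        checkpoint="colbert-ir/colbertv2.0",
        colbert_config=config,
    )
```

Calling `build_local_retriever(passages)` with the list from the notebook would build the index and then load it; `load_only=True` skips the build step and expects an index with that name to exist already.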