# An Introduction to Using [Pyserini](https://pyserini.io/) for DSPy

Pyserini is a toolkit maintained by the Data Systems Group at the University of Waterloo, and you can use it to incorporate your own data into `dspy.Retrieve`. Currently, `dspy.Pyserini` supports retrieval with either your own Faiss index or one of Pyserini's [prebuilt indexes](https://github.com/castorini/pyserini/blob/master/pyserini/prebuilt_index_info.py).

## 1. Installation/Setup
Using `dspy.Pyserini` requires installing Pyserini, PyTorch, and Faiss. Pyserini can be installed with `pip install pyserini`; if you're on your own device, we'll leave it to you to decide which versions of PyTorch and Faiss to install.

On Colab, make sure to run this notebook with a GPU by going to Edit > Notebook Settings and selecting a GPU under Hardware Accelerator.

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import pkg_resources 

try:  # When on Colab, let's install Pyserini, PyTorch, and Faiss
    import google.colab
    repo_path = 'dspy'
    !git -C $repo_path pull origin || git clone https://github.com/stanfordnlp/dspy $repo_path
    %cd $repo_path
    !pip install -e .
    # Names of the distributions that are already installed
    installed = {pkg.key for pkg in pkg_resources.working_set}
    if "pyserini" not in installed:
        !pip install pyserini
    if "torch" not in installed:
        !pip install torch
    if "faiss-cpu" not in installed:
        !pip install faiss-cpu
except ImportError:  # Not on Colab
    repo_path = '.'
    installed = {pkg.key for pkg in pkg_resources.working_set}
    # Install the package if it's not installed
    if "dspy-ai" not in installed:
        !pip install -U pip
        !pip install dspy-ai
        # !pip install -e $repo_path

if repo_path not in sys.path:
    sys.path.append(repo_path)

import dspy
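
The setup cell above checks for installed packages through `pkg_resources`, which setuptools has deprecated. As a minimal alternative sketch, the same check can be written with the standard library's `importlib.metadata` (Python 3.8+). The helper names `installed_version` and `missing_packages` are hypothetical and are not part of DSPy or Pyserini.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def missing_packages(dist_names):
    """Return the subset of dist_names that is not installed (hypothetical helper)."""
    return [name for name in dist_names if installed_version(name) is None]


if __name__ == "__main__":
    # Distribution names as used in the setup cell.
    print(missing_packages(["pyserini", "torch", "faiss-cpu"]))
```

The returned list could then drive the `pip install` calls, so that only the missing distributions are installed.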



## 2. Using Pyserini's prebuilt indexes

In [2]:
pys_ret_prebuilt = dspy.Pyserini(
    index='beir-v1.0.0-nfcorpus.contriever-msmarco',
    query_encoder='facebook/contriever-msmarco',
    id_field='_id',
    text_fields=['title', 'text'],
)

dspy.settings.configure(rm=pys_ret_prebuilt)

example_question = "How Curry Can Kill Cancer Cells"

retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(example_question).passages

print(f"Top {retrieve.k} passages for question: {example_question} \n", '-' * 30, '\n')

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')


Attempting to initialize pre-built index beir-v1.0.0-nfcorpus.contriever-msmarco.
/store2/scratch/j5xian/cache/pyserini/indexes/faiss.beir-v1.0.0-nfcorpus.contriever-msmarco.20230124.657649d19fafd06cb031c6b11868d7f9 already exists, skipping download.
Initializing beir-v1.0.0-nfcorpus.contriever-msmarco...
Top 3 passages for question: How Curry Can Kill Cancer Cells 
 ------------------------------ 

1] Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Cancer is a hyperproliferative disorder that is usually treated by chemotherapeutic agents that are toxic not only to tumor cells but also to normal cells, so these agents produce major side effects. In addition, these agents are highly expensive and thus not affordable for most. Moreover, such agents cannot be used for cancer prevention. Traditional medicines are generally free of the deleterious side effects and usually inexpensive. Curcumin, a component of turmeric (Curcuma longa), is one such agent that 
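
Conceptually, dense retrieval encodes the query into a vector and returns the documents whose vectors score highest against it. The toy sketch below assumes inner-product scoring and a brute-force scan. It illustrates the idea only and is not how `dspy.Pyserini` or Faiss is implemented; the function names, document IDs, and 3-dimensional vectors are made up.

```python
import heapq


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs with the highest inner product."""
    scored = ((doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up embeddings; a real encoder produces hundreds of dimensions.
toy_index = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.2, 0.8, 0.1],
    "doc-c": [0.7, 0.3, 0.2],
}
toy_query = [1.0, 0.0, 0.0]

results = top_k(toy_query, toy_index, k=2)
print([doc_id for doc_id, _ in results])  # ['doc-a', 'doc-c']
```

In this picture, the `k=3` passed to `dspy.Retrieve` plays the role of `k`: it sets how many of the best-scoring passages are returned.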

## 3. Using your own data
As an example, we'll be using [NFCorpus](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/), a full-text learning-to-rank dataset for medical information retrieval. The corpus is quite small, so encoding, indexing, and retrieval should be tolerable on CPU.

First, let's fetch the data:

In [7]:
!wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
!unzip collections/nfcorpus.zip -d collections

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2023-09-16 12:33:31--  https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip
Resolving public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)... 130.83.167.186
Connecting to public.ukp.informatik.tu-darmstadt.de (public.ukp.informatik.tu-darmstadt.de)|130.83.167.186|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2448432 (2.3M) [application/zip]
Saving to: ‘collections/nfcorpus.zip’


2023-09-16 12:33:33 (2.64 MB/s) - ‘collections/nfcorpus.zip’ saved [2448432/2448432]


Next, we can use Pyserini to encode our data and pack it into a Faiss index:

In [5]:
!python -m pyserini.encode \
  input   --corpus collections/nfcorpus/corpus.jsonl \
          --fields title text \
  output  --embeddings indexes/faiss.nfcorpus.contriever-msmarco \
          --to-faiss \
  encoder --encoder facebook/contriever-msmarco \
          --device cuda:0 \
          --pooling mean \
          --fields title text

3633it [00:00, 78339.56it/s]
100%|███████████████████████████████████████████| 57/57 [00:17<00:00,  3.33it/s]
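
The `--pooling mean` option in the command above asks the encoder to average its per-token embeddings into a single vector per document. A minimal pure-Python sketch of mean pooling follows, assuming padding positions are excluded through an attention mask. The function name `mean_pool` and the toy numbers are illustrative, and Pyserini's own implementation operates on tensors rather than lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens followed by one padding position (made-up values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector is ignored, so the result depends only on the tokens that belong to the document.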


Now, we can use `dspy.Pyserini` to read our local Faiss index and perform retrieval. Note that using a local index requires passing in a Hugging Face `Dataset` for document lookup.

In [6]:
from datasets import load_dataset

dataset = load_dataset(path='json', data_files='collections/nfcorpus/corpus.jsonl', split='train')

pys_ret_local = dspy.Pyserini(
    index='indexes/faiss.nfcorpus.contriever-msmarco',
    query_encoder='facebook/contriever-msmarco',
    dataset=dataset,
    id_field='_id',
    text_fields=['title', 'text'],
)

dspy.settings.configure(rm=pys_ret_local)

dev_example = "How Curry Can Kill Cancer Cells"

retrieve = dspy.Retrieve(k=3)
topK_passages = retrieve(dev_example).passages

print(f"Top {retrieve.k} passages for question: {dev_example} \n", '-' * 30, '\n')

for idx, passage in enumerate(topK_passages):
    print(f'{idx+1}]', passage, '\n')


Top 3 passages for question: How Curry Can Kill Cancer Cells 
 ------------------------------ 

1] Curcumin and Cancer Cells: How Many Ways Can Curry Kill Tumor Cells Selectively? Cancer is a hyperproliferative disorder that is usually treated by chemotherapeutic agents that are toxic not only to tumor cells but also to normal cells, so these agents produce major side effects. In addition, these agents are highly expensive and thus not affordable for most. Moreover, such agents cannot be used for cancer prevention. Traditional medicines are generally free of the deleterious side effects and usually inexpensive. Curcumin, a component of turmeric (Curcuma longa), is one such agent that is safe, affordable, and efficacious. How curcumin kills tumor cells is the focus of this review. We show that curcumin modulates growth of tumor cells through regulation of multiple cell signaling pathways including cell proliferation pathway (cyclin D1, c-myc), cell survival pathway (Bcl-2, Bcl-xL, cFLIP
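
A local Faiss index stores vectors and document IDs but not the passage text, which is presumably why `dspy.Pyserini` needs the `dataset`, `id_field`, and `text_fields` arguments for document lookup. The sketch below shows one way such a lookup could work over a JSONL corpus. It assumes each record has `_id`, `title`, and `text` keys and that the text fields are joined with a single space; `build_lookup` and the sample records are hypothetical, not DSPy's implementation.

```python
import json
import os
import tempfile


def build_lookup(jsonl_path, id_field="_id", text_fields=("title", "text")):
    """Map each document ID to its text fields joined by a single space."""
    lookup = {}
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            lookup[record[id_field]] = " ".join(record[field] for field in text_fields)
    return lookup


# Two made-up records in the assumed shape of corpus.jsonl.
sample = [
    {"_id": "D1", "title": "First title", "text": "First body."},
    {"_id": "D2", "title": "Second title", "text": "Second body."},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    lookup = build_lookup(path)

print(lookup["D1"])  # First title First body.
```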