<img src="../../docs/images/DSPy8.png" alt="DSPy8 Image" height="150"/>

## DSPy: Compiling chains from `LangChain`

One of the most powerful features in **DSPy** is optimizers. **DSPy optimizers** can take any LM system and tune the prompts (or the LM weights) to maximize any objective.

Optimizers can improve the quality of your LM systems and make your code adaptive to new LMs or new data. This is meant to bring structure and modularity in place of hacky things like (i) manual prompt engineering, (ii) designing complex pipelines for generating synthetic data, or (iii) designing complex pipelines for finetuning.

In [1]:
# Install the dependencies if needed.
# %pip install -U dspy-ai
# %pip install -U openai jinja2
# %pip install -U langchain langchain-community langchain-openai langchain-core

Typically, we use DSPy optimizers with DSPy modules. But here, we've worked with [Harrison Chase](https://twitter.com/hwchase17) to make sure DSPy can also optimize chains built with the `LangChain` library.

This short tutorial demonstrates how this proof-of-concept feature works. _This will **not** give you the full power of DSPy or LangChain yet, but we will expand it if there's high demand._

If we convert this into a fuller integration, all users stand to benefit. LangChain users will gain the ability to optimize any chain with any DSPy optimizer. DSPy users will gain the ability to _export_ any DSPy program into an LCEL chain that supports streaming, tracing, and other rich production-targeted features in LangChain.

### 1) Setting Up

First, let's import `dspy` and configure the default retrieval model in it. (The language model is set up through LangChain in the next cell.)

In [2]:
import dspy

from dspy.evaluate.evaluate import Evaluate
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

colbertv2 = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.configure(rm=colbertv2)


Next, let's import `langchain` and the DSPy modules for interacting with LangChain runnables, namely, `LangChainPredict` and `LangChainModule`.

In [3]:
from langchain_openai import OpenAI
from langchain.globals import set_llm_cache
from langchain.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path="cache.db"))

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0)
retrieve = lambda x: dspy.Retrieve(k=5)(x["question"]).passages

If it's useful, we can set up some caches so you can run this whole notebook in Google Colab without any API keys. Let us know.

### 2) Defining a chain as a `LangChain` expression

For illustration, let's tackle the following task.

**Task:** Build a RAG system for generating informative tweets.
- **Input:** A factual **question**, which may be fairly complex.
- **Output:** An engaging **tweet** that correctly answers the question from the retrieved info.

Let's use LangChain's expression language (LCEL) to illustrate this. Any prompt here will do, since we will optimize the final prompt with DSPy.

Considering that, let's just keep it to the barebones: **Given {context}, answer the question {question} as a tweet.**

In [4]:
# From LangChain, import standard modules for prompting.
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Just a simple prompt for this task. It's fine if it's complex too.
prompt = PromptTemplate.from_template("Given {context}, answer the question `{question}` as a tweet.")

# This is how you'd normally build a chain with LCEL. This chain does retrieval then generation (RAG).
vanilla_chain = RunnablePassthrough.assign(context=retrieve) | prompt | llm | StrOutputParser()
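
To make the data flow of this chain concrete, here is a minimal pure-Python analogy of its four steps. It does not use LangChain at all: `fake_retrieve`, `fake_llm`, `assign`, `format_prompt`, and `parse_output` are hypothetical stand-ins written for illustration, and `assign` only imitates the way `RunnablePassthrough.assign` adds a computed key to the input dict.

```python
# Pure-Python analogy of `assign(context=retrieve) | prompt | llm | parser`.
# Every helper below is a hypothetical stand-in, not a LangChain API.

def fake_retrieve(inputs):
    # Stand-in for the retriever: returns passages for inputs["question"].
    return [f"passage about: {inputs['question']}"]

def assign(inputs, **computed):
    # Keep the original keys and add each new key, computed from the input dict.
    return {**inputs, **{name: fn(inputs) for name, fn in computed.items()}}

def format_prompt(inputs):
    # Fill the same template used in the notebook with the dict's values.
    template = "Given {context}, answer the question `{question}` as a tweet."
    return template.format(**inputs)

def fake_llm(prompt_text):
    # Stand-in for the LM call: reports the prompt size instead of generating text.
    return f"(tweet generated from a {len(prompt_text)}-character prompt)"

def parse_output(text):
    # Stand-in for the output parser: hand back a plain string.
    return str(text)

def run_chain(inputs):
    with_context = assign(inputs, context=fake_retrieve)  # retrieval step
    prompt_text = format_prompt(with_context)             # prompt step
    raw_output = fake_llm(prompt_text)                    # generation step
    return parse_output(raw_output)                       # parsing step

result = run_chain({"question": "In what region was Eddy Mazzoleni born?"})
print(result)
```

The point of the analogy is that each stage receives the previous stage's output, and that the retrieval step enriches the input dict rather than replacing it.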

### 3) Converting the chain into a **DSPy module**

Our goal is to optimize this prompt so we have a better tweet generator. DSPy optimizers can help, but they only work with DSPy modules!

For this reason, we created two new modules in DSPy: `LangChainPredict` and `LangChainModule`.

In [5]:
# From DSPy, import the modules that know how to interact with LangChain LCEL.
from dspy.predict.langchain import LangChainPredict, LangChainModule

# This is how to wrap it so it behaves like a DSPy program.
# Just replace every pattern like `prompt | llm` with `LangChainPredict(prompt, llm)`.
zeroshot_chain = RunnablePassthrough.assign(context=retrieve) | LangChainPredict(prompt, llm) | StrOutputParser()
zeroshot_chain = LangChainModule(zeroshot_chain)  # then wrap the chain in a DSPy module.

### 4) Trying the module

How good is our `LangChainModule` at this task? Well, we can ask it to generate a tweet for the following question.

In [6]:
question = "In what region was Eddy Mazzoleni born?"

zeroshot_chain.invoke({"question": question})

' Eddy Mazzoleni, Italian professional cyclist, was born in Bergamo, Italy on July 29, 1973. #cyclist #Italy #Bergamo'

Ah, that sounds about right! (It's technically not perfect: we asked for the _region_, not the city. We can do better below.)

Inspecting questions and answers manually is very important to get a sense of your system. However, a good system designer always looks to iteratively **benchmark** their work to quantify progress!

To do this, we need two things: the **metric** we want to maximize and a (tiny) **dataset** of examples for our system.

Are there pre-defined metrics for good tweets? Should we label 100,000 tweets by hand? Probably not. We can easily do something reasonable, though, until we start getting data in production!

### 5) Evaluating the module

To get started, we'll define our own simple metric and we'll borrow a bunch of questions from a QA dataset and use them here for tuning.

**What makes a good tweet?** I don't know, but in the spirit of iterative development, let's start simple!

Define a good tweet to have three properties: it should be (1) factually correct, (2) based on real sources, and (3) engaging for people.
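
One simple way to turn these three properties into a single number is to treat each as a yes/no judgment and average them. The sketch below is only an illustration under that assumption: `tweet_score` is a hypothetical helper, and the actual `metric` used in this notebook lives in `tweet_metric.py` and is not reproduced here.

```python
# Hypothetical sketch: combine three yes/no judgments into one score in [0, 1].
# In practice each judgment would come from a judge of some kind; here they are plain booleans.

def tweet_score(correct: bool, faithful: bool, engaging: bool) -> float:
    """Average of the three properties; 1.0 means all three hold."""
    checks = [correct, faithful, engaging]
    return sum(checks) / len(checks)

print(tweet_score(True, True, True))     # all three properties hold
print(tweet_score(False, False, True))   # engaging, but neither correct nor faithful
print(tweet_score(False, False, False))  # none of the properties hold
```

Averaging gives partial credit, so a tweet that is engaging but wrong still scores above zero. A stricter design would require all three properties to hold at once.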

In [7]:
# We took the liberty to define this metric and load a few examples from a standard QA dataset.
# Let's import them from `tweet_metric.py` in the same directory that contains this notebook.
from tweet_metric import metric, trainset, valset, devset

# We loaded 200, 50, and 150 examples for training, validation (tuning), and development (evaluation), respectively.
# You could load fewer (or more) and, chances are, the right DSPy optimizers will work well for many problems.
len(trainset), len(valset), len(devset)


(200, 50, 150)

Is this the right metric or the most representative set of questions? Not necessarily. But they get us started in a way we can iterate on systematically!

**Note:** Our dataset doesn't actually include any tweets! It only has questions and answers. That's OK: our metric will take care of evaluating outputs in tweet form.

Okay, let's evaluate the unoptimized "zero-shot" version of our chain, converted from our `LangChain` LCEL object.

In [8]:
evaluate = Evaluate(metric=metric, devset=devset, num_threads=8, display_progress=True, display_table=5)
evaluate(zeroshot_chain)

Average Metric: 63.999999999999986 / 150  (42.7): 100%|██████████| 150/150 [00:02<00:00, 66.08it/s]


Average Metric: 63.999999999999986 / 150  (42.7%)


| | question | answer | gold_titles | output | tweet_response | metric |
|---|---|---|---|---|---|---|
| 0 | Who was a producer who produced albums for both rock bands Juke Karten and Thirty Seconds to Mars? | Brian Virtue | {'Thirty Seconds to Mars', 'Levolution (album)'} | Brian Virtue, who has worked with bands like Jane's Addiction and Velvet Revolver, produced albums for both Juke Kartel and Thirty Seconds to Mars, showcasing... | Brian Virtue, who has worked with bands like Jane's Addiction and Velvet Revolver, produced albums for both Juke Kartel and Thirty Seconds to Mars, showcasing... | 1.0 |
| 1 | Are both the University of Chicago and Syracuse University public universities? | no | {'Syracuse University', 'University of Chicago'} | No, only Syracuse University is a public university. The University of Chicago is a private research university. #Syracuse #University #Chicago #Public #Private | No, only Syracuse University is a public university. The University of Chicago is a private research university. #Syracuse #University #Chicago #Public #Private | 0.3333333333333333 |
| 2 | In what region was Eddy Mazzoleni born? | Lombardy, northern Italy | {'Eddy Mazzoleni', 'Bergamo'} | Eddy Mazzoleni, Italian professional cyclist, was born in Bergamo, Italy on July 29, 1973. #cyclist #Italy #Bergamo | Eddy Mazzoleni, Italian professional cyclist, was born in Bergamo, Italy on July 29, 1973. #cyclist #Italy #Bergamo | 0.0 |
| 3 | Who edited the 1990 American romantic comedy film directed by Garry Marshall? | Raja Raymond Gosnell | {'Raja Gosnell', 'Pretty Woman'} | J. F. Lawton edited the 1990 American romantic comedy film directed by Garry Marshall. #PrettyWoman #GarryMarshall #JFLawton | J. F. Lawton edited the 1990 American romantic comedy film directed by Garry Marshall. #PrettyWoman #GarryMarshall #JFLawton | 0.0 |
| 4 | Burrs Country Park railway station is what stop on the railway line that runs between Heywood and Rawtenstall | seventh | {'East Lancashire Railway', 'Burrs Country Park railway station'} | Burrs Country Park railway station is the seventh stop on the East Lancashire Railway line that runs between Heywood and Rawtenstall. | Burrs Country Park railway station is the seventh stop on the East Lancashire Railway line that runs between Heywood and Rawtenstall. | 1.0 |


42.67

Okay, cool. Our `zeroshot_chain` gets about **43%** on the 150 questions from the devset.

The table above shows some examples. For instance:

- **Question**: Who was a producer who produced albums for both rock bands Juke Karten and Thirty Seconds to Mars?	
- **Tweet**: Brian Virtue, who has worked with bands like Jane's Addiction and Velvet Revolver, produced albums for both Juke Kartel and Thirty Seconds to Mars, showcasing... [truncated]
- **Metric**: 1.0 (A tweet that is correct, faithful, and engaging!*)

\* At least according to our metric, which is just a DSPy program, so _it too_ can be optimized if you'd like! That's a topic for another notebook, though.
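
As a side note on the headline number: the reported 42.67 is the sum of the per-example scores (about 64) divided by the number of examples (150), expressed as a percentage. The sketch below shows that calculation on toy data, with threads used to score examples in parallel. It is not DSPy's `Evaluate` implementation; `evaluate_average`, `toy_devset`, `always_yes`, and `exact_match` are hypothetical names introduced only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_average(program, metric_fn, examples, num_threads=8):
    """Return (total_score, percentage) for `program` over `examples`."""
    def score_one(example):
        prediction = program(example["question"])
        return metric_fn(example, prediction)

    # Score the examples in parallel; pool.map preserves input order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, examples))

    total = sum(scores)
    return total, round(100 * total / len(examples), 2)

# Toy data: an exact-match metric applied to a program that always answers "yes".
toy_devset = [
    {"question": "q1", "answer": "yes"},
    {"question": "q2", "answer": "no"},
    {"question": "q3", "answer": "yes"},
    {"question": "q4", "answer": "no"},
]

def always_yes(question):
    return "yes"

def exact_match(example, prediction):
    return float(example["answer"] == prediction)

total, percent = evaluate_average(always_yes, exact_match, toy_devset)
print(total, percent)

# The same arithmetic applied to the figures reported above: 64 points over 150 examples.
reported_percent = round(100 * 64 / 150, 2)
print(reported_percent)
```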

### 6) Optimizing the module

DSPy has many optimizers, but the de facto default one currently is `BootstrapFewShotWithRandomSearch`.

**If you're curious how it works:** This optimizer works by running your program (in this case, `zeroshot_chain`) on `trainset` questions. Each time it runs, DSPy will remember the input and output of each LM call. These are called traces, and this particular optimizer will keep track of "good" traces (i.e., ones that the metric likes). Then, this optimizer will try to find good ways to leverage these traces as automatic few-shot examples. It will try them out, seeking to maximize the average metric on `valset`. There are many ways to self-generate (bootstrap) examples. There are also many ways to optimize their selection (here, with random search). That's why there are several other optimizers in DSPy.
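
The idea described above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not DSPy's implementation: `bootstrap_random_search`, `toy_program`, and `toy_metric` are hypothetical, and the "program" is a fake that only gets the output format right on easy inputs unless it is shown at least one demo.

```python
import random

def bootstrap_random_search(program, metric_fn, train_examples, val_examples,
                            max_demos=3, num_candidates=3, seed=0):
    """Toy version: collect traces the metric likes, then random-search over demo subsets."""
    rng = random.Random(seed)

    # 1) Bootstrap: run the program without demos and keep the traces that pass the metric.
    good_traces = []
    for example in train_examples:
        output = program(example["question"], demos=[])
        if metric_fn(example, output):
            good_traces.append((example["question"], output))

    def average_score(demos):
        scores = [metric_fn(ex, program(ex["question"], demos=demos)) for ex in val_examples]
        return sum(scores) / len(val_examples)

    # 2) Random search: start from the zero-shot baseline, then try random demo subsets.
    best_demos, best_score = [], average_score([])
    for _ in range(num_candidates):
        k = rng.randint(1, max_demos)
        demos = rng.sample(good_traces, min(k, len(good_traces)))
        score = average_score(demos)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score

def toy_program(question, demos):
    # Fake LM: uppercases correctly only on short inputs, unless it has demos to imitate.
    if demos or len(question) <= 3:
        return question.upper()
    return question

def toy_metric(example, output):
    return float(output == example["answer"])

toy_trainset = [{"question": w, "answer": w.upper()} for w in ["cat", "horse", "dog", "zebra"]]
toy_valset = [{"question": w, "answer": w.upper()} for w in ["owl", "tiger", "lion", "ant"]]

best_demos, best_score = bootstrap_random_search(toy_program, toy_metric, toy_trainset, toy_valset)
print(best_demos, best_score)
```

In this toy setup the zero-shot baseline gets half of the validation examples right, and any candidate with at least one bootstrapped demo gets all of them right, so the search keeps a demo set. Real programs are far noisier, which is why several candidates are evaluated on `valset`.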

In [9]:
# Set up the optimizer. We'll use very minimal hyperparameters for this example.
# Just do random search with ~3 attempts, and in each attempt, bootstrap <= 3 traces.
optimizer = BootstrapFewShotWithRandomSearch(metric=metric, max_bootstrapped_demos=3, num_candidate_programs=3)

# Now use the optimizer to *compile* the chain. This could take 5-10 minutes, unless it's cached.
optimized_chain = optimizer.compile(zeroshot_chain, trainset=trainset, valset=valset)

Going to sample between 1 and 3 traces per predictor.
Will attempt to train 3 candidate sets.


Average Metric: 22.333333333333336 / 50  (44.7): 100%|██████████| 50/50 [00:00<00:00, 55.47it/s]


Average Metric: 22.333333333333336 / 50  (44.7%)
Score: 44.67 for set: [0]
New best score: 44.67 for seed -3
Scores so far: [44.67]
Best score: 44.67


Average Metric: 22.333333333333336 / 50  (44.7): 100%|██████████| 50/50 [00:00<00:00, 166.70it/s]


Average Metric: 22.333333333333336 / 50  (44.7%)
Score: 44.67 for set: [16]
Scores so far: [44.67, 44.67]
Best score: 44.67


  2%|▎         | 5/200 [00:00<00:07, 26.88it/s]


Bootstrapped 3 full traces after 6 examples in round 0.


Average Metric: 27.000000000000004 / 50  (54.0): 100%|██████████| 50/50 [00:00<00:00, 72.21it/s]


Average Metric: 27.000000000000004 / 50  (54.0%)
Score: 54.0 for set: [16]
New best score: 54.0 for seed -1
Scores so far: [44.67, 44.67, 54.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.5933333333333334
Average of max per entry across top 3 scores: 0.5933333333333334
Average of max per entry across top 5 scores: 0.5933333333333334
Average of max per entry across top 8 scores: 0.5933333333333334
Average of max per entry across top 9999 scores: 0.5933333333333334


  4%|▍         | 9/200 [00:00<00:06, 28.04it/s]


Bootstrapped 2 full traces after 10 examples in round 0.


Average Metric: 25.000000000000007 / 50  (50.0): 100%|██████████| 50/50 [00:00<00:00, 70.71it/s]


Average Metric: 25.000000000000007 / 50  (50.0%)
Score: 50.0 for set: [16]
Scores so far: [44.67, 44.67, 54.0, 50.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.5933333333333334
Average of max per entry across top 3 scores: 0.6066666666666667
Average of max per entry across top 5 scores: 0.6066666666666667
Average of max per entry across top 8 scores: 0.6066666666666667
Average of max per entry across top 9999 scores: 0.6066666666666667


  0%|          | 1/200 [00:00<00:07, 28.24it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 25.666666666666664 / 50  (51.3): 100%|██████████| 50/50 [00:00<00:00, 75.37it/s]


Average Metric: 25.666666666666664 / 50  (51.3%)
Score: 51.33 for set: [16]
Scores so far: [44.67, 44.67, 54.0, 50.0, 51.33]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.5800000000000001
Average of max per entry across top 3 scores: 0.6133333333333334
Average of max per entry across top 5 scores: 0.6266666666666667
Average of max per entry across top 8 scores: 0.6266666666666667
Average of max per entry across top 9999 scores: 0.6266666666666667


  1%|          | 2/200 [00:00<00:07, 27.81it/s]


Bootstrapped 1 full traces after 3 examples in round 0.


Average Metric: 26.0 / 50  (52.0): 100%|██████████| 50/50 [00:00<00:00, 73.67it/s]              


Average Metric: 26.0 / 50  (52.0%)
Score: 52.0 for set: [16]
Scores so far: [44.67, 44.67, 54.0, 50.0, 51.33, 52.0]
Best score: 54.0
Average of max per entry across top 1 scores: 0.54
Average of max per entry across top 2 scores: 0.5733333333333335
Average of max per entry across top 3 scores: 0.6133333333333334
Average of max per entry across top 5 scores: 0.64
Average of max per entry across top 8 scores: 0.64
Average of max per entry across top 9999 scores: 0.64
6 candidate programs found.


### 7) Evaluating the optimized chain

Well, how good is this? _Not every optimization run will magically result in improvement on unseen examples!_ So let's check!

First let's ask that question from above.

In [10]:
question = "In what region was Eddy Mazzoleni born?"

optimized_chain.invoke({"question": question})

' Eddy Mazzoleni was born in Bergamo, a city in the Lombardy region of Italy. #EddyMazzoleni #Italy #Lombardy'

Nice! Anecdotally, it appears a bit more precise than the answer from `zeroshot_chain`. But now let's do some proper evals!

In [11]:
evaluate(optimized_chain)

Average Metric: 78.66666666666667 / 150  (52.4): 100%|██████████| 150/150 [00:02<00:00, 72.64it/s] 

Average Metric: 78.66666666666667 / 150  (52.4%)





| | question | answer | gold_titles | output | tweet_response | metric |
|---|---|---|---|---|---|---|
| 0 | Who was a producer who produced albums for both rock bands Juke Karten and Thirty Seconds to Mars? | Brian Virtue | {'Thirty Seconds to Mars', 'Levolution (album)'} | Brian Virtue is a producer who has worked with both Juke Kartel and Thirty Seconds to Mars, helping to create their unique sounds. #BrianVirtue #producer... | Brian Virtue is a producer who has worked with both Juke Kartel and Thirty Seconds to Mars, helping to create their unique sounds. #BrianVirtue #producer... | 1.0 |
| 1 | Are both the University of Chicago and Syracuse University public universities? | no | {'Syracuse University', 'University of Chicago'} | Yes, both Northeastern Illinois University and Syracuse University are public universities. #publicuniversity #Chicago #Syracuse | Yes, both Northeastern Illinois University and Syracuse University are public universities. #publicuniversity #Chicago #Syracuse | 0.0 |
| 2 | In what region was Eddy Mazzoleni born? | Lombardy, northern Italy | {'Eddy Mazzoleni', 'Bergamo'} | Eddy Mazzoleni was born in Bergamo, a city in the Lombardy region of Italy. #EddyMazzoleni #Italy #Lombardy | Eddy Mazzoleni was born in Bergamo, a city in the Lombardy region of Italy. #EddyMazzoleni #Italy #Lombardy | 1.0 |
| 3 | Who edited the 1990 American romantic comedy film directed by Garry Marshall? | Raja Raymond Gosnell | {'Raja Gosnell', 'Pretty Woman'} | Garry Marshall directed and edited the 1990 American romantic comedy film "Pretty Woman", starring Richard Gere and Julia Roberts. #PrettyWoman #GarryMarshall #RomanticComedy | Garry Marshall directed and edited the 1990 American romantic comedy film "Pretty Woman", starring Richard Gere and Julia Roberts. #PrettyWoman #GarryMarshall #RomanticComedy | 0.0 |
| 4 | Burrs Country Park railway station is what stop on the railway line that runs between Heywood and Rawtenstall | seventh | {'East Lancashire Railway', 'Burrs Country Park railway station'} | Burrs Country Park railway station is the seventh stop on the East Lancashire Railway line, which runs between Heywood and Rawtenstall. #EastLancashireRailway #BurrsCountryPark #railwaystation | Burrs Country Park railway station is the seventh stop on the East Lancashire Railway line, which runs between Heywood and Rawtenstall. #EastLancashireRailway #BurrsCountryPark #railwaystation | 1.0 |


52.44

We started with `zeroshot_chain` at **43%** and now we have **52%**. That's a nice **21%** relative improvement. Not bad!
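
For reference, the relative improvement is the gain divided by the starting score. The short sketch below computes it twice: once from the rounded scores quoted in the text (43 and 52), and once from the unrounded scores printed by the two evaluation cells (42.67 and 52.44). `relative_improvement` is a hypothetical helper written for this illustration.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, as a percentage."""
    return 100 * (after - before) / before

# Rounded scores, as quoted in the text.
rounded_gain = round(relative_improvement(43, 52), 1)
print(rounded_gain)

# Unrounded scores, as printed by the evaluation cells.
exact_gain = round(relative_improvement(42.67, 52.44), 1)
print(exact_gain)
```

The rounded scores give about 21%, while the unrounded scores give about 23%, so the quoted figure is slightly conservative.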

### 8) Inspecting the optimized chain in action

In [12]:
prompt, output = dspy.settings.langchain_history[-4]

print('PROMPT:\n\n', prompt)
print('\n\nOUTPUT:\n\n', output)

PROMPT:

 Essential Instructions: Respond to the provided question based on the given context in the style of a tweet, which typically requires a concise and engaging answer within the character limit of a tweet (280 characters).

---

Follow the following format.

Context: ${context}
Question: ${question}
Tweet Response: ${tweet_response}

---

Context:
[1] «Candace Kita | Kita's first role was as a news anchor in the 1991 movie "Stealth Hunters". Kita's first recurring television role was in Fox's "Masked Rider", from 1995 to 1996. She appeared as a series regular lead in all 40 episodes. Kita also portrayed a frantic stewardess in a music video directed by Mark Pellington for the British group, Catherine Wheel, titled, "Waydown" in 1995. In 1996, Kita also appeared in the film "Barb Wire" (1996) and guest starred on "The Wayans Bros.". She also guest starred in "Miriam Teitelbaum: Homicide" with "Saturday Night Live" alumni Nora Dunn, "Wall To Wall Records" with Jordan Bridges, "Eve

#### Acknowledgements:

Thanks to [Harrison Chase](https://twitter.com/hwchase17) for co-leading this new integration. Thanks to our own [Arnav Singhvi](https://arnavsinghvi11.github.io/) for helping cook up this tweet generation task and for the insight about how to get data to use here.