[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mistralai/cookbook/blob/main/pinecone_rag.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/mistralai/cookbook/blob/main/pinecone_rag.ipynb)

# RAG with Mistral

To begin, we set up our prerequisite libraries.

In [1]:
!pip install -qU \
    datasets==2.14.5 \
    mistralai==0.1.8 \
    pinecone-client==4.1.0


## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv2-semantic-chunks`](https://huggingface.co/datasets/jamescalam/ai-arxiv2-semantic-chunks) contains scraped data from many popular ArXiv papers centred around LLMs and GenAI.

In [25]:
from datasets import load_dataset

data = load_dataset(
    "jamescalam/ai-arxiv2-semantic-chunks",
    split="train[:10000]"
)
data

Dataset({
    features: ['id', 'title', 'content', 'prechunk_id', 'postchunk_id', 'arxiv_id', 'references'],
    num_rows: 10000
})

The full dataset contains roughly 200K chunks, of which we load the first 10,000. Each chunk is roughly 1-2 paragraphs long. Here is an example of a single record:

In [26]:
data[0]

{'id': '2401.04088#0',
 'title': 'Mixtral of Experts',
 'content': '4 2 0 2 n a J 8 ] G L . s c [ 1 v 8 8 0 4 0 . 1 0 4 2 : v i X r a # Mixtral of Experts Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed Abstract We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts

Next we reshape each record into the structure we need: an `id` and a `metadata` dictionary holding the `title` and `content` (the text we will embed).

In [27]:
data = data.map(lambda x: {
    "id": x["id"],
    "metadata": {
        "title": x["title"],
        "content": x["content"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "content", "prechunk_id",
    "postchunk_id", "arxiv_id", "references"
])
data

Dataset({
    features: ['id', 'metadata'],
    num_rows: 10000
})

We need an embedding model to create the vectors used for retrieval. For that we will use Mistral AI's `mistral-embed`. This model is not free to call, so be aware of the cost (running this notebook as written costs less than $1).

In [28]:
import os
from mistralai.client import MistralClient
import getpass  # console.mistral.ai/api-keys/

# get API key from left navbar in Mistral console
mistral_api_key = os.getenv("MISTRAL_API_KEY") or getpass.getpass("Enter your Mistral API key: ")

# initialize client
mistral = MistralClient(api_key=mistral_api_key)

We can create embeddings now like so:

In [29]:
embed_model = "mistral-embed"

embeds = mistral.embeddings(
    model=embed_model, input=["this is a test"]
)

We can view the dimensionality of our returned embeddings, which we'll need soon when initializing our vector index:

In [30]:
dims = len(embeds.data[0].embedding)
dims

1024

Now we create the vector DB that will store our vectors. For this we need a [free Pinecone API key](https://app.pinecone.io), which is available under "API Keys" in the left navbar of the Pinecone dashboard.

In [31]:
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass("Enter your Pinecone API key: ")

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which defines the cloud provider and region where the index will be deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [32]:
from pinecone import ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-west-2"
)

When creating the index, we set `dimension` to the dimensionality of `mistral-embed` (`1024`) and choose a `metric` that is compatible with `mistral-embed` (either `cosine` or `dotproduct`). We also pass our `spec` to the index initialization.

In [33]:
import time

index_name = "mistral-rag"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]

# check if index already exists (it shouldn't if this is first time)
if index_name not in existing_indexes:
    # if does not exist, create index
    pc.create_index(
        index_name,
        dimension=dims,  # dimensionality of mistral-embed
        metric='dotproduct',
        spec=spec
    )
    # wait for index to be initialized
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pc.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1024,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 256}},
 'total_vector_count': 256}
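
The two metrics mentioned above produce the same ranking when vectors have unit length, because cosine similarity is the dot product divided by the two vector norms. The sketch below illustrates this with toy three-dimensional vectors. Whether `mistral-embed` output is unit-normalized is an assumption here, not something this notebook verifies; it can be checked by computing the norm of a returned embedding.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# toy vectors standing in for real 1024-dimensional embeddings
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

# for unit-length vectors the two scores coincide (up to float rounding)
print(round(dot(a, b), 6), round(cosine(a, b), 6))
```

If the embeddings turn out not to be normalized, `dotproduct` also rewards vector magnitude, so the two metrics can rank results differently.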

Next we define an embedding function that halves the batch size whenever the API rejects a request, which keeps us from sending too many tokens in a single embedding batch (as of 21 May 2024 the limit is `16384` tokens).

In [43]:
from mistralai.exceptions import MistralAPIException

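def embed(metadata: list[dict]) -> list[list[float]]:
    """Embed title + content per record, halving the batch size on API errors."""
    if not metadata:
        return []
    batch_size = len(metadata)
    last_error = None
    # keep retrying down to a batch size of 1
    while batch_size >= 1:
        try:
            embeds = []
            for j in range(0, len(metadata), batch_size):
                j_end = min(len(metadata), j+batch_size)
                embed_response = mistral.embeddings(
                    input=[
                        x["title"]+"\n"+x["content"] for x in metadata[j:j_end]
                    ],
                    model=embed_model
                )
                embeds.extend([x.embedding for x in embed_response.data])
            return embeds
        except MistralAPIException as e:
            # `e` is unbound once the except block exits, so keep a reference
            last_error = e
            batch_size = batch_size // 2
            print(f"Hit MistralAPIException, attempting {batch_size=}")
    # every batch size failed, surface the last API error
    raise last_error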
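
The retry logic above can be tried without any API calls. The following is a minimal sketch in which `fake_embed_call` and `BatchTooLarge` are hypothetical stand-ins for the Mistral client and its exception. For simplicity the fake limit counts items rather than tokens, unlike the real API limit.

```python
class BatchTooLarge(Exception):
    """Hypothetical stand-in for an API error raised on oversized requests."""

def fake_embed_call(texts: list[str], max_items: int = 4) -> list[list[float]]:
    # hypothetical embedder: rejects requests with more than `max_items` texts
    if len(texts) > max_items:
        raise BatchTooLarge(f"{len(texts)} texts exceeds limit of {max_items}")
    # return a trivial one-dimensional "embedding" per text
    return [[float(len(t))] for t in texts]

def embed_with_backoff(texts: list[str]) -> list[list[float]]:
    """Embed all texts, halving the batch size each time a request is rejected."""
    batch_size = len(texts)
    last_error = None
    while batch_size >= 1:
        try:
            vectors = []
            for start in range(0, len(texts), batch_size):
                vectors.extend(fake_embed_call(texts[start:start + batch_size]))
            return vectors
        except BatchTooLarge as err:
            last_error = err  # keep a reference; `err` is cleared after the block
            batch_size //= 2
    if last_error is None:
        return []  # nothing to embed
    raise last_error

texts = [f"chunk {i}" for i in range(10)]
vectors = embed_with_backoff(texts)
print(len(vectors))
```

With ten texts and a limit of four, the sketch tries batch sizes 10 and 5 before succeeding at 2. Each failed attempt discards the partial results and starts over, which is simple but repeats some work.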


A freshly created index is empty, with a `total_vector_count` of `0` (the stats shown earlier report 256 vectors because that index had already been partly populated). We can begin populating it with embeddings built by `mistral-embed` like so:

**⚠️ WARNING: Embedding the full dataset cost ~$5.70 as of 3 Jan 2024**

In [46]:
from tqdm.auto import tqdm

batch_size = 32  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    # find end of batch
    i_end = min(len(data), i+batch_size)
    # create batch
    batch = data[i:i_end]
    # create embeddings
    embeds = embed(batch["metadata"])
    assert len(embeds) == (i_end-i)
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/313 [00:00<?, ?it/s]

Hit MistralAPIException, attempting batch_size=16
Hit MistralAPIException, attempting batch_size=16
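
To make the batching arithmetic and the upsert payload concrete, here is a small sketch using toy records and placeholder vectors instead of real embeddings. The ids, metadata values, and the single-element vectors are illustrative only.

```python
import math

# toy records standing in for the prepared dataset (values are illustrative)
ids = [f"doc#{i}" for i in range(70)]
metadata = [{"title": "t", "content": f"text {i}"} for i in range(70)]
batch_size = 32

batches = []
for i in range(0, len(ids), batch_size):
    # the final batch may be shorter than batch_size
    i_end = min(len(ids), i + batch_size)
    # placeholder vectors; in the notebook these come from embed(...)
    vectors = [[0.0] for _ in range(i, i_end)]
    # each item is an (id, values, metadata) tuple
    batches.append(list(zip(ids[i:i_end], vectors, metadata[i:i_end])))

print([len(b) for b in batches])   # size of each batch
print(math.ceil(10_000 / 32))      # number of batches for 10,000 records
```

The last line shows why 10,000 records at a batch size of 32 take 313 iterations: 312 full batches plus one final batch of 16.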


Now let's test retrieval!

In [58]:
def get_docs(query: str, top_k: int) -> list[str]:
    # encode query
    xq = mistral.embeddings(
        input=[query],
        model=embed_model
    ).data[0].embedding
    # search pinecone index
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # get doc text
    docs = [x["metadata"]['content'] for x in res["matches"]]
    return docs

In [59]:
query = "can you tell me about mistral LLM?"
docs = get_docs(query, top_k=5)
print("\n---\n".join(docs))

[25] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. [26] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin.
---
Mistral 7B outperforms the previous best 13B model (Llama 2, [26]) across all tested benchmarks, and surpasses the best 34B model (LLaMa 34B, [25]) in mathematics and code generation. Furthermore, Mistral 7B approaches the coding performance of Code-Llama 7B [20], without sacrificing performance on non-code related benchmarks. Mistral
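
Conceptually, the query step scores the query vector against stored vectors and returns the metadata of the highest-scoring matches. The sketch below does this by brute force over a tiny in-memory list using dot-product scores. It is an illustration of the idea only, not how Pinecone implements search, and `top_k_search` is a hypothetical helper.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_search(query_vec: list[float], records: list[dict], top_k: int) -> list[dict]:
    """Score every record against the query and return the best `top_k`."""
    scored = [
        {
            "id": rec["id"],
            "score": dot(query_vec, rec["values"]),
            "metadata": rec["metadata"],
        }
        for rec in records
    ]
    # highest score first
    scored.sort(key=lambda m: m["score"], reverse=True)
    return scored[:top_k]

# toy 2-dimensional vectors; the real embeddings here have 1024 dimensions
records = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"content": "about topic A"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"content": "about topic B"}},
    {"id": "c", "values": [0.6, 0.8], "metadata": {"content": "mostly topic B"}},
]

matches = top_k_search([0.0, 1.0], records, top_k=2)
docs = [m["metadata"]["content"] for m in matches]
print(docs)
```

As in `get_docs`, only the `content` field of each match is kept for the generation step.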

Our retrieval component works. Now let's feed the retrieved documents into the Mistral Large LLM to produce an answer.

In [61]:
from mistralai.models.chat_completion import ChatMessage


def generate(query: str, docs: list[str]):
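    # the explicit `+` matters: without it the adjacent string literals
    # would merge into the separator passed to .join()
    system_message = (
        "You are a helpful assistant that answers questions about AI using the "
        "context provided below.\n\n"
        "CONTEXT:\n"
        + "\n---\n".join(docs)
    )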
    messages = [
        ChatMessage(role="system", content=system_message),
        ChatMessage(role="user", content=query)
    ]
    # generate response
    chat_response = mistral.chat(
        model="mistral-large-latest",
        messages=messages
    )
    return chat_response.choices[0].message.content
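
One detail worth noting when assembling a system message from string literals and a joined list: Python merges adjacent string literals before it applies `.join`, so a header placed directly before the separator literal becomes part of the separator. The short sketch below, using two made-up document strings, contrasts the two forms.

```python
docs = ["first chunk", "second chunk"]

# adjacent literals merge first, so the header ends up inside the separator
as_separator = (
    "CONTEXT:\n"
    "\n---\n".join(docs)
)

# an explicit `+` keeps the header at the top and joins docs with the divider
header_first = (
    "CONTEXT:\n"
    + "\n---\n".join(docs)
)

print(repr(as_separator))
print(repr(header_first))
```

With a single retrieved document the first form drops the header entirely, since a separator is only inserted between items.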

In [63]:
out = generate(query=query, docs=docs)
print(out)

Mistral 7B is a 7-billion-parameter language model engineered for superior performance and efficiency. It outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation. Mistral 7B leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.

The model also demonstrates strong performance in reasoning, comprehension, and STEM reasoning benchmarks, often mirroring the performance of larger models. However, its performance on knowledge benchmarks is somewhat lower, likely due to its limited parameter count.

Mistral 7B has also been fine-tuned on instruction datasets to create Mistral 7B – Instruct, which exhibits superior performance compared to all 7B models on MT-Bench, and is comparable to 13B – Chat models.

The Mistral 7B model and its fine-tuned varian

Don't forget to delete your index when you're done to save resources!

In [64]:
pc.delete_index(index_name)

---