# LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
To use LLM2Vec, first install the llm2vec package from PyPI:

```bash
pip install llm2vec
```
You can also install it from source by cloning the repository and running:

```bash
pip install -e .
```
LLM2Vec is a generic wrapper that takes a `model` and a `tokenizer`. First, we load the model and tokenizer using the `transformers` library:
```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig

# Load the configuration, weights, and tokenizer from the Hugging Face Hub.
config = AutoConfig.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
    trust_remote_code=True,
    config=config,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp")
```
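Optionally, you can move the model to a GPU before encoding. This is a minimal sketch; the CUDA device here is an assumption about your setup:

```python
# Assumes a CUDA-capable GPU is available; falls back to CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```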
Then, we define our LLM2Vec model as follows:
```python
from llm2vec import LLM2Vec

l2v = LLM2Vec(model, tokenizer)
```
This model now returns the text embedding for any input in the form `[[instruction, text]]`.
```python
inputs = [
    # Each input pairs an instruction with a text; use an empty string for no instruction.
    ["Retrieve duplicate questions from StackOverflow forum", "Python (Numpy) array sorting"],
    ["", "Sort a list in python"],
    ["", "Sort an array in Java"],
]
reps = l2v.encode(inputs, convert_to_tensor=True)

# The instructed query is more similar to the Python question than to the Java one.
sim_pos = torch.nn.functional.cosine_similarity(reps[0].unsqueeze(0), reps[1].unsqueeze(0))  # tensor([0.5987])
sim_neg = torch.nn.functional.cosine_similarity(reps[0].unsqueeze(0), reps[2].unsqueeze(0))  # tensor([0.5585])
```
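To score more than one pair at a time, the embeddings can be compared jointly. The sketch below (our addition, not part of the llm2vec API) normalizes the embeddings to unit length and computes the full cosine-similarity matrix with a single matrix multiplication:

```python
import torch.nn.functional as F

# Normalize every embedding to unit length; the matrix product of the
# normalized embeddings then gives all pairwise cosine similarities.
normalized = F.normalize(reps, p=2, dim=1)
similarity_matrix = normalized @ normalized.T  # shape: (3, 3)
print(similarity_matrix)
```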
Training code will be available soon.
If you have any questions about the code, feel free to email Parishad ([email protected]) and Vaibhav ([email protected]).