LLM2Vec

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders.

[LLM2Vec figure 1]

Installation

To use LLM2Vec, first install the llm2vec package from PyPI.

pip install llm2vec

You can also install it from source by cloning the repository and installing it in editable mode:

git clone https://github.com/McGill-NLP/llm2vec.git
cd llm2vec
pip install -e .

Getting Started

LLM2Vec is a generic wrapper that takes a model and a tokenizer. First, we load the model and tokenizer using the transformers library:

import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig

# trust_remote_code is required because the checkpoint ships custom modeling code.
config = AutoConfig.from_pretrained("McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
    trust_remote_code=True, config=config, torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp")

Then, we wrap the model and tokenizer with LLM2Vec:

from llm2vec import LLM2Vec

l2v = LLM2Vec(model, tokenizer)

This model now returns a text embedding for any input given as a list of [instruction, text] pairs. Pass an empty string as the instruction when none is needed:

inputs = [
    ["Retrieve duplicate questions from StackOverflow forum", "Python (Numpy) array sorting"],
    ["", "Sort a list in python"],
    ["", "Sort an array in Java"],
]
reps = l2v.encode(inputs, convert_to_tensor=True)  # shape: (3, hidden_dim)

# The instructed query is closer to the Python duplicate than to the Java one.
sim_pos = torch.nn.functional.cosine_similarity(reps[0].unsqueeze(0), reps[1].unsqueeze(0))  # tensor([0.5987])
sim_neg = torch.nn.functional.cosine_similarity(reps[0].unsqueeze(0), reps[2].unsqueeze(0))  # tensor([0.5585])
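For retrieval-style use, pairwise similarities between several queries and documents can be computed in one shot instead of pair by pair. A minimal sketch, reusing the l2v encoder and the inputs above (the query/document split and the variable names here are illustrative, not part of the library API):

import torch.nn.functional as F

queries = [
    ["Retrieve duplicate questions from StackOverflow forum", "Python (Numpy) array sorting"],
]
documents = [
    ["", "Sort a list in python"],
    ["", "Sort an array in Java"],
]

q_reps = l2v.encode(queries, convert_to_tensor=True)    # shape: (1, hidden_dim)
d_reps = l2v.encode(documents, convert_to_tensor=True)  # shape: (2, hidden_dim)

# Cosine-similarity matrix between every query and every document: shape (1, 2).
scores = F.cosine_similarity(q_reps.unsqueeze(1), d_reps.unsqueeze(0), dim=-1)
ranking = scores.argsort(dim=-1, descending=True)  # per-query document ranking

With the scores above, ranking[0] orders the documents by similarity to the query, which is all a simple nearest-neighbor retriever needs.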

Model List

Training

Training code will be available soon.

Bugs or questions?

If you have any questions about the code, feel free to email Parishad ([email protected]) and Vaibhav ([email protected]).