Vectorizers#

HFTextVectorizer#

class HFTextVectorizer(model='sentence-transformers/all-mpnet-base-v2', *, dims)[source]#

Bases: BaseVectorizer

The HFTextVectorizer class uses Hugging Face’s Sentence Transformers to generate text embeddings. Because the model runs locally, this vectorizer is a good fit when you want to generate embeddings on your own hardware at no API cost.

To use this vectorizer, specify a pre-trained model from Hugging Face’s collection of Sentence Transformers. These models are trained on a variety of datasets and tasks, so you can pick one that matches your text embedding needs. Additionally, make sure the sentence-transformers library is installed with pip install sentence-transformers==2.2.2.

# Embedding a single text
vectorizer = HFTextVectorizer(model="sentence-transformers/all-mpnet-base-v2")
embedding = vectorizer.embed("Hello, world!")

# Embedding a batch of texts
embeddings = vectorizer.embed_many(["Hello, world!", "How are you?"], batch_size=2)

Initialize the Hugging Face text vectorizer.

Parameters:
  • model (str) – The pre-trained model from Hugging Face’s Sentence Transformers to be used for embedding. Defaults to ‘sentence-transformers/all-mpnet-base-v2’.

  • dims (int)

Raises:
  • ImportError – If the sentence-transformers library is not installed.

  • ValueError – If there is an error setting the embedding model dimensions.

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the Hugging Face sentence transformer.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.
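
The as_buffer flag above is described as converting the raw embedding to a byte string. As a rough illustration of what such a conversion can look like, the sketch below packs a list of floats into little-endian 32-bit floats using only the standard library. The float32 layout and the helper names to_float32_buffer and from_float32_buffer are assumptions made for this example; they are not a statement of how the library encodes its buffers.

```python
import struct


def to_float32_buffer(embedding):
    """Pack a list of floats into a little-endian float32 byte string (assumed layout)."""
    return struct.pack(f"<{len(embedding)}f", *embedding)


def from_float32_buffer(buffer):
    """Unpack a float32 byte string back into a list of Python floats."""
    count = len(buffer) // 4  # 4 bytes per float32 value
    return list(struct.unpack(f"<{count}f", buffer))


# Values chosen so that they are exactly representable in float32.
embedding = [0.5, 0.25, -1.0]
buffer = to_float32_buffer(embedding)
restored = from_float32_buffer(buffer)
```

A byte string like this is compact to store, while the List[float] form is easier to inspect and post-process.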

embed_many(texts, preprocess=None, batch_size=1000, as_buffer=False, **kwargs)[source]#

Embed many chunks of texts using the Hugging Face sentence transformer.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 1000.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.
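
To make the batch_size and preprocess parameters more concrete, here is a minimal sketch of how a list of texts might be split into batches, with an optional preprocessing callable applied to each text first. The helper iter_batches is hypothetical and written only for illustration; the library's internal batching may differ.

```python
def iter_batches(texts, batch_size, preprocess=None):
    """Yield successive batches of at most batch_size texts.

    If a preprocess callable is given, it is applied to every text
    before the batch is yielded (an assumption about ordering).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        if preprocess is not None:
            batch = [preprocess(text) for text in batch]
        yield batch


batches = list(iter_batches(["a", "b", "c", "d", "e"], batch_size=2))
stripped = list(iter_batches(["  x  "], batch_size=10, preprocess=str.strip))
```

With five texts and batch_size=2, the final batch holds the single leftover text.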

OpenAITextVectorizer#

class OpenAITextVectorizer(model='text-embedding-ada-002', api_config=None)[source]#

Bases: BaseVectorizer

The OpenAITextVectorizer class utilizes OpenAI’s API to generate embeddings for text data.

This vectorizer is designed to interact with OpenAI’s embeddings API, requiring an API key for authentication. The key can be provided directly in the api_config dictionary or through the OPENAI_API_KEY environment variable. Users must obtain an API key from OpenAI’s website (https://api.openai.com/). Additionally, the openai python client must be installed with pip install openai>=1.13.0.

The vectorizer supports both synchronous and asynchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

# Synchronous embedding of a single text
vectorizer = OpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={"api_key": "your_api_key"} # OR set OPENAI_API_KEY in your env
)
embedding = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"],
    batch_size=2
)

Initialize the OpenAI vectorizer.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘text-embedding-ada-002’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key and any additional OpenAI API options. Defaults to None.

Raises:
  • ImportError – If the openai library is not installed.

  • ValueError – If the OpenAI API key is not provided.
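
The key lookup described above (api_config first, then the OPENAI_API_KEY environment variable, otherwise a ValueError) can be sketched as follows. The helper resolve_api_key and the precedence of api_config over the environment are assumptions for illustration; the source only states that either location may hold the key.

```python
import os


def resolve_api_key(api_config=None, env_var="OPENAI_API_KEY", environ=None):
    """Return the API key from api_config, falling back to an environment variable.

    The environ argument exists only to make the sketch easy to test;
    it defaults to the real process environment.
    """
    environ = os.environ if environ is None else environ
    api_key = (api_config or {}).get("api_key") or environ.get(env_var)
    if not api_key:
        raise ValueError(
            f"API key is required: pass it in api_config or set {env_var}."
        )
    return api_key


from_config = resolve_api_key({"api_key": "abc"}, environ={})
from_env = resolve_api_key(None, environ={"OPENAI_API_KEY": "xyz"})
```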

async aembed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Asynchronously embed a chunk of text using the OpenAI API.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.

async aembed_many(texts, preprocess=None, batch_size=1000, as_buffer=False, **kwargs)[source]#

Asynchronously embed many chunks of texts using the OpenAI API.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 1000.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.
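
The asynchronous methods are coroutines, so they must be awaited inside an event loop. The sketch below shows one way to drive such a call from ordinary synchronous code with asyncio.run. StubAsyncVectorizer is a hypothetical stand-in that returns fake one-dimensional embeddings so that the example runs without network access or an API key; it is not part of the library.

```python
import asyncio


class StubAsyncVectorizer:
    """Hypothetical stand-in exposing an aembed_many coroutine."""

    async def aembed_many(self, texts, batch_size=1000):
        await asyncio.sleep(0)  # yield control, as a real network call would
        # Fake embedding: a single float holding the text length.
        return [[float(len(text))] for text in texts]


async def main():
    vectorizer = StubAsyncVectorizer()
    return await vectorizer.aembed_many(
        ["Hello, world!", "How are you?"],
        batch_size=2,
    )


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
embeddings = asyncio.run(main())
```

Inside an already running event loop (for example, a notebook or an async web handler), await the coroutine directly instead of calling asyncio.run.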

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the OpenAI API.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.
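
The preprocess parameter accepts a callable that runs before vectorization. A minimal sketch of such a callable is shown below, assuming it receives a string and returns a string. The function name normalize and its specific cleanup steps are illustrative choices, not something the source prescribes.

```python
import re


def normalize(text):
    """Collapse runs of whitespace, trim the ends, and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


cleaned = normalize("  Hello,\n  World!  ")

# Hypothetical usage with a vectorizer instance:
# embedding = vectorizer.embed("  Hello,\n  World!  ", preprocess=normalize)
```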

embed_many(texts, preprocess=None, batch_size=10, as_buffer=False, **kwargs)[source]#

Embed many chunks of texts using the OpenAI API.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 10.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.

AzureOpenAITextVectorizer#

class AzureOpenAITextVectorizer(model='text-embedding-ada-002', api_config=None)[source]#

Bases: BaseVectorizer

The AzureOpenAITextVectorizer class utilizes AzureOpenAI’s API to generate embeddings for text data.

This vectorizer is designed to interact with AzureOpenAI’s embeddings API, requiring an API key, an AzureOpenAI deployment endpoint and API version. These values can be provided directly in the api_config dictionary with the parameters ‘azure_endpoint’, ‘api_version’ and ‘api_key’ or through the environment variables ‘AZURE_OPENAI_ENDPOINT’, ‘OPENAI_API_VERSION’, and ‘AZURE_OPENAI_API_KEY’. Users must obtain these values from the ‘Keys and Endpoints’ section in their Azure OpenAI service. Additionally, the openai python client must be installed with pip install openai>=1.13.0.

The vectorizer supports both synchronous and asynchronous operations, allowing for batch processing of texts and flexibility in handling preprocessing tasks.

# Synchronous embedding of a single text
vectorizer = AzureOpenAITextVectorizer(
    model="text-embedding-ada-002",
    api_config={
        "api_key": "your_api_key", # OR set AZURE_OPENAI_API_KEY in your env
        "api_version": "your_api_version", # OR set OPENAI_API_VERSION in your env
        "azure_endpoint": "your_azure_endpoint", # OR set AZURE_OPENAI_ENDPOINT in your env
    }
)
embedding = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"],
    batch_size=2
)

Initialize the AzureOpenAI vectorizer.

Parameters:
  • model (str) – Deployment to use for embedding. Must be the ‘Deployment name’ not the ‘Model name’. Defaults to ‘text-embedding-ada-002’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key, API version, Azure endpoint, and any other API options. Defaults to None.

Raises:
  • ImportError – If the openai library is not installed.

  • ValueError – If the AzureOpenAI API key, version, or endpoint are not provided.
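
The three required settings and their environment-variable fallbacks can be summarized in a small lookup table. The sketch below resolves each value and reports every missing one in a single ValueError. The helper resolve_azure_config and the precedence of api_config over the environment are assumptions for illustration only.

```python
import os

# api_config key -> environment variable named in the description above
AZURE_SETTINGS = {
    "api_key": "AZURE_OPENAI_API_KEY",
    "api_version": "OPENAI_API_VERSION",
    "azure_endpoint": "AZURE_OPENAI_ENDPOINT",
}


def resolve_azure_config(api_config=None, environ=None):
    """Resolve each setting from api_config, then from the environment."""
    environ = os.environ if environ is None else environ
    api_config = api_config or {}
    resolved, missing = {}, []
    for key, env_var in AZURE_SETTINGS.items():
        value = api_config.get(key) or environ.get(env_var)
        if value:
            resolved[key] = value
        else:
            missing.append(f"{key} (or {env_var})")
    if missing:
        raise ValueError("Missing Azure OpenAI settings: " + ", ".join(missing))
    return resolved


config = resolve_azure_config(
    {"api_key": "k"},
    environ={
        "OPENAI_API_VERSION": "v",
        "AZURE_OPENAI_ENDPOINT": "https://example.invalid",
    },
)
```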

async aembed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Asynchronously embed a chunk of text using the AzureOpenAI API.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.

async aembed_many(texts, preprocess=None, batch_size=1000, as_buffer=False, **kwargs)[source]#

Asynchronously embed many chunks of texts using the AzureOpenAI API.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 1000.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the AzureOpenAI API.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.

embed_many(texts, preprocess=None, batch_size=10, as_buffer=False, **kwargs)[source]#

Embed many chunks of texts using the AzureOpenAI API.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 10.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.

VertexAITextVectorizer#

class VertexAITextVectorizer(model='textembedding-gecko', api_config=None)[source]#

Bases: BaseVectorizer

The VertexAITextVectorizer uses Google’s VertexAI PaLM 2 embedding model API to create text embeddings.

This vectorizer is tailored for use in environments where integration with Google Cloud Platform (GCP) services is a key requirement.

Using this vectorizer requires an active GCP project and location (region), along with appropriate application credentials. These can be provided through the api_config dictionary or by setting the GOOGLE_APPLICATION_CREDENTIALS environment variable. Additionally, the vertexai Python client must be installed with pip install google-cloud-aiplatform>=1.26.

# Synchronous embedding of a single text
vectorizer = VertexAITextVectorizer(
    model="textembedding-gecko",
    api_config={
        "project_id": "your_gcp_project_id", # OR set GCP_PROJECT_ID
        "location": "your_gcp_location",     # OR set GCP_LOCATION
    })
embedding = vectorizer.embed("Hello, world!")

# Synchronous batch embedding of multiple texts
embeddings = vectorizer.embed_many(
    ["Hello, world!", "Goodbye, world!"],
    batch_size=2
)

Initialize the VertexAI vectorizer.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘textembedding-gecko’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API config details. Defaults to None.

Raises:
  • ImportError – If the google-cloud-aiplatform library is not installed.

  • ValueError – If the API key is not provided.

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the VertexAI API.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.

embed_many(texts, preprocess=None, batch_size=10, as_buffer=False, **kwargs)[source]#

Embed many chunks of texts using the VertexAI API.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 10.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If the wrong input type is passed in for the text.

CohereTextVectorizer#

class CohereTextVectorizer(model='embed-english-v3.0', api_config=None)[source]#

Bases: BaseVectorizer

The CohereTextVectorizer class utilizes Cohere’s API to generate embeddings for text data.

This vectorizer is designed to interact with Cohere’s /embed API, requiring an API key for authentication. The key can be provided directly in the api_config dictionary or through the COHERE_API_KEY environment variable. Users must obtain an API key from Cohere’s website (https://dashboard.cohere.com/). Additionally, the cohere Python client must be installed with pip install cohere.

The vectorizer supports only synchronous operations. It allows for batch processing of texts and flexibility in handling preprocessing tasks.

from redisvl.utils.vectorize import CohereTextVectorizer

vectorizer = CohereTextVectorizer(
    model="embed-english-v3.0",
    api_config={"api_key": "your-cohere-api-key"} # OR set COHERE_API_KEY in your env
)
query_embedding = vectorizer.embed(
    text="your input query text here",
    input_type="search_query"
)
doc_embeddings = vectorizer.embed_many(
    texts=["your document text", "more document text"],
    input_type="search_document"
)

Initialize the Cohere vectorizer.

Visit https://cohere.ai/embed to learn about embeddings.

Parameters:
  • model (str) – Model to use for embedding. Defaults to ‘embed-english-v3.0’.

  • api_config (Optional[Dict], optional) – Dictionary containing the API key. Defaults to None.

Raises:
  • ImportError – If the cohere library is not installed.

  • ValueError – If the API key is not provided.

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the Cohere Embeddings API.

Must provide the embedding input_type as a kwarg to this method that specifies the type of input you’re giving to the model.

Supported input types:
  • search_document: Used for embeddings stored in a vector database for search use-cases.

  • search_query: Used for embeddings of search queries run against a vector DB to find relevant documents.

  • classification: Used for embeddings passed through a text classifier.

  • clustering: Used for the embeddings run through a clustering algorithm.

When hydrating your Redis DB, embed the documents you want to search over with input_type=“search_document”; when querying the database, set input_type=“search_query”. If you want to use the embeddings for a downstream classification or clustering task, set input_type=“classification” or input_type=“clustering”.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

  • input_type (str) – Specifies the type of input passed to the model. Required for embedding models v3 and higher.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If an invalid input_type is provided.

embed_many(texts, preprocess=None, batch_size=10, as_buffer=False, **kwargs)[source]#

Embed many chunks of text using the Cohere Embeddings API.

Must provide the embedding input_type as a kwarg to this method that specifies the type of input you’re giving to the model.

Supported input types:
  • search_document: Used for embeddings stored in a vector database for search use-cases.

  • search_query: Used for embeddings of search queries run against a vector DB to find relevant documents.

  • classification: Used for embeddings passed through a text classifier.

  • clustering: Used for the embeddings run through a clustering algorithm.

When hydrating your Redis DB, embed the documents you want to search over with input_type=“search_document”; when querying the database, set input_type=“search_query”. If you want to use the embeddings for a downstream classification or clustering task, set input_type=“classification” or input_type=“clustering”.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 10.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

  • input_type (str) – Specifies the type of input passed to the model. Required for embedding models v3 and higher.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:

TypeError – If an invalid input_type is provided.
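
Because input_type must be one of the four supported strings, a small up-front check can catch typos such as a space in place of the underscore. The sketch below is a hypothetical helper, not library code; it raises TypeError to mirror the error type documented above.

```python
SUPPORTED_INPUT_TYPES = (
    "search_document",
    "search_query",
    "classification",
    "clustering",
)


def check_input_type(input_type):
    """Return input_type unchanged if supported, otherwise raise TypeError."""
    if input_type not in SUPPORTED_INPUT_TYPES:
        raise TypeError(
            f"Invalid input_type {input_type!r}; expected one of "
            + ", ".join(SUPPORTED_INPUT_TYPES)
        )
    return input_type


query_type = check_input_type("search_query")        # for queries
document_type = check_input_type("search_document")  # for stored documents
```

Documents and queries use different input_type values, so indexing code and query code would each pass their own value to embed or embed_many.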

CustomTextVectorizer#

class CustomTextVectorizer(embed, embed_many=None, aembed=None, aembed_many=None)[source]#

Bases: BaseVectorizer

The CustomTextVectorizer class wraps user-defined embedding methods to create embeddings for text data.

This vectorizer accepts user-provided callables and wraps them in a class that is compatible with RedisVL.

The vectorizer may support both synchronous and asynchronous operations, which allows for batch processing of texts, but at a minimum only synchronous embedding is required to satisfy the ‘embed()’ method.

# Synchronous embedding of a single text
vectorizer = CustomTextVectorizer(
    embed = my_vectorizer.generate_embedding
)
embedding = vectorizer.embed("Hello, world!")

# Asynchronous batch embedding of multiple texts
# (raises NotImplementedError unless aembed_many was passed to the constructor)
embeddings = await vectorizer.aembed_many(
    ["Hello, world!", "How are you?"],
    batch_size=2
)

Initialize the Custom vectorizer.

Parameters:
  • embed (Callable) – a Callable function that accepts a string object and returns a list of floats.

  • embed_many (Optional[Callable]) – a Callable function that accepts a list of string objects and returns a list containing lists of floats. Defaults to None.

  • aembed (Optional[Callable]) – an asynchronous Callable function that accepts a string object and returns a list of floats. Defaults to None.

  • aembed_many (Optional[Callable]) – an asynchronous Callable function that accepts a list of string objects and returns a list containing lists of floats. Defaults to None.

Raises:
  • ValueError – If any of the provided functions accept or return incorrect types.

  • TypeError – If any of the provided functions are not Callable objects.
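
To illustrate the callable contracts listed above (a string in, a list of floats out; a list of strings in, a list of lists of floats out), here is a toy pair of functions built on the standard library. The names toy_embed and toy_embed_many and the hashing scheme are invented for this example and produce no semantically meaningful vectors; a real embedding model would take their place.

```python
import hashlib


def toy_embed(text, dims=8):
    """Map a string to a deterministic list of `dims` floats in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


def toy_embed_many(texts):
    """Map a list of strings to a list of embeddings."""
    return [toy_embed(text) for text in texts]


single = toy_embed("Hello, world!")
many = toy_embed_many(["Hello, world!", "How are you?"])

# Hypothetical wiring, following the constructor signature shown above:
# vectorizer = CustomTextVectorizer(embed=toy_embed, embed_many=toy_embed_many)
```

Passing embed_many alongside embed avoids the NotImplementedError that the batch method raises when no batch callable was supplied.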

async aembed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Asynchronously embed a chunk of text.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:
  • TypeError – If the wrong input type is passed in for the text.

  • NotImplementedError – If aembed was not passed to constructor.

async aembed_many(texts, preprocess=None, batch_size=1000, as_buffer=False, **kwargs)[source]#

Asynchronously embed many chunks of texts.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 1000.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:
  • TypeError – If the wrong input type is passed in for the text.

  • NotImplementedError – If aembed_many was not passed to constructor.

embed(text, preprocess=None, as_buffer=False, **kwargs)[source]#

Embed a chunk of text using the provided function.

Parameters:
  • text (str) – Chunk of text to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

Embedding.

Return type:

List[float]

Raises:

TypeError – If the wrong input type is passed in for the text.

embed_many(texts, preprocess=None, batch_size=10, as_buffer=False, **kwargs)[source]#

Embed many chunks of texts using the provided function.

Parameters:
  • texts (List[str]) – List of text chunks to embed.

  • preprocess (Optional[Callable], optional) – Optional preprocessing callable to perform before vectorization. Defaults to None.

  • batch_size (int, optional) – Batch size of texts to use when creating embeddings. Defaults to 10.

  • as_buffer (bool, optional) – Whether to convert the raw embedding to a byte string. Defaults to False.

Returns:

List of embeddings.

Return type:

List[List[float]]

Raises:
  • TypeError – If the wrong input type is passed in for the text.

  • NotImplementedError – If embed_many was not passed to constructor.