LVD: Learned Vector Database

LVD is a vector database that lets you store and query embeddings. It is built on top of a fork of ChromaDB. Internally, LVD uses learned indexing to manage unstructured data. At the time of writing, LVD is the only database that uses this kind of indexing.

Setup

To start using this project, first make sure that the LMI index git submodule is initialized.

git submodule update --init --recursive

Create a Python conda environment and activate it.

conda create --name lvd_env python=3.8.18
conda activate lvd_env

Next install the dependencies.

pip install -r requirements.txt
pip install -r requirements_dev.txt

Install the torch library. The library is used for training the LMI index.

pip install torch --index-url https://download.pytorch.org/whl/cpu

Create a kernel for the Jupyter notebook.

python -m ipykernel install --user --name lvd_env --display-name "lvd_env"

Demo

System Usage

Next, run the command below to open system_usage.ipynb.

jupyter notebook ./lvd_notebooks/system_usage.ipynb --notebook-dir=./

After the notebook opens, remember to switch the kernel in the Jupyter interface to lvd_env, which was created in the previous step. This notebook demonstrates how to use the LVD system to manage unstructured data. It exercises every operation that LVD supports. Two of the data retrieval operations are shown here as an example:

# Constrained Search Operation
results = collection.query(
    query_embeddings=[[1, 2, 3]],
    include=["metadatas", "documents", "distances"],
    where={"cluster": "red"},
    n_results=5,
    n_buckets=1,
)

# Hybrid Search Operation
results = collection.query(
    query_embeddings=[[1, 2, 3]],
    include=["metadatas", "distances", "documents"],
    n_results=5,
    n_buckets=1,
    where_document={"$hybrid": {"$hybrid_terms": ["digital", "data", "programming"]}},
)
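To make the idea of a constrained search more concrete, here is a minimal conceptual sketch in plain Python. It is not how LVD implements the operation (LVD uses a learned index rather than a brute-force scan). The records, the function name constrained_search, and the choice of Euclidean distance are all assumptions made for illustration.

```python
import math

# Toy records; the ids, vectors and metadata are invented for illustration.
records = [
    {"id": "a", "embedding": [1.0, 2.0, 3.0], "metadata": {"cluster": "red"}},
    {"id": "b", "embedding": [1.0, 2.0, 4.0], "metadata": {"cluster": "blue"}},
    {"id": "c", "embedding": [4.0, 2.0, 3.0], "metadata": {"cluster": "red"}},
]


def constrained_search(query, where, n_results):
    """Brute-force stand-in: keep records matching `where`, then rank by distance."""
    matching = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in where.items())
    ]
    # Euclidean distance is an assumption; the real metric depends on configuration.
    scored = sorted((math.dist(query, r["embedding"]), r["id"]) for r in matching)
    return scored[:n_results]


hits = constrained_search([1, 2, 3], {"cluster": "red"}, n_results=5)
print(hits)
```

Record "b" is the second closest vector overall, but it is excluded because its metadata does not satisfy the where condition.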

RAG Usage

Next, run the command below to open rag_usage.ipynb.

jupyter notebook ./lvd_notebooks/rag_usage.ipynb --notebook-dir=./

The notebook rag_usage.ipynb shows how LVD can be used within a RAG architecture. For this demonstration, LVD is combined with OpenAI ChatGPT 3.5, but you can use any other LLM. The demonstration uses the Arxiv dataset of scientific papers. The documents in the dataset have already been pre-split (pre-chunked) by the dataset's authors. LVD stores the chunked documents and serves them as context to the LLM.
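The step of serving retrieved chunks as context can be sketched as follows. This is only an illustration and not the code from the notebook. The function build_prompt, the prompt wording, and the sample texts are hypothetical. The nested result layout (one inner list per query) follows upstream ChromaDB, and it is assumed here that LVD keeps it.

```python
def build_prompt(question, results, max_chunks=3):
    """Join the closest retrieved chunks into a context block for an LLM prompt."""
    # Results are nested per query; take the lists for the first (only) query.
    documents = results["documents"][0]
    distances = results["distances"][0]
    # Smaller distance means a closer match, so sort ascending.
    ranked = sorted(zip(distances, documents))[:max_chunks]
    context = "\n\n".join(doc for _, doc in ranked)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hand-written stand-in for a query result; the texts are invented.
results = {
    "documents": [["Chunk about transformers.", "Chunk about learned indexes."]],
    "distances": [[0.42, 0.17]],
}
prompt = build_prompt("What is a learned index?", results, max_chunks=1)
print(prompt)
```

The resulting string would then be sent to whichever LLM you choose.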

Server

You can also set up an LVD server locally or deploy it to a Kubernetes cluster. You can then use the LVD client to interact with the server.

Docker

LVD is dockerized and can run locally in a Docker container or be deployed to a Kubernetes cluster. The Dockerfile in the root directory defines the Docker image. The deployment.yaml file in the root directory uses this image to deploy the LVD server to a Kubernetes cluster.

To build the Docker image, run the following command.

docker build -t lvd .

Start the LVD server in a Docker container. The command below maps host port 5000 to port 8000 inside the container, so the server is reachable on port 5000 of the host.

docker run -p 5000:8000 lvd
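Once the container is running, one simple way to confirm that the published port accepts connections is a plain TCP check. The sketch below uses only the Python standard library. The helper name is_port_open is hypothetical, and the check only verifies that something is listening, not that it is LVD.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call once the container is up (host port 5000 from the command above):
# is_port_open("localhost", 5000)
```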

Client

If you want to use the LVD client outside of this repository, you can install it as a package from this repository into your Python environment. The package wraps and modifies the original ChromaDB client so that it works with LVD.

Build and install the LVD package from this repository.

python setup.py sdist
pip install ./dist/lvd-0.1.tar.gz

Import the client from the package to interact with an LVD instance. The package still has the same name as ChromaDB, so it cannot be installed alongside ChromaDB without causing name conflicts.

from chromadb import HttpClient
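As a rough sketch, connecting to a server started with the Docker command from the previous section might look like the following. The helper connect_to_lvd is hypothetical. The host and port keyword arguments are those of the upstream ChromaDB HttpClient, and it is assumed that the LVD package keeps them. Some ChromaDB versions declare the port as a string, so adjust the type if needed.

```python
def connect_to_lvd(host="localhost", port=5000):
    """Return an HTTP client for an LVD server reachable at host:port."""
    # Imported lazily so that the package is only required when the function is called.
    from chromadb import HttpClient  # provided by the LVD package under the ChromaDB name

    # Port 5000 is the host side of the `-p 5000:8000` mapping used with docker run.
    return HttpClient(host=host, port=port)
```

With a running server, client = connect_to_lvd() returns the client. In upstream ChromaDB, client.heartbeat() can then be used to check that the server responds.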

ChromaDB Inherited README

The rest of the documentation comes from the original ChromaDB repository. It describes additional features of the database. Since LVD is based on ChromaDB, it inherits all of these features.

Features

Simple: Fully-typed, fully-tested, fully-documented == happiness
Integrations: 🦜️🔗 LangChain (python and js), 🦙 LlamaIndex and more soon
Dev, Test, Prod: the same API that runs in your python notebook, scales to your cluster
Feature-rich: Queries, filtering, density estimation and more
Free & Open Source: Apache 2.0 Licensed

Use case: ChatGPT for ______

For example, the "Chat your data" use case:

Add documents to your database. You can pass in your own embeddings, embedding function, or let Chroma embed them for you.
Query relevant documents with natural language.
Compose documents into the context window of an LLM like GPT3 for additional summarization or analysis.

Embeddings?

What are embeddings?

Read the guide from OpenAI
Literal: Embedding something turns it from image/text/audio into a list of numbers. 🖼️ or 📄 => [1.2, 2.1, ....]. This process makes documents "understandable" to a machine learning model.
By analogy: An embedding represents the essence of a document. This enables documents and queries with the same essence to be "near" each other and therefore easy to find.
Technical: An embedding is the latent-space position of a document at a layer of a deep neural network. For models trained specifically to embed data, this is the last layer.
A small example: If you search your photos for "famous bridge in San Francisco", embedding this query and comparing it to the embeddings of your photos and their metadata should return photos of the Golden Gate Bridge.

Embeddings databases (also known as vector databases) store embeddings and allow you to search by nearest neighbors rather than by substrings like a traditional database. By default, Chroma uses Sentence Transformers to embed for you but you can also use OpenAI embeddings, Cohere (multilingual) embeddings, or your own.
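To illustrate nearest-neighbor search over embeddings, here is a minimal sketch in plain Python. The three-dimensional vectors and file names are invented for illustration. Real embeddings come from a model and have hundreds of dimensions, and cosine similarity is only one of several possible similarity measures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy embeddings standing in for the output of an embedding model.
photos = {
    "golden_gate.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.1, 0.9, 0.1],
    "office.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "famous bridge in San Francisco"

# The nearest neighbor is the photo whose embedding is most similar to the query.
best = max(photos, key=lambda name: cosine_similarity(query, photos[name]))
print(best)
```

A vector database performs the same kind of comparison, but uses an index so that it does not have to scan every stored embedding.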

License

Apache 2.0
