[Question]: index.docstore is empty after persisting nodes in chromadb #14574

Closed
1 task done
BalasubramanyamEvani opened this issue Jul 4, 2024 · 4 comments
Labels
question Further information is requested

Comments

@BalasubramanyamEvani

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

Hello,

I have persisted the nodes in ChromaDB along with the storage context. However, when I retrieve the vector index, index.docstore is empty. How can I get the nodes later to use with BM25Retriever? Here is the code used for persisting and retrieving:

# node transformation
node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# collect llama index documents
documents = process_documents(df)

# initialize chroma client, setting path to save data
db = chromadb.PersistentClient(path=chroma_db_path)

# create collection
chroma_collection = db.get_or_create_collection(collection_name)

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embedding Model
embed_model = HuggingFaceEmbedding(model_name=hf_model_name, device=hf_device)

# create your index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
    transformations=[node_parser],
    embed_model=embed_model,
)

# Here we save the index to the path we want
index.storage_context.persist(persist_dir=os.path.join(chroma_db_path, "llamai"))

# initialize chroma client, setting path to save data
db = chromadb.PersistentClient(path=chroma_db_path)

# create collection
chroma_collection = db.get_or_create_collection(collection_name)

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir=os.path.join(chroma_db_path, "llamai")
)

# Embedding Model
embed_model = HuggingFaceEmbedding(model_name=hf_model_name, device=hf_device)

# get the index
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    storage_context=storage_context,
    embed_model=embed_model,
)

# return the index
return index
@BalasubramanyamEvani BalasubramanyamEvani added the question Further information is requested label Jul 4, 2024
@logan-markewich
Collaborator

This is expected. The docstore is disabled with most 3rd party vector stores to simplify storage, since the nodes are stored in Chroma itself.

You can override this if you want: VectorStoreIndex.from_documents(...., store_nodes_override=True)
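
As a rough sketch of how the override could fit into the code from the question: the helper names `build_index_with_docstore` and `load_bm25_retriever` are made up for illustration, and the imports sit inside the functions only so the snippet loads without the optional packages installed. It assumes a recent `llama-index-core` plus the `llama-index-retrievers-bm25` package, so check the calls against your installed version.

```python
def build_index_with_docstore(documents, vector_store, embed_model, persist_dir):
    """Build the index while also keeping nodes in the docstore (hypothetical helper)."""
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        embed_model=embed_model,
        # Keep nodes in the docstore even though the vector store holds the text.
        store_nodes_override=True,
    )
    # Writes the docstore (and index store) as files under persist_dir.
    index.storage_context.persist(persist_dir=persist_dir)
    return index


def load_bm25_retriever(persist_dir, similarity_top_k=2):
    """Reload the persisted docstore and build a BM25 retriever from it (hypothetical helper)."""
    from llama_index.core.storage.docstore import SimpleDocumentStore
    from llama_index.retrievers.bm25 import BM25Retriever

    docstore = SimpleDocumentStore.from_persist_dir(persist_dir)
    return BM25Retriever.from_defaults(
        docstore=docstore, similarity_top_k=similarity_top_k
    )
```

With the variables from the question, this would be called as `build_index_with_docstore(documents, vector_store, embed_model, os.path.join(chroma_db_path, "llamai"))` once, and `load_bm25_retriever(os.path.join(chroma_db_path, "llamai"))` in later sessions.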

@BalasubramanyamEvani
Author

I understand. Could you please clarify the correct way to use BM25Retriever? Instead of providing the nodes during initialization, I supplied a reference to the docstore, but it resulted in an error.

  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/llama_index/retrievers/bm25/base.py", line 73, in from_defaults
    return cls(
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/llama_index/retrievers/bm25/base.py", line 40, in __init__
    self.bm25 = BM25Okapi(self._corpus)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 83, in __init__
    super().__init__(corpus, tokenizer)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 27, in __init__
    nd = self._initialize(corpus)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 52, in _initialize
    self.avgdl = num_doc / self.corpus_size
ZeroDivisionError: division by zero
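
The last frame of the traceback is consistent with an empty corpus: if the docstore holds no nodes, there is nothing to average over. The following is a simplified sketch of that computation, not the actual `rank_bm25` source, together with a hypothetical `require_nodes` guard that fails earlier with a clearer message.

```python
def average_document_length(corpus):
    """Mean token count per document, mirroring the division in the traceback."""
    corpus_size = 0
    total_tokens = 0
    for document in corpus:
        total_tokens += len(document)
        corpus_size += 1
    # With no documents, corpus_size is 0 and this division fails.
    return total_tokens / corpus_size


def require_nodes(nodes):
    """Hypothetical guard: raise a descriptive error instead of ZeroDivisionError."""
    nodes = list(nodes)
    if not nodes:
        raise ValueError("No nodes found; the docstore is probably empty.")
    return nodes


print(average_document_length([["hello", "world"], ["bm25"]]))  # 1.5

try:
    average_document_length([])
except ZeroDivisionError as exc:
    print(f"empty corpus: {exc}")
```

A check such as `require_nodes(index.docstore.docs.values())` before building the retriever would surface the empty docstore directly, assuming the docstore exposes its contents through a `docs` mapping.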

@logan-markewich
Collaborator

You'll need to either manually populate the docstore or use the flag above, and then persist the docstore somewhere.

Or, you can directly save the nodes somewhere
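
One possible way to save the nodes directly is a JSON Lines file, one node per line. This is only a sketch: `save_nodes`, `load_nodes`, and `DemoNode` are hypothetical names, and it assumes the node objects expose `to_dict()` and a `from_dict()` classmethod, as LlamaIndex's schema classes are expected to. `DemoNode` is a stand-in so the snippet runs without LlamaIndex installed.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


def save_nodes(nodes, path):
    """Write one JSON object per line, using each node's to_dict()."""
    with Path(path).open("w", encoding="utf-8") as f:
        for node in nodes:
            f.write(json.dumps(node.to_dict()) + "\n")


def load_nodes(path, node_cls):
    """Rebuild nodes from the file via node_cls.from_dict()."""
    with Path(path).open("r", encoding="utf-8") as f:
        return [node_cls.from_dict(json.loads(line)) for line in f if line.strip()]


@dataclass
class DemoNode:
    """Hypothetical stand-in for a real node class; not part of LlamaIndex."""

    id_: str
    text: str

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)
```

In the setup from the question, the nodes could come from `node_parser.get_nodes_from_documents(documents)`, be written once with `save_nodes(nodes, "nodes.jsonl")`, and later be reloaded with `load_nodes("nodes.jsonl", TextNode)` and passed to `BM25Retriever.from_defaults(nodes=nodes)`. The file name here is arbitrary.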

@BalasubramanyamEvani
Author

Got it! Thanks for your help @logan-markewich
