[Question]: index.docstore is empty after persisting nodes in chromadb #14574

BalasubramanyamEvani · 2024-07-04T18:43:17Z

Question Validation

I have searched both the documentation and discord for an answer.

Question

Hello,

I have persisted the nodes in ChromaDB along with the storage context. However, when I retrieve the vector index, index.docstore is empty. How can I get the nodes later to use them with BM25Retriever? Here is the code used for persisting and retrieving:

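# imports (llama-index 0.10+ package layout)
import os

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# node transformation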
node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)

# collect llama index documents
documents = process_documents(df)

# initialize chroma client, setting path to save data
db = chromadb.PersistentClient(path=chroma_db_path)

# create collection
chroma_collection = db.get_or_create_collection(collection_name)

# assign chroma as the vector_store to the context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embedding Model
embed_model = HuggingFaceEmbedding(model_name=hf_model_name, device=hf_device)

# create your index
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    show_progress=True,
    transformations=[node_parser],
    embed_model=embed_model,
)

# persist the storage context in a subdirectory of the chroma path
index.storage_context.persist(persist_dir=os.path.join(chroma_db_path, "llamai"))

# --- retrieving ---

# reopen the chroma client at the path where the data was saved
db = chromadb.PersistentClient(path=chroma_db_path)

# get the existing collection
chroma_collection = db.get_or_create_collection(collection_name)

# assign chroma as the vector_store and load the persisted storage context
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store, persist_dir=os.path.join(chroma_db_path, "llamai")
)

# Embedding Model
embed_model = HuggingFaceEmbedding(model_name=hf_model_name, device=hf_device)

# get the index
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    storage_context=storage_context,
    embed_model=embed_model,
)

# return the index (this part runs inside a function)
return index

logan-markewich · 2024-07-04T18:45:40Z

This is expected. The docstore is disabled with most third-party vector stores to simplify storage, since the nodes are stored in Chroma itself.

You can override this if you want: VectorStoreIndex.from_documents(..., store_nodes_override=True)
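A minimal sketch of that override, assuming the llama-index 0.10+ import paths and the `documents`, `vector_store`, and persist directory from the question. The function names `build_and_persist` and `load_bm25` are illustrative, not part of the library. Reading the docstore from the reloaded `StorageContext` instead of from the index is a cautious choice here, on the assumption that an index rebuilt with `from_vector_store` may not carry the persisted docstore.

```python
def build_and_persist(documents, vector_store, persist_dir, **index_kwargs):
    """Build the index with the docstore kept, then persist it to disk."""
    # Imported lazily so the sketch can be loaded without the packages installed.
    from llama_index.core import StorageContext, VectorStoreIndex

    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        store_nodes_override=True,  # keep nodes in the docstore as well as in Chroma
        **index_kwargs,  # e.g. transformations=[node_parser], embed_model=embed_model
    )
    # Writes the (now populated) docstore to persist_dir.
    index.storage_context.persist(persist_dir=persist_dir)
    return index


def load_bm25(vector_store, persist_dir, top_k=5):
    """Reload the persisted docstore and build a BM25 retriever from it."""
    from llama_index.core import StorageContext
    from llama_index.retrievers.bm25 import BM25Retriever

    storage_context = StorageContext.from_defaults(
        vector_store=vector_store, persist_dir=persist_dir
    )
    # Use the docstore loaded from disk directly (assumption: safer than index.docstore).
    return BM25Retriever.from_defaults(
        docstore=storage_context.docstore, similarity_top_k=top_k
    )
```

With this layout the vectors stay in Chroma while the node text is also written to the persist directory, so the two copies need to be rebuilt together if the documents change.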

BalasubramanyamEvani · 2024-07-04T19:05:49Z

I understand. Could you please clarify the correct way to use BM25Retriever? Instead of providing the nodes during initialization, I supplied a reference to the docstore, but it resulted in an error.

  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/llama_index/retrievers/bm25/base.py", line 73, in from_defaults
    return cls(
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/llama_index/retrievers/bm25/base.py", line 40, in __init__
    self.bm25 = BM25Okapi(self._corpus)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 83, in __init__
    super().__init__(corpus, tokenizer)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 27, in __init__
    nd = self._initialize(corpus)
  File "/usr/local/anaconda3/envs/rag-search/lib/python3.9/site-packages/rank_bm25.py", line 52, in _initialize
    self.avgdl = num_doc / self.corpus_size
ZeroDivisionError: division by zero
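The last frame of the traceback divides by the corpus size, so an empty docstore (zero nodes) is enough to trigger the error. The following is a simplified plain-Python stand-in for that computation, not rank_bm25's actual code; both function names are hypothetical.

```python
def average_doc_length(corpus):
    """Mean token count per document, dividing by the corpus size as in the traceback."""
    total_tokens = sum(len(doc) for doc in corpus)
    return total_tokens / len(corpus)  # raises ZeroDivisionError for an empty corpus


def require_nodes(nodes):
    """Fail early with a clearer message than 'division by zero'."""
    if not nodes:
        raise ValueError("No nodes found: the docstore is empty, so BM25 has no corpus.")
    return nodes
```

A check like `require_nodes` before building the retriever turns the opaque arithmetic error into a message that points at the empty docstore.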

logan-markewich · 2024-07-04T21:39:59Z

You'll need to either manually populate the docstore or use the flag above, and then persist the docstore somewhere.

Or, you can save the nodes somewhere directly.
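One way to save the nodes directly is sketched below using only the standard library. It assumes the node objects are picklable and that the file is only ever loaded from a trusted location; `save_nodes` and `load_nodes` are hypothetical helper names, not llama-index APIs.

```python
import pickle
from pathlib import Path


def save_nodes(nodes, path):
    """Write the parsed nodes to disk so they can be reused later."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(list(nodes), f)


def load_nodes(path):
    """Read the nodes back; only unpickle files you created yourself."""
    with Path(path).open("rb") as f:
        return pickle.load(f)
```

Under these assumptions, the nodes could be produced once with `node_parser.get_nodes_from_documents(documents)`, saved with `save_nodes`, and later passed to `BM25Retriever.from_defaults(nodes=load_nodes(path))`.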

BalasubramanyamEvani · 2024-07-04T22:31:58Z

Got it! Thanks for your help @logan-markewich

BalasubramanyamEvani added the "question" label (Further information is requested) Jul 4, 2024

BalasubramanyamEvani closed this as completed Jul 4, 2024
