[Question]: Showing duplicate entries in my vector db #14611

Open · 1 task done
Amiya-Lahiri-AI opened this issue Jul 7, 2024 · 1 comment
Labels: question (Further information is requested)

Comments

@Amiya-Lahiri-AI

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I have created nodes using an ingestion pipeline:

import time
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.extractors import SummaryExtractor

pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),  # custom text-cleaning transform defined elsewhere in my script
        SummaryExtractor(prompt_template=DEFAULT_SUMMARY_PROMPT_TMPL),
        VoyageEmbeddModel,  # embedding model instance defined elsewhere in my script
    ],
)

CvFirstChunk = pipeline.run(documents=first_node)
time.sleep(90)
CvSecondChunk = pipeline.run(documents=second_node)
time.sleep(90)
CvThirdChunk = pipeline.run(documents=third_node)
time.sleep(90)
CvFourthChunk = pipeline.run(documents=fourth_node)
time.sleep(90)
CvFifthChunk = pipeline.run(documents=fifth_node)
time.sleep(90)
CvSixthChunk = pipeline.run(documents=sixth_node)

After concatenating all the chunks, I have 113 nodes in total.
However, when I upsert those nodes into my Pinecone vector DB, it shows a vector count of 572.

This is the code I am using to upsert the nodes into my vector DB:

vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    nodes,
    storage_context=storage_context,
    embed_model=EmbeddModel,
)

Can anyone please help me out with this?

Amiya-Lahiri-AI added the question label on Jul 7, 2024

dosubot bot commented Jul 7, 2024

The issue of duplicate entries in your vector database could be due to several factors in your ingestion and upserting process. Here are some steps and best practices to help you identify and resolve the problem:

  1. Ensure Unique Node Identifiers: Each node should have a unique, stable identifier to prevent duplicates. When using the IngestionPipeline, make sure that each node has a unique ref_doc_id or node.id_ (see the sketch after this list).

  2. Check for Existing Hashes: Before upserting nodes, check whether a node with the same ref_doc_id already exists in the document store and whether its hash matches the current node's hash. If the hash differs, update the node; otherwise, skip it.

  3. Handle Upserts and Deletions: Use the docstore_strategy to manage duplicates by comparing document hashes and IDs. The IngestionPipeline supports the strategies UPSERTS, DUPLICATES_ONLY, and UPSERTS_AND_DELETE.

  4. PineconeVectorStore Configuration: Pinecone upserts by vector ID, so re-inserting a vector with the same ID overwrites the existing entry, while vectors with new IDs are added as new entries. Ensure that your ingestion produces stable IDs across runs.
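
For point 1, here is a minimal sketch of assigning stable, deterministic document IDs before ingestion. The filename-derived id_ scheme and the raw_texts mapping are illustrative assumptions, not from the original question; any deterministic scheme works. With stable IDs, re-running the pipeline updates existing entries instead of adding new ones:

from llama_index.core import Document

# Hypothetical helper data: raw_texts maps a source path to its text.
raw_texts = {"cv_part_1.txt": "first chunk text ..."}

documents = [
    Document(text=text, id_=f"cv-{path}")  # stable ID derived from the source path
    for path, text in raw_texts.items()
]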

Here is the logic the IngestionPipeline uses internally to handle upserts (its _handle_upserts method):

def _handle_upserts(self, nodes: List[BaseNode], store_doc_text: bool = True) -> List[BaseNode]:
    assert self.docstore is not None

    existing_doc_ids_before = set(self.docstore.get_all_document_hashes().values())
    doc_ids_from_nodes = set()
    deduped_nodes_to_run = {}
    for node in nodes:
        ref_doc_id = node.ref_doc_id if node.ref_doc_id else node.id_
        doc_ids_from_nodes.add(ref_doc_id)
        existing_hash = self.docstore.get_document_hash(ref_doc_id)
        if not existing_hash:
            # Document not seen before: record its hash and run it.
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        elif existing_hash and existing_hash != node.hash:
            # Document changed: delete the old version from the docstore
            # and the vector store, then re-run the new version.
            self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)
            self.docstore.set_document_hash(ref_doc_id, node.hash)
            deduped_nodes_to_run[ref_doc_id] = node
        else:
            continue  # Document unchanged: skip it.

    if self.docstore_strategy == DocstoreStrategy.UPSERTS_AND_DELETE:
        # Remove documents that exist in the stores but not in the new input.
        doc_ids_to_delete = existing_doc_ids_before - doc_ids_from_nodes
        for ref_doc_id in doc_ids_to_delete:
            self.docstore.delete_document(ref_doc_id)
            if self.vector_store is not None:
                self.vector_store.delete(ref_doc_id)

    nodes_to_run = list(deduped_nodes_to_run.values())
    self.docstore.add_documents(nodes_to_run, store_text=store_doc_text)
    return nodes_to_run
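
Note that this deduplication path only runs when the pipeline has a docstore attached. Here is a minimal sketch of wiring it up, reusing the question's transformations and Pinecone vector store; SimpleDocumentStore is one possible docstore, and the strategy name matches the enum above:

from llama_index.core.ingestion import IngestionPipeline, DocstoreStrategy
from llama_index.core.storage.docstore import SimpleDocumentStore

pipeline = IngestionPipeline(
    transformations=[...],           # e.g. TextCleaner(), SummaryExtractor(...), embed model
    docstore=SimpleDocumentStore(),  # enables the hash/ID checks shown above
    vector_store=vector_store,       # nodes are written straight to Pinecone
    docstore_strategy=DocstoreStrategy.UPSERTS,
)
nodes = pipeline.run(documents=documents)  # re-running now upserts instead of duplicating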

Additionally, ensure that your PineconeVectorStore is configured correctly:

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="my_namespace",                  # isolate vectors per namespace
    insert_kwargs={"some_key": "some_value"},  # extra kwargs forwarded to upsert calls
    batch_size=50,                             # vectors per upsert batch
    remove_text_from_metadata=True,            # keep node text out of Pinecone metadata
    add_sparse_vector=True,                    # enable hybrid dense + sparse retrieval
    tokenizer=my_tokenizer,                    # used to build the sparse vectors
    default_empty_query_vector=[0.0] * 128,    # fallback vector for empty queries
)
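
As a usage note (a sketch that mirrors the question's variables nodes and EmbeddModel): when the nodes have already been transformed by the pipeline, passing them to the VectorStoreIndex constructor avoids re-running transformations on them. VectorStoreIndex.from_documents applies its node parser to whatever it receives, which can re-split already-chunked nodes and inflate the vector count:

from llama_index.core import VectorStoreIndex, StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(
    nodes=nodes,                      # already-transformed nodes from the pipeline
    storage_context=storage_context,
    embed_model=EmbeddModel,          # the question's embedding model instance
)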

By following these practices, you can help prevent and manage duplicate entries in your vector database [1][2][3].

To continue talking to Dosu, mention @dosu.
