Semantic search is not working for MongoDBAtlasDocumentStore #6643

Closed
1 task done
bilgeyucel opened this issue Dec 25, 2023 · 6 comments · Fixed by #6811
Assignees
Labels
1.x P1 High priority, add to the next sprint
Milestone

Comments

@bilgeyucel
Contributor

bilgeyucel commented Dec 25, 2023

Describe the bug
EmbeddingRetriever doesn't return any results when used with MongoDBAtlasDocumentStore. The reason seems like MongoDBAtlasDocumentStore

Error message
No error

Expected behavior
Semantic search should be possible with EmbeddingRetriever

Additional context
There's no problem in write_documents, update_embeddings and get_document_count() methods of MongoDBAtlasDocumentStore. The code below works for InMemoryDocumentStore properly.

Issue #6632 may be related to this one.

To Reproduce

from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack.nodes import EmbeddingRetriever
from haystack import Document, Pipeline
import os

ds=MongoDBAtlasDocumentStore(
    mongo_connection_string="mongodb+srv://{mongo_atlas_username}:{mongo_atlas_password}@{mongo_atlas_host}/?{mongo_atlas_params_string}",
    database_name="database_name",
    collection_name="collection_name",
    embedding_dim=1536
)

## Write Documents into MongoDBAtlasDocumentStore
ds.write_documents([Document("Berlin is an amazing city."), Document("Berlin is the capital of Germany"), Document("I love Berlin.")])
retriever = EmbeddingRetriever(document_store=ds, embedding_model="text-embedding-ada-002", api_key=os.getenv("OPENAI_API_KEY"))
ds.update_embeddings(retriever)
print(ds.get_document_count()) ## returns 3

## Document search pipeline
p = Pipeline()
p.add_node(component=retriever, name="EmbeddingRetriever", inputs=["Query"])
pipe_result = p.run(query="Berlin")
print(pipe_result) ## returns {'documents': [], 'root_node': 'Query', 'params': {}, 'query': 'Berlin', 'node_id': 'EmbeddingRetriever'}

## Use Retriever standalone
result = retriever.retrieve(query="Berlin", top_k=2)
print(result) ## returns []

FAQ Check

System:

  • OS: MacOS
  • GPU/CPU: CPU
  • Haystack version (commit or version number): 1.23
  • DocumentStore: MongoDBAtlasDocumentStore
  • Reader: -
  • Retriever: EmbeddingRetriever
@jvollmuller-risk

Hi @bilgeyucel,

My issue was indeed the same. It looks like Haystack still uses the old way of querying MongoDB with embeddings. With the latest MongoDB Atlas update, there are two ways to search:

  • Sparse BM25 search with the $search aggregation stage
  • Dense search with the $vectorSearch aggregation stage

I would love to see both integrated: sparse search for MongoDBAtlasDocumentStore(bm25=True), and vector search with an EmbeddingRetriever.
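
For comparison with the dense case, here is a minimal sketch of what a sparse query with the $search stage might look like. It is not Haystack code: the helper name, the index name "default", and the field name "content" are assumptions chosen for illustration, and the pipeline needs an Atlas Search index on the collection to return anything.

```python
from typing import Any, Dict, List


def build_text_search_pipeline(
    query: str,
    top_k: int = 10,
    index: str = "default",  # assumed Atlas Search index name
    path: str = "content",   # assumed field that holds the document text
) -> List[Dict[str, Any]]:
    """Build a sparse (keyword) search pipeline based on the $search stage."""
    return [
        # Full-text match of `query` against the `path` field.
        {"$search": {"index": index, "text": {"query": query, "path": path}}},
        # Keep only the best `top_k` hits.
        {"$limit": top_k},
        # Attach the relevance score computed by Atlas Search to each hit.
        {"$addFields": {"score": {"$meta": "searchScore"}}},
    ]


pipeline = build_text_search_pipeline("Berlin", top_k=2)
```

The returned list could then be passed to PyMongo's `collection.aggregate(pipeline)`.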

I fixed your issue with semantic search by changing the pipeline to the following. Note that this only works with the new Vector Search index type; here the index is named vector_index:

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",
            "queryVector": query_emb.tolist(),
            "path": "embedding",
            "numCandidates": 100,
            "limit": top_k,
        }
    }
]
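
To try this outside of Haystack, the stage can be wrapped in a small helper. This is only a sketch under assumptions: the function name and its defaults are made up for illustration, and the index and field names are the ones from the snippet above. The check on num_candidates reflects the Atlas requirement that numCandidates must not be smaller than limit.

```python
from typing import Any, Dict, List, Sequence


def build_vector_search_pipeline(
    query_embedding: Sequence[float],
    top_k: int = 10,
    index: str = "vector_index",  # Vector Search index name from the snippet above
    path: str = "embedding",      # field that stores the document embeddings
    num_candidates: int = 100,
) -> List[Dict[str, Any]]:
    """Build a dense retrieval pipeline based on the $vectorSearch stage."""
    if top_k > num_candidates:
        raise ValueError("num_candidates must be at least top_k")
    return [
        {
            "$vectorSearch": {
                "index": index,
                # Plain Python floats, so NumPy arrays are accepted as input too.
                "queryVector": [float(x) for x in query_embedding],
                "path": path,
                "numCandidates": num_candidates,
                "limit": top_k,
            }
        },
        # Attach the similarity score to each returned document.
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], top_k=2)
```

With PyMongo, the result would be run as `collection.aggregate(pipeline)` on the collection that holds the embeddings.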

@masci added the 1.x label Dec 31, 2023
@masci added the P1 High priority, add to the next sprint label Jan 8, 2024
@scottroot

Just checking whether this is going to be fixed for 1.x. It appears to come down to the pipeline jvollmuller-risk mentioned. I currently get 0 results from the retriever with MongoDB Atlas. If I change the pipeline to the new search syntax and specify the search index (the index field, not the collection), it works, but in my environment I would have to reapply this change each time the instance spins up.

@bilgeyucel
Contributor Author

Hi @scottroot, we're working on the issue right now. If you'd like to help out, feel free to submit a PR to fix it. Your contribution would speed up the resolution process! :)

@julian-risch
Member

fixed by #6811

@tillwf

tillwf commented Feb 5, 2024

I'm not sure whether this is related, but with version 1.24.0 I tried this code:

from haystack.document_stores import MongoDBAtlasDocumentStore
from haystack.pipelines import Pipeline
from haystack.nodes import EmbeddingRetriever
import os

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string=f"mongodb+srv://{os.getenv('MONGO_USER')}:{os.getenv('MONGO_PASS')}@{os.getenv('MONGO_URL')}",
    database_name=os.getenv("MONGO_DB"),
    collection_name="articles_embeddings",
    vector_search_index="embedding_index",
    embedding_dim=384
)

dense_retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,
    scale_score=False,
)
pipeline = Pipeline()
pipeline.add_node(component=dense_retriever, name="DenseRetriever", inputs=["Query"])

result = pipeline.run(
    query="test",
    params={
        "DenseRetriever": {
            "top_k": 10,
        }
    }
)

and I get this error:

Exception: Exception while running node 'DenseRetriever': Unrecognized pipeline stage name: '$vectorSearch', full error: {'ok': 0.0, 'errmsg': "Unrecognized pipeline stage name: '$vectorSearch'", 'code': 40324, 'codeName': 'Location40324', '$clusterTime': {'clusterTime': Timestamp(1707143848, 40), 'signature': {'hash': b'\xe1{hg\x0e\xc8\x91\xc6\xec\xf6\xbe\x91\xa5,\xda(@\x8eo\x1b', 'keyId': 7294832502911270929}}, 'operationTime': Timestamp(1707143848, 40)}

Here is a screenshot of the index I created:
[image: screenshot of the index]

and the code I used to create it:

{
  "fields":[
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}

Did I do something wrong?
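
The error above carries code 40324 ("Unrecognized pipeline stage name"), which means the server itself rejected the $vectorSearch stage. A minimal sketch for turning that into a clearer message is shown below. The helper name is hypothetical, and the suggested cause (a deployment that does not support $vectorSearch, for example because of its MongoDB version) is an assumption that this thread does not confirm.

```python
from typing import Optional

# Error code taken from the message above ("Unrecognized pipeline stage name").
UNRECOGNIZED_STAGE_CODE = 40324


def vector_search_hint(exc: Exception) -> Optional[str]:
    """Return a hint if `exc` looks like the server rejecting $vectorSearch."""
    # PyMongo's OperationFailure exposes the server error code as `.code`.
    code = getattr(exc, "code", None)
    if code == UNRECOGNIZED_STAGE_CODE and "$vectorSearch" in str(exc):
        return (
            "The server does not recognize the $vectorSearch stage. "
            "Assumption: the deployment may not support Atlas Vector Search; "
            "checking the cluster's MongoDB version is a reasonable first step."
        )
    return None
```

In practice this could wrap the retriever call, for example by catching the exception raised by `collection.aggregate(...)` and logging the hint when it is not None.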

@tillwf

tillwf commented Feb 9, 2024

@julian-risch Should I create another issue for this?

6 participants