
feat: add query_by_embedding_batch #3546

Merged: 11 commits merged into main from feat/query_by_embedding_batch on Dec 8, 2022
Conversation

@tstadel (Member) commented Nov 9, 2022

Related Issues

Proposed Changes:

  • create a batch counterpart of query_by_embedding, analogous to query_batch, for DocumentStores
  • the default implementation delegates to query_by_embedding (see the sketch below)
  • implement it for OpenSearchDocumentStore and ElasticsearchDocumentStore using msearch
  • remove logger.debug("Retriever query: %s", body) from SearchEngineDocumentStore, as the same information is already logged by the OpenSearch and Elasticsearch clients
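
To illustrate the proposed API, here is a minimal sketch of the default, delegating implementation (parameter names and the exact signature are simplified and may differ from the actual Haystack code):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

import numpy as np


@dataclass
class Document:
    """Stand-in for haystack.schema.Document, only for this sketch."""
    content: str
    score: Optional[float] = None


class BaseDocumentStore:
    def query_by_embedding(
        self, query_emb: np.ndarray, filters: Optional[Dict] = None, top_k: int = 10
    ) -> List[Document]:
        raise NotImplementedError

    def query_by_embedding_batch(
        self,
        query_embs: Union[List[np.ndarray], np.ndarray],
        filters: Optional[Dict] = None,
        top_k: int = 10,
    ) -> List[List[Document]]:
        # Default: fall back to one query_by_embedding call per embedding.
        # OpenSearchDocumentStore / ElasticsearchDocumentStore override this with
        # a single msearch request (one header line plus one query body per
        # embedding) to avoid one round trip per query.
        return [
            self.query_by_embedding(emb, filters=filters, top_k=top_k)
            for emb in query_embs
        ]
```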

How did you test it?

  • added tests and relied on the existing run_batch tests of pipelines and retrievers

Notes for the reviewer

Checklist

@tstadel marked this pull request as ready for review on November 30, 2022 at 18:39
@tstadel requested a review from a team as a code owner on November 30, 2022 at 18:39
@tstadel requested review from mayankjobanputra and removed the request for a team on November 30, 2022 at 18:39
body.append(headers)
body.append(cur_query_body)

logger.debug("Retriever query: %s", body)
Contributor:

What issue are we expecting to debug with this log? Do we need to add more details to debug this issue, or is this enough?

Member Author (@tstadel):


I just copied it from query_batch, I guess. But OpenSearch's and Elasticsearch's clients already produce these logs, so you can achieve the same by just running something like

logging.getLogger("opensearch").setLevel(logging.DEBUG)

I'll remove it.
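
For reference, a minimal sketch of getting the same request logging from the clients themselves (assuming the logger names "opensearch" and "elasticsearch" used by the respective Python clients):

```python
import logging

# Let the OpenSearch / Elasticsearch Python clients log every request body
# (including msearch bodies) at DEBUG level; this makes the extra
# logger.debug call in SearchEngineDocumentStore redundant.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("opensearch").setLevel(logging.DEBUG)
logging.getLogger("elasticsearch").setLevel(logging.DEBUG)
```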

scale_score=scale_score,
)
documents = document_store.query_by_embedding_batch(
query_embs=query_embeddings, # type: ignore
Contributor:

Would this stop the MultiModalRetriever from using all the document stores?

Member:


My understanding is that any document store still works because the default implementation is just to run query_by_embedding and only OpenSearchDocumentStore and ElasticsearchDocumentStore have a special implementation. Our tests use the MultiModalRetriever only with InMemoryDocumentStore if I am not mistaken. So we should double check that manually before merging. @tstadel You could try to run the tests involving MultiModalRetriever in test_retriever.py with ElasticsearchDocumentStore, maybe?

Member Author (@tstadel), Dec 7, 2022:


@julian-risch @mayankjobanputra OK, I ran all multimodal tests of test_retriever.py locally using OpenSearchDocumentStore and added another one that uses retrieve_batch. All tests passed. I added the retrieve_batch test using InMemoryDocumentStore, like all the others.
I know that # type: ignore is not really nice here, so I resolved the query_embs one by allowing for multi-dimensional np.ndarrays.
For the other one I'll open a separate PR; it would just be too big for this one.

@julian-risch (Member):

Nice improvement idea! Changes look good to me. We should do manual checks with the MultiModalRetriever as Mayank pointed out. Other than that I am curious whether you have an idea about the expected speed increase? We don't have an easy-to-run benchmark script at the moment but it might become a topic for the next quarter...

@tstadel (Member Author) commented Dec 7, 2022

> Nice improvement idea! Changes look good to me. We should do manual checks with the MultiModalRetriever as Mayank pointed out. Other than that I am curious whether you have an idea about the expected speed increase? We don't have an easy-to-run benchmark script at the moment but it might become a topic for the next quarter...

Ok, I'll try to do that this week. Regarding the speed increase, it depends on the cluster setup. But we noticed a 49% drop in total latency on a quite powerful cluster when running multi-embedding queries (i.e., we run multiple queries (up to 50) in parallel and combine the results using reciprocal rank fusion) on a dataset with ~30 million documents*. Similar results should be possible when evaluating dense retrieval pipelines using eval_batch.

  * We used OpenSearchDocumentStore's default settings, i.e., the nmslib defaults of OpenSearch's k-NN plugin.
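
For context, a generic sketch of the reciprocal rank fusion step mentioned above (not code from this PR; the constant k=60 is a common default):

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are then sorted by their summed score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse the top results of two parallel embedding queries.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]]))
```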

@julian-risch (Member) left a review:


Looks very good to me. Nothing to add. I agree that # type: ignore can be addressed in a separate PR. Let's merge this and release the speed improvements with v1.12 in the next two weeks!

@tstadel merged commit c1c1c97 into main on Dec 8, 2022
@tstadel deleted the feat/query_by_embedding_batch branch on December 8, 2022 at 07:28
@tstadel mentioned this pull request on Dec 8, 2022

Successfully merging this pull request may close these issues.

Execute dense searches in parallel in OpenSearchDocumentStore when in batch mode