
feat: add query_by_embedding_batch #3546

Merged: 11 commits merged into main from feat/query_by_embedding_batch on Dec 8, 2022
Conversation

@tstadel (Member) commented Nov 9, 2022

Related Issues

Proposed Changes:

  • create a batch counterpart of query_by_embedding, analogous to query_batch, for DocumentStores
  • the default implementation delegates to query_by_embedding (see the sketch below)
  • implement it for OpenSearchDocumentStore and ElasticsearchDocumentStore using msearch
  • remove logger.debug("Retriever query: %s", body) from SearchEngineDocumentStore, as the same information is already logged by the OpenSearch and Elasticsearch clients
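
To illustrate the proposed API, here is a minimal sketch of the default, delegating implementation (parameter names and the exact signature are simplified and may differ from the actual Haystack code):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

import numpy as np


@dataclass
class Document:
    """Stand-in for haystack.schema.Document, only for this sketch."""
    content: str
    score: Optional[float] = None


class BaseDocumentStore:
    def query_by_embedding(
        self, query_emb: np.ndarray, filters: Optional[Dict] = None, top_k: int = 10
    ) -> List[Document]:
        raise NotImplementedError

    def query_by_embedding_batch(
        self,
        query_embs: Union[List[np.ndarray], np.ndarray],
        filters: Optional[Dict] = None,
        top_k: int = 10,
    ) -> List[List[Document]]:
        # Default: fall back to one query_by_embedding call per embedding.
        # OpenSearchDocumentStore / ElasticsearchDocumentStore override this with
        # a single msearch request (one header line plus one query body per
        # embedding) to avoid one round trip per query.
        return [
            self.query_by_embedding(emb, filters=filters, top_k=top_k)
            for emb in query_embs
        ]
```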

How did you test it?

  • added tests and relied on the existing run_batch tests of pipelines and retrievers

Notes for the reviewer

Checklist

@tstadel marked this pull request as ready for review on November 30, 2022 at 18:39
@tstadel requested a review from a team as a code owner on November 30, 2022 at 18:39
@tstadel requested review from mayankjobanputra and removed the request for a team on November 30, 2022 at 18:39
body.append(headers)
body.append(cur_query_body)

logger.debug("Retriever query: %s", body)
Contributor:

What issue are we expecting to debug with this log? Do we need to add more details to debug this issue, or is this enough?

Member Author (@tstadel):


I just copied it from query_batch, I guess. But OpenSearch's and Elasticsearch's clients already produce these logs, so you can achieve the same by just running something like

logging.getLogger("opensearch").setLevel(logging.DEBUG)

I'll remove it.
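
For reference, a minimal sketch of getting the same request logging from the clients themselves (assuming the logger names "opensearch" and "elasticsearch" used by the respective Python clients):

```python
import logging

# Let the OpenSearch / Elasticsearch Python clients log every request body
# (including msearch bodies) at DEBUG level; this makes the extra
# logger.debug call in SearchEngineDocumentStore redundant.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("opensearch").setLevel(logging.DEBUG)
logging.getLogger("elasticsearch").setLevel(logging.DEBUG)
```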

scale_score=scale_score,
)
documents = document_store.query_by_embedding_batch(
query_embs=query_embeddings, # type: ignore
Contributor:

Would this stop the MultiModalRetriever from using all the document stores?

Member:


My understanding is that any document store still works because the default implementation is just to run query_by_embedding and only OpenSearchDocumentStore and ElasticsearchDocumentStore have a special implementation. Our tests use the MultiModalRetriever only with InMemoryDocumentStore if I am not mistaken. So we should double check that manually before merging. @tstadel You could try to run the tests involving MultiModalRetriever in test_retriever.py with ElasticsearchDocumentStore, maybe?

Member Author (@tstadel), Dec 7, 2022:


@julian-risch @mayankjobanputra OK, I ran all multimodal tests of test_retriever.py locally using OpenSearchDocumentStore and added another one that uses retrieve_batch. All tests passed. I added the retrieve_batch test using InMemoryDocumentStore, like all the others.
I know that # type: ignore is not really nice here, so I resolved the query_embs one by allowing for multi-dimensional np.ndarrays.
For the other one I'll open a separate PR; it would just be too big for this one.

@julian-risch (Member):

Nice improvement idea! Changes look good to me. We should do manual checks with the MultiModalRetriever as Mayank pointed out. Other than that I am curious whether you have an idea about the expected speed increase? We don't have an easy-to-run benchmark script at the moment but it might become a topic for the next quarter...

@tstadel (Member Author) commented Dec 7, 2022

> Nice improvement idea! Changes look good to me. We should do manual checks with the MultiModalRetriever as Mayank pointed out. Other than that I am curious whether you have an idea about the expected speed increase? We don't have an easy-to-run benchmark script at the moment but it might become a topic for the next quarter...

Ok, I'll try to do that this week. Regarding the speed increase, it depends on the cluster setup. But we noticed a 49% drop in total latency on a quite powerful cluster when running multi-embedding queries (i.e., we run multiple queries (up to 50) in parallel and combine the results using reciprocal rank fusion) on a dataset with ~30 million documents*. Similar results should be possible when evaluating dense retrieval pipelines using eval_batch.

  * We used OpenSearchDocumentStore's default settings, i.e., the nmslib defaults of OpenSearch's k-NN plugin.
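
For context, a generic sketch of the reciprocal rank fusion step mentioned above (not code from this PR; the constant k=60 is a common default):

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents are then sorted by their summed score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Example: fuse the top results of two parallel embedding queries.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d4", "d1"]]))
```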

@julian-risch (Member) left a review:


Looks very good to me. Nothing to add. I agree that # type: ignore can be addressed in a separate PR. Let's merge this and release the speed improvements with v1.12 in the next two weeks!

@tstadel merged commit c1c1c97 into main on Dec 8, 2022
@tstadel deleted the feat/query_by_embedding_batch branch on December 8, 2022 at 07:28
@tstadel mentioned this pull request on Dec 8, 2022

Successfully merging this pull request may close these issues.

Execute dense searches in parallel in OpenSearchDocumentStore when in batch mode