Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for BM25Retriever in InMemoryDocumentStore #3447

Closed
tholor opened this issue Oct 21, 2022 · 8 comments · Fixed by #3561
Closed

Add support for BM25Retriever in InMemoryDocumentStore #3447

tholor opened this issue Oct 21, 2022 · 8 comments · Fixed by #3561
Assignees
Labels
Contributions wanted! Looking for external contributions topic:document_store topic:retriever type:feature New feature or request

Comments

@tholor
Copy link
Member

tholor commented Oct 21, 2022

Is your feature request related to a problem? Please describe.
Many of our tutorials are using the ElasticsearchDocumentStore. While this is a good choice for a production system, it can be quite cumbersome to set up and run during your "first minutes with haystack". It would be awesome to use the InMemoryDocumentStore instead for the first steps of users (Tutorial 1, Quick Start ...). The only thing that I see holding us back here: InMemoryDocumentStore doesn't support our BM25Retriever which is a fast and easy retriever to get started with (and therefore a good choice in tutorials). Switching to another retriever might complicate tutorials, slow them down or reduce the quality of answers users get.

Describe the solution you'd like
Supporting the usage of the BM25Retriever in combination with the InMemoryDocumentStore

Describe alternatives you've considered
Using TFIDFRetriever but I am concerned about the quality of results and leading our users into a wrong direction here.

Priority
I don't think this feature is urgent but it might be a helpful step when we want to improve the early user experience

@ZanSara ZanSara added the Contributions wanted! Looking for external contributions label Oct 21, 2022
@ZanSara
Copy link
Contributor

ZanSara commented Oct 21, 2022

Some additional context (same idea from @bglearning): deepset-ai/haystack-tutorials#44 (comment)

@bogdankostic
Copy link
Contributor

BM25 was recently added to gensim: piskvorky/gensim#3304, we might use this.

@vtharmalingam
Copy link

@ZanSara: please advise me on how I can contribute to the haystack, in general, or to this issue, in particular. Thanks :)

@ZanSara
Copy link
Contributor

ZanSara commented Oct 26, 2022

Hello @vtharmalingam! Start from here: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md

@julian-risch
Copy link
Member

Hi @vtharmalingam great to hear that you would like to contribute on this issue! Feel free to open a draft pull request early on so that we can help with feedback. Please let me know if you need any other help.

@anakin87
Copy link
Member

anakin87 commented Nov 3, 2022

I am also interested in contributing to this feature!
Anyway, @vtharmalingam takes precedence.

After the discussions linked above, I want to bring some ideas.

  • I tend to think that the retriever should only query the document store. On the other hand, the ds is responsible for storing documents and their representations (such as BM25). Unlike the TF-IDF retriever implementation
  • To do that, InMemoryDocumentStore should be a subclass of KeywordDocumentStore (instead of BaseDocumentStore) and implement the methods query and query_batch.
  • I like the library rank_bm25: it is simple and lightweight. It must be said that it is less performant than Gensim at retrieval time, but maybe we could accept that, as this document store is not meant for production use cases.
  • About the practical implementation, I imagine that for every index, we could have an instance of the BM25 class:
    self.indexes[index].bm25 = BM25Okapi(tokenized_corpus)
  • A doubt: should the BM25 representation be generated by default when working with the InMemoryDocumentStore?
    It may not be essential if we are using the IMDS for dense retrieval or for TF-IDF retrieval; in those cases, BM25 computation would make the document store unnecessarily slow.

WDYT?

@ZanSara
Copy link
Contributor

ZanSara commented Nov 4, 2022

Hey @anakin87! Sounds good 🙁

  • I tend to think that the retriever should only query the document store. On the other hand, the ds is responsible for storing documents and their representations (such as BM25). Unlike the TF-IDF retriever implementation
  • To do that, InMemoryDocumentStore should be a subclass of KeywordDocumentStore (instead of BaseDocumentStore) and implement the methods query and query_batch.

Fully agree!

  • I like the library rank_bm25: it is simple and lightweight. It must be said that it is less performant than Gensim at retrieval time, but maybe we could accept that, as this document store is not meant for production use cases.

I also agree on this. It bring in no dependencies, which is a relief 😄 For now at least I imagine it would be a nice compromise. We should make sure that for workloads like the one in the tutorials it doesn't become too slow, however. Let's see after how many documents the 1s threshold for retrieval is reached.

  • About the practical implementation, I imagine that for every index, we could have an instance of the BM25 class:
    self.indexes[index].bm25 = BM25Okapi(tokenized_corpus)

I trust you on this for now, I haven't read the rank_bm25 docs yet. in principle sounds good to me 👍

  • A doubt: should the BM25 representation be generated by default when working with the InMemoryDocumentStore?
    It may not be essential if we are using the IMDS for dense retrieval or for TF-IDF retrieval; in those cases, BM25 computation would make the document store unnecessarily slow.

This issue really has no good answer in the current architecture, but so is the amazing retriever.document_store.update_embeddings(retriever=retriever) incantation, right? 🥲 My opinion is to not make it automatic. But we can do better than the update_embedding stuff above: let's make a __init__ param for this!

Something like docstore = InMemoryDocumentStore(bm25_ranks=True). Let's also leave the update_bm25_ranks() method public, so if one wants to manually update the ranks, it can still be done. But I'd like it to be an init param so one can set it and forget.

Then we can do the same to the other docstores for update_embeddings... but in another PR I guess 😄

@anakin87
Copy link
Member

I'm starting to work on this... 🛠️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Contributions wanted! Looking for external contributions topic:document_store topic:retriever type:feature New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants