Improvements to Document Stores/Retrievers docs #4955

Closed
anakin87 opened this issue May 19, 2023 · 1 comment
Labels: P1 (High priority, add to the next sprint), type:documentation (Improvements on the docs)

@anakin87
Member

The Haystack docs on Document Stores and Retrievers are extremely important for those new to the framework, but I feel they could be improved.

I will try to list some personal opinions that have come to mind, also based on my interactions with the community.

Document Stores page

  • Approximate Nearest Neighbors Search
    The list of DBs supporting ANN is probably out of date.

    • Does Elasticsearch support ANN?
    • Does Qdrant support ANN?
  • Choosing the Right Document Store
    OpenDistroElasticsearchDocumentStore is still listed, but it was removed in #4361 (chore!: remove deprecated OpenDistroElasticsearchDocumentStore).
    In general, this table should be reviewed/updated carefully.

  • Our Recommendations
    In the past, a pure Vector DB was also proposed (this was Milvus).
    After the deprecation of Milvus, we are suggesting Elasticsearch as an all-round solution (while mentioning that it can be "slow for dense retrieval with more than ~ 1 Mio documents").
    It would probably be better to also propose a vector specialist (Weaviate? Qdrant?).

FAISSDocumentStore?
(Somehow related to the previous point).
I often see people using FAISSDocumentStore, which seems to me a thin and imperfect implementation built on FAISS.
These users often run into problems.
I would suggest not promoting FAISSDocumentStore much, since more powerful, well-designed and better-integrated vector DBs are available today.

Retrievers page

  • DocumentStore Compatibility table

    • InMemory supports BM25
    • OpenDistroElasticsearch should be removed
    • DeepsetCloud appears in the table, but not in the Document Stores page
    • I would add Qdrant here, even if it is an external integration
  • DPR
    Embedding Retrieval is recommended, but DPR is still described as "a highly performant retrieval method".
    You know the NLP domain better than I do, but my impression is that most of the available DPR models perform worse than Sentence Transformers models, especially for out-of-domain retrieval (see also the BEIR paper).
    Therefore, I would probably describe DPR in less positive terms.

  • TF-IDF
    Perhaps we should add a short hint that BM25 is generally better...
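
To illustrate the kind of hint that could accompany the TF-IDF entry, here is a minimal sketch in plain Python. It is not Haystack's implementation: the function names, the toy corpus, and the parameter values (`k1=1.5`, `b=0.75`) are assumptions for illustration. Both scorers share the same IDF so that the only differences are the ones BM25 adds, namely term-frequency saturation and document-length normalization.

```python
import math
from collections import Counter

def idf(term, corpus):
    # Smoothed, non-negative IDF (the variant commonly used with BM25).
    n = sum(1 for doc in corpus if term in doc)
    return math.log(1 + (len(corpus) - n + 0.5) / (n + 0.5))

def tfidf_score(query, doc, corpus):
    # Plain TF-IDF: the score grows linearly with the raw term frequency.
    tf = Counter(doc)
    return sum(tf[t] * idf(t, corpus) for t in query)

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    # Okapi BM25: term frequency saturates (controlled by k1) and
    # documents longer than average are penalized (controlled by b).
    tf = Counter(doc)
    avgdl = sum(len(d) for d in corpus) / len(corpus)
    norm = k1 * (1 - b + b * len(doc) / avgdl)
    return sum(idf(t, corpus) * tf[t] * (k1 + 1) / (tf[t] + norm) for t in query)

# Hypothetical toy corpus: document 0 is "keyword-stuffed".
corpus = [
    ("retriever " * 8).split(),
    "retriever ranks documents".split(),
    "document stores hold indexed documents".split(),
]
query = ["retriever", "documents"]

def rank(score_fn):
    # Document indices sorted from best to worst score.
    return sorted(range(len(corpus)),
                  key=lambda i: score_fn(query, corpus[i], corpus),
                  reverse=True)

tfidf_ranking = rank(tfidf_score)
bm25_ranking = rank(bm25_score)
print("TF-IDF:", tfidf_ranking)
print("BM25:  ", bm25_ranking)
```

On this toy corpus, TF-IDF puts the keyword-stuffed document first, while BM25 prefers the short document that matches both query terms. This is only a hand-built example, not a benchmark.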

In general, I would try to make BM25 and EmbeddingRetriever more prominent and visible, moving everything else further down the page (Multihop, Table retrieval and Multimodal retrieval).


@dfokina feel free to discard my opinions if they do not make sense, and to involve other people in the discussion as well!
😃

@dfokina dfokina added the type:documentation Improvements on the docs label May 19, 2023
@julian-risch julian-risch added the P1 High priority, add to the next sprint label May 25, 2023
@dfokina
Contributor

dfokina commented Jun 27, 2023

Hi @anakin87, thank you for the recommendations, I implemented the changes in both docs :)

Just a couple of comments:

  • Decided not to recommend a vector specialist at the moment, as the best choice varies between use cases, so we are not yet comfortable pointing to any specific one in the docs.
  • I removed deepsetCloudDocumentStore from the DocumentStore Compatibility table: it is not intended for production use anyway and is specific to deepsetCloud users.

@dfokina dfokina closed this as completed Jun 27, 2023