Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add component SimilarDocumentsRetriever #7733

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bglearning
Copy link
Contributor

@bglearning bglearning commented May 23, 2024

Related Issues

Evolved from #5629 and #5666

Proposed Changes:

Addition of a new component SimilarDocumentsRetriever.

This component retrieves similar documents for each of the given documents for each preset retrievers. So, in a way it's simply a wrapper around multiple retrievers to be run on multiple documents as queries.

Usage Example:

from haystack import Document
from haystack.components.retrievers import SimilarDocumentsRetriever
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Javascript is a popular programming language"),
    Document(content="Python is a popular programming language"),
    Document(content="A chromosome is a package of DNA"),
    Document(content="DNA carries genetic information"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store, top_k=1)
sim_docs_retriever = SimilarDocumentsRetriever(retrievers=[retriever])

result = sim_docs_retriever.run(
    documents=[
        Document(content="A DNA document"),
        Document(content="A programming document"),
    ]
)

print(result["document_lists"])
# [
#   [Document(..., content: 'DNA carries genetic information', ...)],
#   [Document(..., content: 'Javascript is a popular programming language', ...)]
# ]

Background

It was conceptualized when considering the addition of FileSimilarityRetriever here. There, it's one of the components that come together to provide a file similarity functionality. Route 3 in the demonstrative Colab Notebook there.

But this component by itself should possibly be useful for other use-cases too. E.g. finding similar sets of documents to the current output set from a DocSearch pipeline.

How did you test it?

Mostly unit tests, one integration test.

Notes for the reviewer

Open to feedback at all levels, including if there could be another way to have this functionality.

Some concrete big and small things I'm not sure of:

Should this be reformulated as a different component?

  • Should it be more clear from the naming that it's a wrapper for retriever(s) and not itself a retriever?
  • Should this be generalized a bit more? E.g. Instead of just List[Document], also accept List[str] for added flexibility. And rename component to something like BatchedRetriever or GroupRetriever or MultiRetriever, although each name seems misleading in their own way.

Related to the last point, I'm a bit unsure what's the policy for flexible input and output types right now? E.g. accepting List[Document] as well as List[str] or output format (List[List[Document]] vs List[List[List[Document]]] or something else) based on a preset init argument.

Minor: Unsure what the type hint for the init argument retrievers should be? As (afaik) there is no general Retriever interface. Right now it's just List.

Checklist

@bglearning bglearning requested a review from a team as a code owner May 23, 2024 10:16
@bglearning bglearning requested review from masci and removed request for a team May 23, 2024 10:16
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels May 23, 2024
@bglearning bglearning requested a review from a team as a code owner May 23, 2024 10:23
@bglearning bglearning requested review from dfokina and removed request for a team May 23, 2024 10:23
@coveralls
Copy link
Collaborator

coveralls commented May 23, 2024

Pull Request Test Coverage Report for Build 9206339010

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 90.612%

Totals Coverage Status
Change from base Build 9204054527: 0.05%
Covered Lines: 6631
Relevant Lines: 7318

💛 - Coveralls

masci
masci previously requested changes May 23, 2024
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should go in this direction. I see two red flags here:

  1. A component taking other components in its constructor suggests a design problem somewhere
  2. There's more code needed for serialisation than the actual run

Let's take one step back and see what problems we're trying to solve:

  1. We want to use document.content as query
  2. We want to pass more than one query to a retriever

The first problem can be solved with a tiny component taking a list of documents and returning a list of doc.content. The second is a known issue that sometimes is wrongly referred to as "batching" - we attempted to solve this problem in a generic way but we never had the time to finish, maybe we should invest there?

@bglearning
Copy link
Contributor Author

bglearning commented May 23, 2024

A component taking other components in its constructor suggests a design problem somewhere

Yes, it does feel off. Essentially, as you say, it comes about because of the "batching" problem.

When trying to do it directly without this wrapper (Route 2 mentioned here) (after using an OutputAdapter for prob.1 document.content -> query) that issue (prob.2) became the roadblock.

@bglearning
Copy link
Contributor Author

@masci On the path forward, as you say, looks like better to solve the "batching" problem first.

we attempted to solve this problem in a generic way but we never had the time to finish, maybe we should invest there?

Where is it under discussion? just to know where we can follow and/or contribute. Can't seem to find the right keywords to search.

@masci masci requested a review from anakin87 June 21, 2024 11:26
@masci masci dismissed their stale review June 21, 2024 11:27

Handing over to @anakin87

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants