feat: Add component SimilarDocumentsRetriever #7733
Related Issues
Evolved from #5629 and #5666
Proposed Changes:
Addition of a new component, SimilarDocumentsRetriever. This component retrieves similar documents for each of the given documents, using each of the preset retrievers. In a way, it is simply a wrapper around multiple retrievers that runs them with multiple documents as queries.
Usage Example:
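A minimal, framework-free sketch of the idea, not the PR's actual implementation. Everything here is an assumption made for illustration: the `Document` dataclass, the hypothetical `Retriever` protocol (a stand-in for the missing general retriever interface), the toy `WordOverlapRetriever`, the `run(query, top_k)` signature, and the `document_lists` output key. The output is nested per input document and then per retriever, i.e. the `List[List[List[Document]]]` variant.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Document:
    # Stand-in for the framework's Document class (assumption).
    content: str


class Retriever(Protocol):
    # Hypothetical minimal interface; not an existing framework type.
    def run(self, query: str, top_k: int) -> Dict[str, List[Document]]: ...


class WordOverlapRetriever:
    """Toy retriever: ranks stored documents by words shared with the query."""

    def __init__(self, store: List[Document]):
        self.store = store

    def run(self, query: str, top_k: int = 2) -> Dict[str, List[Document]]:
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(doc.content.lower().split())), doc)
            for doc in self.store
        ]
        # Keep documents sharing at least one word, highest overlap first.
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return {"documents": [doc for _, doc in ranked[:top_k]]}


class SimilarDocumentsRetriever:
    """Sketch: runs every preset retriever once per input document."""

    def __init__(self, retrievers: List[Retriever], top_k: int = 2):
        self.retrievers = retrievers
        self.top_k = top_k

    def run(self, documents: List[Document]) -> Dict[str, List[List[List[Document]]]]:
        results = []
        for doc in documents:
            # Use the document's content as the query for each retriever.
            per_retriever = [
                retriever.run(query=doc.content, top_k=self.top_k)["documents"]
                for retriever in self.retrievers
            ]
            results.append(per_retriever)
        return {"document_lists": results}


store = [
    Document("red apple pie"),
    Document("green apple tart"),
    Document("blue car engine"),
]
similar = SimilarDocumentsRetriever(retrievers=[WordOverlapRetriever(store)], top_k=2)
out = similar.run([Document("apple pie recipe"), Document("car engine repair")])

for query_hits in out["document_lists"]:
    print([[d.content for d in hits] for hits in query_hits])
```

With a single retriever the middle nesting level has length one; flattening it would give the `List[List[Document]]` shape discussed in the notes for the reviewer.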
Background
It was conceptualized when considering the addition of FileSimilarityRetriever here. There, it is one of the components that come together to provide file similarity functionality (Route 3 in the demonstrative Colab Notebook there). But this component should also be useful on its own for other use cases, e.g. finding sets of documents similar to the current output set of a DocSearch pipeline.
How did you test it?
Mostly unit tests, one integration test.
Notes for the reviewer
Open to feedback at all levels, including if there could be another way to have this functionality.
Some concrete big and small things I'm not sure of:
Should this be reformulated as a different component? That is, in addition to List[Document], also accept List[str] for added flexibility, and rename the component to something like BatchedRetriever, GroupRetriever, or MultiRetriever, although each name seems misleading in its own way.
Related to the last point, I'm unsure what the current policy is for flexible input and output types. E.g. accepting List[Document] as well as List[str], or choosing the output format (List[List[Document]] vs List[List[List[Document]]] or something else) based on a preset init argument.
Minor: I'm unsure what the type hint for the init argument retrievers should be, as (afaik) there is no general Retriever interface. Right now it's just List.
Checklist
PR title uses one of the conventional commit types: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.