-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: filters type #3682
refactor: filters type #3682
Conversation
This reverts commit e8c561b.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to have this refactored. Looks much cleaner now and is easier to read. Thanks for improving the tests too! I have only two small change requests/things to check before we can merge this.
1) The if filters is not None else [{}] * len(query_embs)
was removed but I think we need it or something similar to it.
If filters is None
, we otherwise run into a NoneType is not iterable error
(see my comment)
- There were some
len(filters) != len(queries)
checks in the code that have been removed but I think they are still useful. Is the check redundant because we do them somewhere else or why should we remove the checks?
haystack/document_stores/faiss.py
Outdated
@@ -306,7 +306,7 @@ def update_embeddings( | |||
retriever: DenseRetriever, | |||
index: Optional[str] = None, | |||
update_existing_embeddings: bool = True, | |||
filters: Optional[Dict[str, Any]] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore | |||
filters: Optional[FilterType] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a couple of these # TODO: Adapt type once we allow extended filters in ...
comments but now that the type is FilterType
, the type won't need to be adapted, correct? Or what would be the type if this DocumentStore or any of the others supports extended filters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think the type won't need to be adapted anymore. Everything about filters we have in our filter_util.py module uses FilterType
. I don't think it makes sense to have something else. So I removed the comment.
@@ -453,15 +448,6 @@ def retrieve_batch( | |||
if batch_size is None: | |||
batch_size = self.batch_size | |||
|
|||
if isinstance(filters, list): | |||
if len(filters) != len(queries): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lists need to be of the same length so I would prefer to keep this check. Could you please explain why we should remove it? Is it checked somewhere else already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These checks were moved down to BaseDocumentStore.query_by_embedding_batch
and SearchEngineDocumentStore.query_by_embedding_batch
I hope I've checked any occurence. But I'll double check and add a comment on each removal to point out where the check is now happening.
if isinstance(filters, list): | ||
if len(filters) != len(queries): | ||
raise HaystackError( | ||
"Number of filters does not match number of queries. Please provide as many filters" | ||
" as queries or a single filter that will be applied to each query." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved down to BaseDocumentStore.query_by_embedding_batch
and `SearchEngineDocumentStore.query_by_embedding_batch
if isinstance(filters, list): | ||
if len(filters) != len(queries): | ||
raise HaystackError( | ||
"Number of filters does not match number of queries. Please provide as many filters" | ||
" as queries or a single filter that will be applied to each query." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved down to BaseDocumentStore.query_by_embedding_batch
and `SearchEngineDocumentStore.query_by_embedding_batch
if len(filters) != len(queries): | ||
raise MultiModalRetrieverError( | ||
"The number of filters does not match the number of queries. Provide as many filters " | ||
"as queries, or a single filter that will be applied to all queries." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved down to BaseDocumentStore.query_by_embedding_batch
and `SearchEngineDocumentStore.query_by_embedding_batch
Alright, thanks for clarifying! 👍 I will approve once the only remaining change of the comments ( |
Yes sorry, I thought I've pushed these changes already. Now they are in. Thanks for the quick review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks very good to me! 👍
Related Issues
filters
throughout the codebase have inconsistent types. This PR consolidates them. During this work, a bug inaggregate_labels
was revealed. We fix that too:aggregate_labels
cannot handle non-sequence typed filter values #3687Proposed Changes:
FilterType
from multimodal retriever and make it the standard filters type because it is a more accurate specification of the filter typeaggregate_labels
for non-sequence valued filtersget_all_labels_aggregated
integration tests intoaggregate_labels
unit testsHow did you test it?
aggregate_labels
with non-sequence valued filtersNotes for the reviewer
# type: ignore
Optional
for values if filters is aList
(filters: Optional[Union[Dict[str, Any], List[Optional[Dict[str, Any]]]]] = None,
instead offilters: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
: so now you can pass[{"name": "my-filename.txt"}, None]
to filters. I added a test for this and checked manually that all document stores can actually handle it. They do as they all support passingNone
tofilters
param of the non-batch methods.Labels.filters
cannot useFilterType
. That is becauseDict
is covariant, which means that we cannot assignDict[str, str]
toDict[str, Union[str, int]]
variables (see Dict[str, float] Incompatible with Dict[str, Union[int, float]] python/mypy#9418). To resolve that we would have to formulate each Dict type explicitly in one hugeUnion
block. The more accurate specification prohibits arbitrary types on the first level likeDict[str, Dict[Any, Any]]
andDict[str, object]
but it couldn't enforce that for the nested dicts or list values. So I guess it is better to keep it simple here. Note that this is not an issue for parameters, just for direct variable assignments.Checklist