Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: filters type #3682

Merged
merged 19 commits into from
Dec 12, 2022
Merged

refactor: filters type #3682

merged 19 commits into from
Dec 12, 2022

Conversation

tstadel
Copy link
Member

@tstadel tstadel commented Dec 8, 2022

Related Issues

filters throughout the codebase have inconsistent types. This PR consolidates them. During this work, a bug in aggregate_labels was revealed. We fix that too:

Proposed Changes:

  • take FilterType from multimodal retriever and make it the standard filters type because it is a more accurate specification of the filter type
  • fix aggregate_labels for non-sequence valued filters
  • transform existing get_all_labels_aggregated integration tests into aggregate_labels unit tests

How did you test it?

  • existing tests
  • additional tests for
    • None values in filters lists
    • aggregate_labels with non-sequence valued filters

Notes for the reviewer

  • this is a follow-up PR to feat: add query_by_embedding_batch #3546 to fix the introduced # type: ignore
  • the new typehint introduces Optional for values if filters is a List (filters: Optional[Union[Dict[str, Any], List[Optional[Dict[str, Any]]]]] = None, instead of filters: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None, : so now you can pass [{"name": "my-filename.txt"}, None] to filters. I added a test for this and checked manually that all document stores can actually handle it. They do as they all support passing None to filters param of the non-batch methods.
  • Labels.filters cannot use FilterType. That is because Dict is covariant, which means that we cannot assign Dict[str, str] to Dict[str, Union[str, int]] variables (see Dict[str, float] Incompatible with Dict[str, Union[int, float]] python/mypy#9418). To resolve that we would have to formulate each Dict type explicitly in one huge Union block. The more accurate specification prohibits arbitrary types on the first level like Dict[str, Dict[Any, Any]] and Dict[str, object] but it couldn't enforce that for the nested dicts or list values. So I guess it is better to keep it simple here. Note that this is not an issue for parameters, just for direct variable assignments.

Checklist

@tstadel tstadel requested a review from a team as a code owner December 8, 2022 11:25
@tstadel tstadel requested review from julian-risch and removed request for a team December 8, 2022 11:25
@tstadel tstadel added the type:refactor Not necessarily visible to the users label Dec 8, 2022
@tstadel tstadel marked this pull request as draft December 8, 2022 13:26
@tstadel tstadel added the type:bug Something isn't working label Dec 9, 2022
@tstadel tstadel marked this pull request as ready for review December 9, 2022 16:43
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to have this refactored. Looks much cleaner now and is easier to read. Thanks for improving the tests too! I have only two small change requests/things to check before we can merge this.
1) The if filters is not None else [{}] * len(query_embs) was removed but I think we need it or something similar to it.
If filters is None, we otherwise run into a NoneType is not iterable error (see my comment)

  1. There were some len(filters) != len(queries) checks in the code that have been removed but I think they are still useful. Is the check redundant because we do them somewhere else or why should we remove the checks?

@@ -306,7 +306,7 @@ def update_embeddings(
retriever: DenseRetriever,
index: Optional[str] = None,
update_existing_embeddings: bool = True,
filters: Optional[Dict[str, Any]] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore
filters: Optional[FilterType] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a couple of these # TODO: Adapt type once we allow extended filters in ... comments but now that the type is FilterType, the type won't need to be adapted, correct? Or what would be the type if this DocumentStore or any of the others supports extended filters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think the type won't need to be adapted anymore. Everything about filters we have in our filter_util.py module uses FilterType. I don't think it makes sense to have something else. So I removed the comment.

@@ -453,15 +448,6 @@ def retrieve_batch(
if batch_size is None:
batch_size = self.batch_size

if isinstance(filters, list):
if len(filters) != len(queries):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lists need to be of the same length so I would prefer to keep this check. Could you please explain why we should remove it? Is it checked somewhere else already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These checks were moved down to BaseDocumentStore.query_by_embedding_batch and SearchEngineDocumentStore.query_by_embedding_batch I hope I've checked any occurence. But I'll double check and add a comment on each removal to point out where the check is now happening.

Comment on lines -1090 to -1095
if isinstance(filters, list):
if len(filters) != len(queries):
raise HaystackError(
"Number of filters does not match number of queries. Please provide as many filters"
" as queries or a single filter that will be applied to each query."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down to BaseDocumentStore.query_by_embedding_batch and `SearchEngineDocumentStore.query_by_embedding_batch

Comment on lines -1792 to -1797
if isinstance(filters, list):
if len(filters) != len(queries):
raise HaystackError(
"Number of filters does not match number of queries. Please provide as many filters"
" as queries or a single filter that will be applied to each query."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down to BaseDocumentStore.query_by_embedding_batch and `SearchEngineDocumentStore.query_by_embedding_batch

Comment on lines -187 to -191
if len(filters) != len(queries):
raise MultiModalRetrieverError(
"The number of filters does not match the number of queries. Provide as many filters "
"as queries, or a single filter that will be applied to all queries."
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved down to BaseDocumentStore.query_by_embedding_batch and `SearchEngineDocumentStore.query_by_embedding_batch

@julian-risch
Copy link
Member

Alright, thanks for clarifying! 👍 I will approve once the only remaining change of the comments (# TODO: Adapt type once we allow extended filters in ...) is made.

@tstadel
Copy link
Member Author

tstadel commented Dec 12, 2022

Alright, thanks for clarifying! +1 I will approve once the only remaining change of the comments (# TODO: Adapt type once we allow extended filters in ...) is made.

Yes sorry, I thought I've pushed these changes already. Now they are in. Thanks for the quick review!

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me! 👍

@julian-risch julian-risch added this to the 1.12.0 milestone Dec 12, 2022
@tstadel tstadel merged commit 600dc2d into main Dec 12, 2022
@tstadel tstadel deleted the refactor/filters_type branch December 12, 2022 13:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:document_store type:bug Something isn't working type:refactor Not necessarily visible to the users
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aggregate_labels cannot handle non-sequence typed filter values
2 participants