refactor: filters type #3682

tstadel · 2022-12-08T11:25:23Z

Related Issues

Filters throughout the codebase have inconsistent types. This PR consolidates them. The work also revealed a bug in aggregate_labels, which is fixed here as well:

fixes aggregate_labels cannot handle non-sequence typed filter values #3687

Proposed Changes:

take FilterType from multimodal retriever and make it the standard filters type because it is a more accurate specification of the filter type
fix aggregate_labels for non-sequence valued filters
transform existing get_all_labels_aggregated integration tests into aggregate_labels unit tests

How did you test it?

existing tests
additional tests for
- None values in filters lists
- aggregate_labels with non-sequence valued filters
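The second test case covers filter values that are not sequences, for example a plain string instead of a list of strings. The following is a minimal sketch of the general idea only, not the actual aggregate_labels implementation, and it makes no claim about the exact cause of #3687. It assumes labels are grouped by a key derived from the query and its filters, so both scalar and list values must become hashable. The helper names hashable_filter_value and group_key are hypothetical.

```python
from typing import Any, Dict, Hashable, Optional, Tuple


def hashable_filter_value(value: Any) -> Hashable:
    """Turn a filter value into something usable as part of a dict key (hypothetical helper)."""
    if isinstance(value, dict):
        # Sort by key so that equal dicts produce equal keys regardless of insertion order.
        return tuple(sorted((k, hashable_filter_value(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        # Sequences become tuples, element by element.
        return tuple(hashable_filter_value(v) for v in value)
    # Scalars such as str, int, float, bool or None are assumed to be hashable already.
    return value


def group_key(query: str, filters: Optional[Dict[str, Any]]) -> Tuple[str, Hashable]:
    """Build a grouping key from a query and its (possibly missing) filters."""
    return (query, hashable_filter_value(filters or {}))


# Scalar-valued and list-valued filters can both serve as grouping keys.
groups = {
    group_key("who?", {"name": "my-filename.txt"}): ["label-1"],
    group_key("who?", {"name": ["a.txt", "b.txt"]}): ["label-2"],
}
```

Sorting nested dicts by key keeps the grouping independent of the order in which filter conditions were written.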

Notes for the reviewer

this is a follow-up PR to feat: add query_by_embedding_batch #3546 to fix the introduced # type: ignore
The new type hint introduces Optional for the entries when filters is a List: filters: Optional[Union[Dict[str, Any], List[Optional[Dict[str, Any]]]]] = None instead of filters: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None. So you can now pass [{"name": "my-filename.txt"}, None] to filters. I added a test for this and checked manually that all document stores can actually handle it. They can, because they all support passing None to the filters param of the non-batch methods.
Labels.filters cannot use FilterType. That is because Dict is invariant in its value type, which means that we cannot assign Dict[str, str] to Dict[str, Union[str, int]] variables (see "Dict[str, float] Incompatible with Dict[str, Union[int, float]]", python/mypy#9418). To resolve that we would have to spell out each Dict type explicitly in one huge Union block. The more accurate specification prohibits arbitrary types on the first level, like Dict[str, Dict[Any, Any]] and Dict[str, object], but it couldn't enforce that for the nested dicts or list values. So I guess it is better to keep it simple here. Note that this is not an issue for parameters, just for direct variable assignments.
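The typing restriction in the last note can be illustrated with a small sketch. Dict is invariant in its value type, so a variable already typed as Dict[str, str] cannot be assigned to a variable with a wider value type, while a dict literal passed directly as an argument is checked against the parameter type and is accepted. The shape of the FilterType alias below is an assumption based on the description above, not a copy of the merged code, and count_conditions is a hypothetical function. The code runs unchanged at runtime; the differences only show up under mypy.

```python
from typing import Any, Dict, List, Mapping, Union

# Assumed shape of the alias: specific types on the first level,
# unconstrained nested dicts and list values.
FilterType = Dict[str, Union[Dict[str, Any], List[Any], str, int, float, bool]]


def count_conditions(filters: FilterType) -> int:
    """Hypothetical consumer of a filter dict."""
    return len(filters)


# A dict literal passed as an argument is inferred against the parameter
# type, so mypy accepts this call.
n_literal = count_conditions({"name": "my-filename.txt"})

# A variable that is already typed as Dict[str, str] is different:
plain: Dict[str, str] = {"name": "my-filename.txt"}

# mypy rejects this assignment because Dict is invariant in its value type.
# It is harmless at runtime, hence the explicit ignore.
as_filter: FilterType = plain  # type: ignore[assignment]


# Mapping is read-only and covariant in its value type, so the same
# variable is accepted here without an ignore.
def count_conditions_read_only(filters: Mapping[str, Union[str, int]]) -> int:
    return len(filters)


n_mapping = count_conditions_read_only(plain)
```

Invariance exists because a Dict is mutable: if the assignment were allowed, code holding as_filter could insert an int value and silently break the Dict[str, str] promise made for plain.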

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added tests that demonstrate the correct behavior of the change
I've used the conventional commit convention for my PR title
I documented my code
I ran pre-commit hooks and fixed any issue

This reverts commit e8c561b.

haystack/utils/labels.py

haystack/document_stores/base.py

julian-risch

Great to have this refactored. Looks much cleaner now and is easier to read. Thanks for improving the tests too! I have only two small change requests/things to check before we can merge this.
1) The if filters is not None else [{}] * len(query_embs) was removed, but I think we need it or something similar. If filters is None, we otherwise run into a "NoneType is not iterable" error (see my comment).

2) There were some len(filters) != len(queries) checks in the code that have been removed, but I think they are still useful. Are the checks redundant because we do them somewhere else, or why should we remove them?
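Both review points concern how a batch method turns the filters argument into one filter per query. A minimal sketch of such a normalisation step follows; it is an illustration under stated assumptions, not the code in BaseDocumentStore.query_by_embedding_batch. The function name filters_per_query is hypothetical, FilterType is simplified to Dict[str, Any], missing filters are represented as None entries rather than empty dicts, and ValueError stands in for the library's own error type.

```python
from typing import Any, Dict, List, Optional, Union

FilterType = Dict[str, Any]  # simplified stand-in for the real alias


def filters_per_query(
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]],
    n_queries: int,
) -> List[Optional[FilterType]]:
    """Return exactly one (possibly None) filter per query."""
    if filters is None:
        # No filters at all: avoid iterating over None later on.
        return [None] * n_queries
    if isinstance(filters, list):
        # One filter per query: the lengths have to match.
        if len(filters) != n_queries:
            raise ValueError(
                "Number of filters does not match number of queries. Please provide as many "
                "filters as queries or a single filter that will be applied to each query."
            )
        return filters
    # A single filter dict is applied to every query.
    return [filters] * n_queries
```

Doing this once in the document store means callers such as the retrievers do not need their own length checks.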

julian-risch · 2022-12-12T08:51:08Z

haystack/document_stores/faiss.py

@@ -306,7 +306,7 @@ def update_embeddings(
 retriever: DenseRetriever,
 index: Optional[str] = None,
 update_existing_embeddings: bool = True,
- filters: Optional[Dict[str, Any]] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore
+ filters: Optional[FilterType] = None, # TODO: Adapt type once we allow extended filters in FAISSDocStore


There are a couple of these # TODO: Adapt type once we allow extended filters in ... comments but now that the type is FilterType, the type won't need to be adapted, correct? Or what would be the type if this DocumentStore or any of the others supports extended filters?

Yes I think the type won't need to be adapted anymore. Everything about filters we have in our filter_util.py module uses FilterType. I don't think it makes sense to have something else. So I removed the comment.

julian-risch · 2022-12-12T09:00:14Z

haystack/nodes/retriever/dense.py

@@ -453,15 +448,6 @@ def retrieve_batch(
 if batch_size is None:
 batch_size = self.batch_size

- if isinstance(filters, list):
- if len(filters) != len(queries):


The lists need to be of the same length so I would prefer to keep this check. Could you please explain why we should remove it? Is it checked somewhere else already?

These checks were moved down to BaseDocumentStore.query_by_embedding_batch and SearchEngineDocumentStore.query_by_embedding_batch. I hope I've caught every occurrence, but I'll double-check and add a comment on each removal to point out where the check is now happening.

tstadel · 2022-12-12T10:41:36Z

haystack/nodes/retriever/dense.py

- if isinstance(filters, list):
- if len(filters) != len(queries):
- raise HaystackError(
- "Number of filters does not match number of queries. Please provide as many filters"
- " as queries or a single filter that will be applied to each query."
- )


Moved down to BaseDocumentStore.query_by_embedding_batch and SearchEngineDocumentStore.query_by_embedding_batch

tstadel · 2022-12-12T10:41:55Z

haystack/nodes/retriever/dense.py

- if isinstance(filters, list):
- if len(filters) != len(queries):
- raise HaystackError(
- "Number of filters does not match number of queries. Please provide as many filters"
- " as queries or a single filter that will be applied to each query."
- )


Moved down to BaseDocumentStore.query_by_embedding_batch and SearchEngineDocumentStore.query_by_embedding_batch

tstadel · 2022-12-12T10:42:20Z

haystack/nodes/retriever/multimodal/retriever.py

- if len(filters) != len(queries):
- raise MultiModalRetrieverError(
- "The number of filters does not match the number of queries. Provide as many filters "
- "as queries, or a single filter that will be applied to all queries."
- )


Moved down to BaseDocumentStore.query_by_embedding_batch and SearchEngineDocumentStore.query_by_embedding_batch

julian-risch · 2022-12-12T10:47:00Z

Alright, thanks for clarifying! 👍 I will approve once the only remaining change of the comments (# TODO: Adapt type once we allow extended filters in ...) is made.

tstadel · 2022-12-12T10:52:21Z

Alright, thanks for clarifying! +1 I will approve once the only remaining change of the comments (# TODO: Adapt type once we allow extended filters in ...) is made.

Yes, sorry, I thought I had pushed these changes already. Now they are in. Thanks for the quick review!

julian-risch

Looks very good to me! 👍

tstadel added 2 commits December 8, 2022 12:05

consolidate filters type

c8461de

remove unnecessary optionals

88cf379

tstadel requested a review from a team as a code owner December 8, 2022 11:25

tstadel requested review from julian-risch and removed request for a team December 8, 2022 11:25

tstadel added the type:refactor Not necessarily visible to the users label Dec 8, 2022

tstadel added 3 commits December 8, 2022 12:35

fix mypy

46d0822

fix pylint

ec7867f

fix pylint

efbd71e

tstadel marked this pull request as draft December 8, 2022 13:26

tstadel added 9 commits December 9, 2022 09:16

move FilterType to schema

df471d2

remove Optional from FilterType

a23adc4

move to Dict[str, Any]

e8c561b

Revert "move to Dict[str, Any]"

83129b5

This reverts commit e8c561b.

fix mypy

f0f7d6a

fix pylint

70309a6

revert isort changes in elasticsearch

2a5fe20

remove todos in milvus.py

c9991c5

remove todos in sql.py

80b47b3

tstadel commented Dec 9, 2022

haystack/utils/labels.py

add aggregate_labels tests

c04966b

tstadel added the type:bug Something isn't working label Dec 9, 2022

consolidate aggregate_labels tests

931f62d

tstadel marked this pull request as ready for review December 9, 2022 16:43

julian-risch reviewed Dec 12, 2022

haystack/document_stores/base.py

julian-risch requested changes Dec 12, 2022

julian-risch added topic:retriever topic:document_store and removed topic:retriever labels Dec 12, 2022

tstadel commented Dec 12, 2022

tstadel requested a review from julian-risch December 12, 2022 10:46

remove superfluous type todos

a6a5014

remove ALL superfluous #todos

504400f

julian-risch approved these changes Dec 12, 2022

julian-risch added this to the 1.12.0 milestone Dec 12, 2022

Merge branch 'main' into refactor/filters_type

e24bbef

tstadel merged commit 600dc2d into main Dec 12, 2022

tstadel deleted the refactor/filters_type branch December 12, 2022 13:04
