Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change return types of indexing pipeline nodes #2342

Merged
merged 37 commits into from
Mar 29, 2022
Merged

Conversation

bogdankostic
Copy link
Contributor

@bogdankostic bogdankostic commented Mar 21, 2022

This PR changes the return types of the file converters, preprocessor and crawler such that they return a Document or a List[Document], respectively. Previously, these nodes returned Dict / List[Dict]. Furthermore, this PR removes the id_hash_keys property from the Document primitive, as it was never set to a value.

This PR also fixes wrong file paths in some of the Tutorials and allows to fetch .gz files in fetch_archive_from_http.

Breaking Changes

This PR introduces the following breaking changes:

  • The Crawler's run method will return a List[Document] instead of List[Dict]
  • The convert method of all Converters in the file_converter directory will return a List[Document] instead of List[Dict]
  • The PreProcessor's process, clean and split methods will return a List[Document] instead of List[Dict] or a Document instead of a Dict, respectively.
  • The convert_files_to_dicts and tika_convert_files_to_dicts methods in utils/preprocessing.py are renamed to convert_files_to_docs and tika_convert_files_to_docs, respectively, and will return a List[Document] instead of a List[Dict]

Closes #1859, closes #1920

@bogdankostic bogdankostic marked this pull request as ready for review March 23, 2022 11:20
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me so far. Only some very small changes would be nice before merging. content_type="text" is the default when initializing a Document so I'd say we don't explicitly set it (several occurrences in the code).
Other than that, please go through the tutorials and check whether they are running with your code changes. For example, tutorial 1 definitely needs some changes because convert_files_to_dicts is used, which is now convert_files_to_docs, the corresponding comment needs to be changed accordingly and the variable dicts should be called docs.
If the tutorials run without errors and the tests are passing feel free to merge! 👍

@@ -298,8 +298,6 @@ jobs:
pip install ui/

- name: Run tests
env:
PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please explain why we have this change here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to remove this in the Pinecone PR. We don't need the API key here in these tests, as we don't test pinecone inside this job but inside the test-pinecone job. (The API Key is already used there:

PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }}
)

test/test_utils.py Outdated Show resolved Hide resolved
test/test_utils.py Outdated Show resolved Hide resolved
haystack/utils/preprocessing.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/image.py Outdated Show resolved Hide resolved
@@ -10,7 +11,7 @@ class BasePreProcessor(BaseComponent):
@abstractmethod
def process(
self,
documents: Union[dict, List[dict]],
documents: Union[dict, Document, List[Union[dict, Document]]],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can in future only support List instead of single items here? Seems unintuitive to me that documents can be a single dict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a DeprecationWarning in the PreProcessor's process method that is triggered if the user does not supply a list.

haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/docx.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/azure.py Outdated Show resolved Hide resolved
@@ -483,25 +482,25 @@ def elasticsearch_index_to_document_store(
# Get content and metadata of current record
content = record["_source"].pop(original_content_field, "")
if content:
record_doc = {"content": content, "meta": {}}
record_doc = Document(content=content, meta={})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are ids handled here? Could it be that we want to set them explicitly just as they were before? For example, what if there were two documents with the same content but different ids in the elasticsearch index? Do we silently drop documents with the same content as duplicates because of no id/id_hash_keys being set here explicitly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the possibility to provide id_hash_keys here as well.

Copy link
Contributor

@ZanSara ZanSara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, this is such a welcome improvement! All my comments are really minor and negligible.

The only thing missing is updating the tutorials I believe. They make use of convert_files_to_dicts in many places, for example Tutorial 1 does:

dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

Once this is done I think it's ready for merge.

)
file_text = text.content
for table in tables:
assert isinstance(table.content, pd.DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather raise a proper HaystackError in here with a description of what's wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Agreed. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZanSara @julian-risch These assert statements are needed for the mypy checks. The Document's content field can be either of type str or pd.DataFrame. With these assert statements, we tell mypy that we are certain that table.content is of type pd.DataFrame. Otherwise, we would get a type error, because elements of type str don't have a method iterrows.

The alternative would be to use # type: ignore, but AFAIK we try to avoid these as much as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you observe the failure in mypy though? I just checked the following snippet:

from typing import Union

a: Union[str, int]
if True:
    a = 1
else:
    a = "aaa"

if not isinstance(a, str):
    raise ValueError()

a.startswith("a")

and mypy found no issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I changed the related line of code.

file_text = text + " ".join([cell for table in tables for row in table["content"] for cell in row])
file_text = text
for table in tables:
assert isinstance(table.content, pd.DataFrame)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the same holds for the other assert statements that were newly introduced. 👍

haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
test/test_preprocessor.py Show resolved Hide resolved
@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Property id_hash_keys of documents never set Change return type of File Converters to List[Document]
3 participants