Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Azure converter updates #7409

Merged
merged 13 commits into from
Apr 9, 2024
Merged

feat: Azure converter updates #7409

merged 13 commits into from
Apr 9, 2024

Conversation

vblagoje
Copy link
Member

Why:

The pull request introduces significant improvements and enhancements to the AzureOCRDocumentConverter. The motivation behind these changes is to provide greater flexibility, improved accuracy, and a wider set of features when using Azure's OCR capabilities. Specifically, it addresses the need for handling complex document layouts more effectively, including better management of tables, text paragraphs, and the overall page layout.

What:

  • Added new configuration options (preceding_context_len, following_context_len, merge_multiple_column_headers, page_layout, threshold_y) to customize the behavior of document conversion.
  • Improved handling of table conversion, including options for preceding and following context, and dealing with multiple column headers.
  • Enhanced text conversion process to support different page layouts (natural and single_column) and introduced logic to merge texts based on physical proximity.
  • Implemented the ability to convert pages to single-column format based on a threshold to group text lines.
  • Refactored conversion functions to improve readability and maintainability.

How can it be used:

  • Users can now specify the amount of context to include before and after tables in converted documents.
    converter = AzureOCRDocumentConverter(preceding_context_len=3, following_context_len=3)
  • It is possible to choose between natural order and single column layout for text pages, which is especially useful for documents with complex layouts.
    converter = AzureOCRDocumentConverter(page_layout="single_column", threshold_y=0.05)
  • Handling documents with tables that include multiple rows as headers is now more versatile, with options to merge these headers for a cleaner output.
    converter = AzureOCRDocumentConverter(merge_multiple_column_headers=True)

How did you test it:

  • Unit tests were extensively updated to cover the new functionality, reflecting the supported configurations and expected outcomes for different document structures.
  • Integration tests ensure that real-world PDFs and image files are processed accurately with the Azure AI Form Recognizer service, confirming the effectiveness of the changes in a practical context.
  • Additional test cases were included to specifically validate the handling of complex table structures, different page layouts, and the correct application of preceding and following context.

Notes for the reviewer:

  • Special attention should be given to the handling of page layouts and the merging logic for multiple column headers in tables, as these areas represent significant enhancements over the previous version.
  • Reviewers are encouraged to consider additional edge cases that may not be fully covered by the existing tests, especially in documents with highly irregular layouts or non-standard table structures.
  • Due to the introduction of new dependencies (networkx, pandas), compatibility and dependency management should be verified to avoid potential conflicts in projects using the Haystack library.

@vblagoje vblagoje requested review from a team as code owners March 22, 2024 14:02
@vblagoje vblagoje requested review from dfokina and anakin87 and removed request for a team March 22, 2024 14:02
@github-actions github-actions bot added topic:tests 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Mar 22, 2024
@vblagoje vblagoje requested review from sjrl and removed request for anakin87 March 22, 2024 14:02
@vblagoje
Copy link
Member Author

@anakin87 I'm handing over this one to @sjrl as he's the original author and has the context for this codebase

@anakin87
Copy link
Member

Ok! Let's just try to remove as many type:ignore as possible. 😉

@coveralls
Copy link
Collaborator

coveralls commented Mar 22, 2024

Pull Request Test Coverage Report for Build 8571134297

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 90 unchanged lines in 8 files lost coverage.
  • Overall coverage decreased (-0.08%) to 89.404%

Files with Coverage Reduction New Missed Lines %
components/embedders/hugging_face_tei_document_embedder.py 1 98.48%
components/embedders/hugging_face_tei_text_embedder.py 1 97.73%
components/websearch/searchapi.py 2 96.36%
core/pipeline/pipeline.py 4 93.85%
components/embedders/azure_text_embedder.py 12 65.12%
utils/device.py 21 89.9%
components/converters/azure.py 23 89.5%
components/embedders/azure_document_embedder.py 26 52.24%
Totals Coverage Status
Change from base Build 8439010873: -0.08%
Covered Lines: 5847
Relevant Lines: 6540

💛 - Coveralls

@vblagoje vblagoje changed the title feat: Azure converter feat: Azure converter updates Mar 22, 2024
@vblagoje
Copy link
Member Author

Ok! Let's just try to remove as many type:ignore as possible. 😉

Deal. And also add typing everywhere. I wanted to prepare it first so @sjrl can look at it and add some minor fixes. On Monday!

@pytest.mark.skip(
reason="fails because of non-unique column names, azure_sample_pdf_3.json has duplicate column names"
)
def test_azure_converter_with_multicolumn_header_table(self, mock_resolve_value, test_files_path) -> None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@silvanocerza this one fails

@vblagoje
Copy link
Member Author

vblagoje commented Apr 5, 2024

@sjrl I have resolved the issue of document id calculation by using custom id in these failing cases. Please have a look last two commits; cc-ing @silvanocerza whose recommendation I followed.

:returns: A hash of the DataFrame content.
"""
# take adaptive sample of rows to hash because we can have very large dataframes
hasher = hashlib.sha256()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we go with something faster like md5?
Or maybe even skip generating based on the dataframe completely and use an uuid?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just concerned using SHA256 could be too much in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed to use md5 and 5 samples, this should be ok. Thanks @silvanocerza
@sjrl would you please run your internal tests on this latest version

Copy link
Contributor

@sjrl sjrl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, looks good!

@vblagoje vblagoje merged commit 988c360 into main Apr 9, 2024
23 checks passed
@vblagoje vblagoje deleted the azure_converter branch April 9, 2024 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.x Related to Haystack v2.0 topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature parity for AzureConverter v1.9 in v2.0 + Overall Table Extraction Options
5 participants