feat: Azure converter updates #7409

vblagoje · 2024-03-22T14:02:07Z

Why:

This pull request improves the AzureOCRDocumentConverter. The goal is to provide greater flexibility, better accuracy, and a wider feature set when using Azure's OCR capabilities. Specifically, it handles complex document layouts more effectively, with better management of tables, text paragraphs, and the overall page layout.

Fixes #6468 (Feature parity for AzureConverter v1.9 in v2.0 + Overall Table Extraction Options)

What:

Added new configuration options (`preceding_context_len`, `following_context_len`, `merge_multiple_column_headers`, `page_layout`, `threshold_y`) to customize the behavior of document conversion.
Improved handling of table conversion, including options for preceding and following context and for tables with multiple column headers.
Enhanced the text conversion process to support different page layouts (`natural` and `single_column`) and introduced logic to merge texts based on physical proximity.
Implemented the ability to convert pages to single-column format, using a threshold (`threshold_y`) to group text lines.
Refactored conversion functions to improve readability and maintainability.
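
To illustrate the single-column idea, here is a minimal sketch of grouping text lines by vertical position. It is not the code from this PR: the `Line` class and the `to_single_column` function are hypothetical stand-ins, and the assumption is that lines whose y-coordinates differ by at most `threshold_y` belong to the same visual row.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for an OCR text line (not the Azure SDK type)."""

    text: str
    x: float  # left edge of the line
    y: float  # top edge of the line


def to_single_column(lines: list[Line], threshold_y: float = 0.05) -> str:
    """Group lines into rows by y-coordinate, then read rows left to right."""
    rows: list[list[Line]] = []
    for line in sorted(lines, key=lambda ln: ln.y):
        # A line joins the current row if it is within threshold_y
        # of the first line of that row; otherwise it starts a new row.
        if rows and abs(line.y - rows[-1][0].y) <= threshold_y:
            rows[-1].append(line)
        else:
            rows.append([line])
    return "\n".join(
        " ".join(ln.text for ln in sorted(row, key=lambda ln: ln.x)) for row in rows
    )
```

With this sketch, two lines at nearly the same height in different columns end up on one output row, while a larger `threshold_y` merges more aggressively.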

How can it be used:

Users can now specify the amount of context to include before and after tables in converted documents.
```
converter = AzureOCRDocumentConverter(preceding_context_len=3, following_context_len=3)
```
It is possible to choose between natural order and single column layout for text pages, which is especially useful for documents with complex layouts.
```
converter = AzureOCRDocumentConverter(page_layout="single_column", threshold_y=0.05)
```
Handling documents with tables that include multiple rows as headers is now more versatile, with options to merge these headers for a cleaner output.
```
converter = AzureOCRDocumentConverter(merge_multiple_column_headers=True)
```
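
As a rough picture of what merging multi-row headers can mean, the sketch below joins the cells of each column across all header rows. The function name `merge_header_rows` and the separator are assumptions made for illustration; the converter's actual merging logic may differ.

```python
from __future__ import annotations


def merge_header_rows(header_rows: list[list[str]], sep: str = " ") -> list[str]:
    """Join the cells of each column across all header rows, skipping empty cells."""
    merged = []
    # zip(*header_rows) iterates column by column over the header rows.
    for column_cells in zip(*header_rows):
        parts = [cell.strip() for cell in column_cells if cell.strip()]
        merged.append(sep.join(parts))
    return merged
```

For example, the header rows `["Revenue", "Revenue", ""]` and `["2022", "2023", "Notes"]` would collapse into one header row with a single label per column.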

How did you test it:

Unit tests were extensively updated to cover the new functionality, reflecting the supported configurations and expected outcomes for different document structures.
Integration tests ensure that real-world PDFs and image files are processed accurately with the Azure AI Form Recognizer service, confirming the effectiveness of the changes in a practical context.
Additional test cases were included to specifically validate the handling of complex table structures, different page layouts, and the correct application of preceding and following context.

Notes for the reviewer:

Special attention should be given to the handling of page layouts and the merging logic for multiple column headers in tables, as these areas represent significant enhancements over the previous version.
Reviewers are encouraged to consider additional edge cases that may not be fully covered by the existing tests, especially in documents with highly irregular layouts or non-standard table structures.
Due to the introduction of new dependencies (networkx, pandas), compatibility and dependency management should be verified to avoid potential conflicts in projects using the Haystack library.

vblagoje · 2024-03-22T14:03:41Z

@anakin87 I'm handing over this one to @sjrl as he's the original author and has the context for this codebase

anakin87 · 2024-03-22T14:09:20Z

Ok! Let's just try to remove as many type:ignore as possible. 😉

coveralls · 2024-03-22T14:14:34Z

Pull Request Test Coverage Report for Build 8571134297

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
90 unchanged lines in 8 files lost coverage.
Overall coverage decreased (-0.08%) to 89.404%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| components/embedders/hugging_face_tei_document_embedder.py | 1 | 98.48% |
| components/embedders/hugging_face_tei_text_embedder.py | 1 | 97.73% |
| components/websearch/searchapi.py | 2 | 96.36% |
| core/pipeline/pipeline.py | 4 | 93.85% |
| components/embedders/azure_text_embedder.py | 12 | 65.12% |
| utils/device.py | 21 | 89.9% |
| components/converters/azure.py | 23 | 89.5% |
| components/embedders/azure_document_embedder.py | 26 | 52.24% |

Totals
Change from base Build 8439010873:	-0.08%
Covered Lines:	5847
Relevant Lines:	6540

💛 - Coveralls

vblagoje · 2024-03-22T14:50:12Z

> Ok! Let's just try to remove as many type:ignore as possible. 😉

Deal. And also add typing everywhere. I wanted to prepare it first so @sjrl can look at it and add some minor fixes. On Monday!

haystack/components/converters/azure.py

vblagoje · 2024-04-04T13:47:48Z

test/components/converters/test_azure_ocr_doc_converter.py

+    @pytest.mark.skip(
+        reason="fails because of non-unique column names, azure_sample_pdf_3.json has duplicate column names"
+    )
+    def test_azure_converter_with_multicolumn_header_table(self, mock_resolve_value, test_files_path) -> None:


@silvanocerza this one fails

vblagoje · 2024-04-05T10:27:44Z

@sjrl I have resolved the document id calculation issue by using a custom id in these failing cases. Please have a look at the last two commits; cc-ing @silvanocerza, whose recommendation I followed.

silvanocerza · 2024-04-05T10:31:32Z

haystack/components/converters/azure.py

+ :returns: A hash of the DataFrame content.
+ """
+ # take adaptive sample of rows to hash because we can have very large dataframes
+ hasher = hashlib.sha256()


Can't we go with something faster like md5?
Or maybe even skip generating based on the dataframe completely and use an uuid?

I'm just concerned using SHA256 could be too much in this case.

vblagoje · 2024-04-05

Fixed to use md5 and 5 samples; this should be ok. Thanks @silvanocerza
@sjrl would you please run your internal tests on this latest version?
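
For readers following the thread, here is a minimal sketch of the "md5 and 5 samples" idea. It is not the code that was merged: `hash_rows` is a hypothetical helper that works on plain lists of strings rather than a DataFrame, and the choice of evenly spaced rows is an assumption.

```python
from __future__ import annotations

import hashlib


def hash_rows(rows: list[list[str]], max_samples: int = 5) -> str:
    """Hash the row count plus up to max_samples evenly spaced rows."""
    hasher = hashlib.md5()
    # Include the row count so tables of different length hash differently.
    hasher.update(str(len(rows)).encode("utf-8"))
    if rows:
        n = min(max_samples, len(rows))
        step = len(rows) / n
        for i in range(n):
            row = rows[int(i * step)]
            hasher.update("|".join(row).encode("utf-8"))
    return hasher.hexdigest()
```

The trade-off in this sketch is that sampling keeps hashing cheap for very large tables, but two tables that differ only in unsampled rows produce the same hash.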

sjrl

Thanks for this, looks good!

vblagoje requested review from a team as code owners March 22, 2024 14:02

vblagoje requested review from dfokina and anakin87 and removed request for a team March 22, 2024 14:02

github-actions bot added the topic:tests, 2.x, and type:documentation labels Mar 22, 2024

vblagoje requested review from sjrl and removed request for anakin87 March 22, 2024 14:02

vblagoje changed the title ~~feat: Azure converter~~ feat: Azure converter updates Mar 22, 2024

sjrl reviewed Mar 25, 2024

haystack/components/converters/azure.py (resolved)

sjrl reviewed Mar 25, 2024

haystack/components/converters/azure.py (resolved)

sjrl reviewed Mar 25, 2024

haystack/components/converters/azure.py (resolved)

vblagoje added 9 commits March 27, 2024 10:07

Initial commit

2169f82

Remove old mock tests

e5f994d

Fix current_last_page_number calculation

f1d2d24

Carry over unit tests from the other side

f267837

Update pydocs, skip failing tests

6968cf3

Fix pylint and mypy

60f7466

Minor adjustments

674fa17

Add release note

8917838

Minor touch ups

a062fbd

vblagoje force-pushed the azure_converter branch from 7bf464e to a062fbd Compare March 27, 2024 09:22

vblagoje mentioned this pull request Apr 2, 2024

Update Document id generation algorithm #7450

Closed

vblagoje commented Apr 4, 2024

vblagoje added 2 commits April 5, 2024 11:58

Resolve Document unique id issue by using custom id calculation

dc86fe2

Better hashing, add unit tests

caf58c8

silvanocerza reviewed Apr 5, 2024

vblagoje added 2 commits April 5, 2024 16:10

Small fixes

ca9c8b9

Merge branch 'main' into azure_converter

8347a14

sjrl approved these changes Apr 8, 2024

vblagoje merged commit 988c360 into main Apr 9, 2024
23 checks passed

vblagoje deleted the azure_converter branch April 9, 2024 07:45
