feat: Azure converter updates #7409
Conversation
Ok! Let's just try to remove as many
Pull Request Test Coverage Report for Build 8571134297
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
Deal. And also add typing everywhere. I wanted to prepare it first so @sjrl can look at it and add some minor fixes. On Monday!
Force-pushed from 7bf464e to a062fbd
@pytest.mark.skip(
    reason="fails because of non-unique column names, azure_sample_pdf_3.json has duplicate column names"
)
def test_azure_converter_with_multicolumn_header_table(self, mock_resolve_value, test_files_path) -> None:
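The failure mode named in the skip reason can be reproduced in isolation: pandas refuses to serialize a DataFrame with duplicate column names to JSON, which breaks any content-based serialization step. The sample data below is illustrative, and the link to the converter's id calculation is an assumption based on the skip reason, not confirmed from the converter's code.

```python
import pandas as pd

# A table whose header row repeats a column name, as happens in
# azure_sample_pdf_3.json (the duplicated name "total" is made up here).
df = pd.DataFrame([[1, 2], [3, 4]], columns=["total", "total"])

# Content-based id generation typically serializes the frame; with
# non-unique column names the default JSON serialization raises.
try:
    df.to_json()  # default orient="columns" requires unique column names
except ValueError as err:
    print(f"serialization failed: {err}")
```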
@silvanocerza this one fails
@sjrl I have resolved the issue of document id calculation by using a custom id in these failing cases. Please have a look at the last two commits; cc-ing @silvanocerza, whose recommendation I followed.
    :returns: A hash of the DataFrame content.
    """
    # take an adaptive sample of rows to hash because we can have very large dataframes
    hasher = hashlib.sha256()
Can't we go with something faster like md5? Or maybe even skip generating based on the dataframe completely and use a UUID?
I'm just concerned using SHA256 could be too much in this case.
Fixed to use md5 and 5 samples, this should be ok. Thanks @silvanocerza
@sjrl would you please run your internal tests on this latest version?
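The agreed approach (md5 over a small fixed sample of rows, rather than sha256 over the whole frame) can be sketched as below. The function name, the evenly spaced sampling scheme, and the row serialization are illustrative assumptions, not the PR's exact implementation.

```python
import hashlib

import numpy as np
import pandas as pd


def hash_dataframe(df: pd.DataFrame, n_samples: int = 5) -> str:
    """Hash a DataFrame by md5-ing a small, deterministic sample of rows.

    Sampling keeps hashing cheap for very large frames, and picking
    evenly spaced indices keeps the digest stable for the same content.
    """
    hasher = hashlib.md5()
    # Evenly spaced row indices give a deterministic "adaptive" sample.
    indices = np.linspace(0, len(df) - 1, num=min(n_samples, len(df)), dtype=int)
    for idx in indices:
        hasher.update(str(df.iloc[idx].to_dict()).encode("utf-8"))
    return hasher.hexdigest()


df = pd.DataFrame({"a": range(100), "b": range(100, 200)})
print(hash_dataframe(df))  # same content always yields the same digest
```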
Thanks for this, looks good!
Why:
The pull request introduces significant improvements and enhancements to the AzureOCRDocumentConverter. The motivation behind these changes is to provide greater flexibility, improved accuracy, and a wider set of features when using Azure's OCR capabilities. Specifically, it addresses the need to handle complex document layouts more effectively, including better management of tables, text paragraphs, and the overall page layout.

What:
- Added new parameters (preceding_context_len, following_context_len, merge_multiple_column_headers, page_layout, threshold_y) to customize the behavior of document conversion.
- Added support for two page layouts (natural and single_column) and introduced logic to merge texts based on physical proximity.

How can it be used:

How did you test it:

Notes for the reviewer:
- New dependencies (networkx, pandas): compatibility and dependency management should be verified to avoid potential conflicts in projects using the Haystack library.
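The proximity-merge idea behind the single_column layout and the threshold_y parameter can be sketched as follows: spans whose vertical positions fall within threshold_y of each other are treated as one visual line. The data structure and function name are illustrative assumptions, not the converter's actual internals.

```python
from typing import List, Tuple


def merge_by_proximity(
    spans: List[Tuple[float, str]], threshold_y: float = 0.05
) -> List[str]:
    """Group text spans whose y positions are within threshold_y of the
    current line's anchor, then join each group into one line of text.

    Each span is (y_coordinate, text); threshold_y is in the same page
    units as the coordinates (e.g. inches).
    """
    lines: List[str] = []
    current: List[str] = []
    anchor_y = None
    for y, text in sorted(spans, key=lambda s: s[0]):
        if anchor_y is None or abs(y - anchor_y) <= threshold_y:
            # Same visual line: keep the first span's y as the anchor.
            if anchor_y is None:
                anchor_y = y
            current.append(text)
        else:
            # Vertical gap exceeds the threshold: start a new line.
            lines.append(" ".join(current))
            current, anchor_y = [text], y
    if current:
        lines.append(" ".join(current))
    return lines


spans = [(1.00, "Total"), (1.02, "42"), (1.50, "Notes")]
print(merge_by_proximity(spans))  # ['Total 42', 'Notes']
```

Anchoring to the first span of each group (rather than the previous span) prevents a chain of small offsets from gradually merging distinct lines.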