-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: added excel backend #334
Conversation
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
🟢 Require two reviewer for test updatesWonderful, this rule succeeded.When test data is updated, we require two reviewers
|
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
docling/backend/msexcel_backend.py
Outdated
return k | ||
return 0 | ||
|
||
def convert_workbook(self, doc: DoclingDocument) -> DoclingDocument: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All methods that are not meant for the public API (presumably this one and various ones below), should be start with a _
as per PEP 8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docling/backend/msexcel_backend.py
Outdated
|
||
return doc | ||
|
||
def get_level(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method used somewhere? If not, let's best remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docling/backend/msexcel_backend.py
Outdated
return { | ||
"beg_row": start_row, | ||
"beg_col": start_col, | ||
"end_row": max_row, | ||
"end_col": max_col, | ||
"num_rows": max_row + 1 - start_row, | ||
"num_cols": max_col + 1 - start_col, | ||
"data": data, | ||
}, visited_cells |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use pydantic classes instead of raw dictionaries, as this keeps way easier to maintain.
Also I think several of these keys (e.g. "beg_row", "beg_col", ...) appear to be unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tests/test_msexcel.py
Outdated
xlsx_path.parent.parent / "groundtruth" / "docling_v2" / xlsx_path.name | ||
) | ||
|
||
conv_result: ConversionResult = converter.convert(xlsx_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to test also with the bytes stream
docling/backend/msexcel_backend.py
Outdated
# Parses the DOCX into a structured document model. | ||
|
||
origin = DocumentOrigin( | ||
filename=self.file.name or "file", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a default, I would suggest file.xlsx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
docling/backend/msexcel_backend.py
Outdated
from typing import Set, Tuple, Union | ||
|
||
from docling_core.types.doc import ( | ||
DocItemLabel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DocItemLabel - is unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Adding a backend to convert excel workbooks.
Resolves #258.
Checklist:
conventional commits.