Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: added excel backend #334

Merged
merged 19 commits into from
Nov 19, 2024
Merged

feat: added excel backend #334

merged 19 commits into from
Nov 19, 2024

Conversation

PeterStaar-IBM
Copy link
Contributor

@PeterStaar-IBM PeterStaar-IBM commented Nov 14, 2024

Adding a backend to convert excel workbooks.

Resolves #258.

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the
    conventional commits.
  • Documentation has been updated, if necessary. -> not necessary
  • Examples have been added, if necessary. -> not necessary
  • Tests have been added, if necessary.

Signed-off-by: Peter Staar <[email protected]>
@PeterStaar-IBM PeterStaar-IBM self-assigned this Nov 14, 2024
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Copy link

mergify bot commented Nov 15, 2024

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
@PeterStaar-IBM PeterStaar-IBM marked this pull request as ready for review November 18, 2024 08:22
return k
return 0

def convert_workbook(self, doc: DoclingDocument) -> DoclingDocument:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All methods that are not meant for the public API (presumably this one and various ones below), should be start with a _ as per PEP 8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


return doc

def get_level(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method used somewhere? If not, let's best remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 257 to 265
return {
"beg_row": start_row,
"beg_col": start_col,
"end_row": max_row,
"end_col": max_col,
"num_rows": max_row + 1 - start_row,
"num_cols": max_col + 1 - start_col,
"data": data,
}, visited_cells
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use pydantic classes instead of raw dictionaries, as this keeps way easier to maintain.
Also I think several of these keys (e.g. "beg_row", "beg_col", ...) appear to be unused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pyproject.toml Outdated Show resolved Hide resolved
docling/backend/msexcel_backend.py Outdated Show resolved Hide resolved
tests/test_msexcel.py Outdated Show resolved Hide resolved
xlsx_path.parent.parent / "groundtruth" / "docling_v2" / xlsx_path.name
)

conv_result: ConversionResult = converter.convert(xlsx_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting to test also with the bytes stream

docling/backend/msexcel_backend.py Outdated Show resolved Hide resolved
# Parses the DOCX into a structured document model.

origin = DocumentOrigin(
filename=self.file.name or "file",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a default, I would suggest file.xlsx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

from typing import Set, Tuple, Union

from docling_core.types.doc import (
DocItemLabel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DocItemLabel - is unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Peter Staar <[email protected]>
maxmnemonic
maxmnemonic previously approved these changes Nov 19, 2024
Copy link
Contributor

@dolfim-ibm dolfim-ibm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@PeterStaar-IBM PeterStaar-IBM merged commit 926dfd2 into main Nov 19, 2024
8 checks passed
@PeterStaar-IBM PeterStaar-IBM deleted the dev/add-excel-backend branch November 19, 2024 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Excel files
4 participants