feat: add DocxToDocument converter #7838

Merged
merged 25 commits into from
Jun 12, 2024

CarlosFerLo
Contributor

@CarlosFerLo CarlosFerLo commented Jun 10, 2024

Related Issues

Proposed Changes:

This PR introduces the DocxFileToDocument converter component. It uses python-docx and follows an implementation similar to the one in v1.x.

How did you test it?

I have added a new test file with tests that check the converter works correctly. The tests are modeled on those for the PyPDFToDocument converter.

Notes for the reviewer

Currently, we have three issues:

  • I do not know how to add the 'python-docx' package to haystack, nor what to write in the lazy import.
  • I have found no way to add page breaks to the resulting document, which makes a test break.
  • A normal ByteStream, created from a b-string and metadata, seems to make the python-docx library fail, as it expects only an IO byte stream corresponding to a document. I do not know how to proceed.
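
The ByteStream failure in the last point can be illustrated with a minimal sketch that uses only the standard library. It is not the code from this PR. A .docx file is a ZIP package, so an arbitrary b-string is rejected no matter how it is wrapped, while real document bytes need to be wrapped in a file-like object first. The helper name `as_docx_stream` is hypothetical.

```python
import io
import zipfile


def as_docx_stream(data: bytes) -> io.BytesIO:
    """Wrap raw bytes in a seekable, file-like object (hypothetical helper).

    Raises ValueError when the bytes are not a ZIP container, which is the
    case for an arbitrary b-string.
    """
    stream = io.BytesIO(data)
    # A .docx file is a ZIP package, so non-ZIP bytes cannot be a document.
    if not zipfile.is_zipfile(stream):
        raise ValueError("data is not a ZIP container, so it cannot be a .docx file")
    # is_zipfile() moves the read position; rewind before handing the stream on.
    stream.seek(0)
    return stream
```

With python-docx installed, the returned stream can be passed to `docx.Document(...)`, which accepts a path or a file-like object. For a Haystack ByteStream, the raw bytes would presumably come from its `data` attribute (an assumption here). Passing the ZIP check does not guarantee a valid document, so a test fixture still has to be built from the bytes of a real .docx file.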

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes ✅
  • I added unit tests and updated the docstrings ✅
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:. ✅
  • I documented my code ✅
  • I ran pre-commit hooks and fixed any issue ✅

@CarlosFerLo CarlosFerLo requested a review from a team as a code owner June 10, 2024 20:32
@CarlosFerLo CarlosFerLo requested review from Amnah199 and removed request for a team June 10, 2024 20:32
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jun 10, 2024
@CarlosFerLo CarlosFerLo requested a review from a team as a code owner June 10, 2024 20:37
@CarlosFerLo CarlosFerLo requested review from dfokina and removed request for a team June 10, 2024 20:37
@sjrl
Contributor

sjrl commented Jun 11, 2024

Thanks for the quick work on this @CarlosFerLo! Most of the comments are minor except for the page breaks. If you are willing to take a look into it, that would be great! But given its complexity, I think leaving out page break counting is okay.
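
To give a sense of why page break counting is not trivial, here is a minimal sketch that counts only explicit page breaks by reading the document XML with the standard library. It is not part of this PR and does not use python-docx. It assumes the main part sits at the conventional path `word/document.xml` and that explicit breaks are stored as `w:br` elements with `w:type="page"`. The function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_explicit_page_breaks(docx_bytes: bytes) -> int:
    """Count <w:br w:type="page"/> elements in the main document part."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Assumption: the main part lives at the conventional location.
        root = ET.fromstring(package.read("word/document.xml"))
    return sum(
        1
        for br in root.iter(f"{{{W_NS}}}br")
        # Plain line breaks have no w:type attribute and are skipped.
        if br.get(f"{{{W_NS}}}type") == "page"
    )
```

This sketch ignores soft breaks recorded by the rendering application (`w:lastRenderedPageBreak`) and breaks introduced by section properties, so its count can differ from the page count a word processor shows.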

@CarlosFerLo CarlosFerLo requested a review from sjrl June 11, 2024 09:14
@sjrl sjrl changed the title feat: add DocxFIleToDocument converter feat: add DocxToDocument converter Jun 11, 2024
@sjrl
Contributor

sjrl commented Jun 11, 2024

Hey @CarlosFerLo a few more requests:

  1. Update your base branch with main
  2. Add python-docx to the extra dependencies in the pyproject.toml to get tests to pass. See here
  3. Add docx to the modules in this docs list so your API docs will show up on the website.

@CarlosFerLo
Contributor Author

@sjrl I believe everything is set now.

@CarlosFerLo CarlosFerLo requested a review from sjrl June 11, 2024 11:52
@CarlosFerLo
Contributor Author

This way of evaluating strings in warnings seems to cause some tests to fail. I don't know why, since it is used in other components and works fine there.

assert len(docs) == 1
assert "History" in docs[0].content

@pytest.mark.skip("For now, DocxToDocument does not preserve page brakes.")
Contributor

Let's go ahead and delete this test, and instead, we can open a feature request if we like for adding page break support.

Contributor Author

@CarlosFerLo CarlosFerLo Jun 11, 2024

Okay, I will create an issue about it once this PR is resolved.

Contributor

Thanks! And in the meantime, can we delete this test?

CarlosFerLo and others added 2 commits June 11, 2024 15:16
Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Sebastian Husch Lee <[email protected]>
@coveralls
Collaborator

coveralls commented Jun 11, 2024

Pull Request Test Coverage Report for Build 9466297399

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.05%) to 89.757%

Totals Coverage Status
Change from base Build 9451690597: -0.05%
Covered Lines: 6879
Relevant Lines: 7664

💛 - Coveralls

@coveralls
Collaborator

coveralls commented Jun 11, 2024

Pull Request Test Coverage Report for Build 9466302370

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.05%) to 89.757%

Totals Coverage Status
Change from base Build 9451690597: -0.05%
Covered Lines: 6879
Relevant Lines: 7664

💛 - Coveralls

@CarlosFerLo CarlosFerLo requested a review from sjrl June 11, 2024 19:34
@coveralls
Collaborator

coveralls commented Jun 11, 2024

Pull Request Test Coverage Report for Build 9471734026

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 51 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.03%) to 89.775%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/pipeline.py        51                  65.48%
Totals Coverage Status
Change from base Build 9451690597: -0.03%
Covered Lines: 6892
Relevant Lines: 7677

💛 - Coveralls

@coveralls
Collaborator

coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9478875947

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 51 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/pipeline.py        51                  65.48%
Totals Coverage Status
Change from base Build 9451690597: -0.04%
Covered Lines: 6892
Relevant Lines: 7678

💛 - Coveralls

@coveralls
Collaborator

coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9478877690

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 51 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/pipeline.py        51                  65.48%
Totals Coverage Status
Change from base Build 9451690597: -0.04%
Covered Lines: 6892
Relevant Lines: 7678

💛 - Coveralls

@coveralls
Collaborator

coveralls commented Jun 12, 2024

Pull Request Test Coverage Report for Build 9478934650

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 51 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.04%) to 89.763%

Files with Coverage Reduction    New Missed Lines    %
core/pipeline/pipeline.py        51                  65.48%
Totals Coverage Status
Change from base Build 9451690597: -0.04%
Covered Lines: 6892
Relevant Lines: 7678

💛 - Coveralls

Contributor

@sjrl sjrl left a comment

Thanks @CarlosFerLo this looks good!

@sjrl sjrl merged commit c1c3399 into deepset-ai:main Jun 12, 2024
24 checks passed
@CarlosFerLo CarlosFerLo deleted the issue-7797 branch June 12, 2024 14:00
Development

Successfully merging this pull request may close these issues.

feat: Port (and upgrade) DocxToDocument converter from Haystack v1
3 participants