feat: Add page number to Documents coming from PDFConverters and PreProcessor #2932

bogdankostic · 2022-07-30T00:06:36Z

Related Issue(s):
Closes: #2374

Proposed changes:
This PR adds the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. In ParsrConverter and AzureConverter, setting this parameter to True adds a meta field "page" to Documents of content_type table, containing the number of the page the table occurs on.
In PreProcessor, setting this parameter to True adds a meta field "page" to Documents, containing the number of the page the Document starts on. Page breaks are determined by the "\f" character, which PDFToTextConverter, ParsrConverter and AzureConverter add between pages.
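
To illustrate the idea of deriving a start page from "\f" characters, here is a minimal standalone sketch. It is not the code from this PR: the function name split_with_page_numbers, the passage delimiter "\n\n" and the dict layout are assumptions made only for this example.

```python
from typing import Dict, List

PAGE_BREAK = "\f"  # form feed, assumed to be emitted between pages by the converter


def split_with_page_numbers(text: str) -> List[Dict]:
    """Split text into passages and record the page each passage starts on.

    Hypothetical helper for illustration only; not Haystack's implementation.
    """
    docs = []
    page = 1  # 1-based page of the position just before the current passage
    for passage in text.split("\n\n"):
        # Form feeds at the very start of a passage mean its text begins
        # on a later page than the previous passage ended on.
        leading_breaks = len(passage) - len(passage.lstrip(PAGE_BREAK))
        docs.append({"content": passage, "meta": {"page": page + leading_breaks}})
        # Every form feed inside this passage advances the page counter
        # for the passages that follow.
        page += passage.count(PAGE_BREAK)
    return docs


if __name__ == "__main__":
    sample = "intro\n\nend of page one\fstart of page two\n\nmore on page two\n\n\fpage three"
    for doc in split_with_page_numbers(sample):
        print(doc["meta"]["page"], repr(doc["content"]))
```

A passage that spans a page break is labelled with the page it starts on, which matches the behaviour described above for PreProcessor. The sketch only works if the converter emits exactly one "\f" per page boundary, empty pages included.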

Also, I noticed that we don't display API Documentation on our documentation website for ParsrConverter and AzureConverter, so I added those.

Pre-flight checklist

I have read the contributors guidelines
I have enabled actions on my fork
If this is a code change, I added tests or updated existing ones
If this is a code change, I updated the docstrings

masci

Took a first pass

test/nodes/test_preprocessor.py

haystack/nodes/preprocessor/preprocessor.py

docs/_src/api/api/file_converter.md

haystack/nodes/file_converter/azure.py

haystack/nodes/file_converter/parsr.py

haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <[email protected]>

masci

One small nit but feel free to merge as it is

test/nodes/test_preprocessor.py

haystack/nodes/preprocessor/preprocessor.py

Co-authored-by: Agnieszka Marzec <[email protected]>

brunnurs · 2022-08-24T12:14:37Z

Thanks for implementing that feature @bogdankostic, it saved me from doing it myself :-) Unfortunately, I found some wrong page numbers when using the PDFToTextConverter in combination with the PreProcessor and the new add_page_number feature. To be more specific, the pages seem to be incorrect for the attached files. I use the page number of the Document when delivering search results and found that it's sometimes off by ±3 pages. As an example, search for "financial statement" in the Tesla document and you will find a Document reported on page 81, while it is actually on page 84.

tesla_annual_report.pdf
2020_deepmind_anual_report.pdf

I'm not 100% sure whether the bug is easy to fix or whether the real solution is to store the page as metadata already when extracting the PDF, instead of doing it during preprocessing. Should I open a bug ticket?

bogdankostic · 2022-08-24T12:52:51Z

Hi @brunnurs, thanks for bringing this up! Please open an issue about this and provide some details about how you configured the PDFToTextConverter and the PreProcessor. This would help me immensely to debug this and hopefully provide a fix :)

brunnurs · 2022-08-26T07:30:11Z

@bogdankostic I tried to make the bug as easy to replicate as possible in the ticket above. Let me know if you need more information! Many thanks for your effort.

Add page number to Documents coming from PDFConverters and PreProcessor

6c41aa2

bogdankostic added the topic:preprocessing label Jul 30, 2022

bogdankostic added 7 commits July 30, 2022 02:14

Fix mypy

98580c6

Update API Docs

a2e7b82

Update API Docs

01eb7fd

Remove unused imports

60f926d

Generate JSON schema

d0fb2f6

Merge remote-tracking branch 'origin/master' into page_info_pdfs

b476f81

Generate JSON schema

a570da8

bogdankostic marked this pull request as ready for review July 30, 2022 08:18

bogdankostic requested review from masci and ju-gu July 30, 2022 09:05

sjrl added the type:feature New feature or request label Aug 1, 2022

agnieszka-m added the action:needs documentation label Aug 2, 2022

masci suggested changes Aug 5, 2022

bogdankostic added 5 commits August 8, 2022 16:21

Merge remote-tracking branch 'origin/master' into page_info_pdfs

85d3d9f

Make test variable shorter

2ea5041

Make regex a separate function

a75d5da

Move counting of page breaks to a function

745f0b6

Generate JSON schema

7296f50

bogdankostic requested review from a team as code owners August 8, 2022 23:41

bogdankostic requested a review from masci August 9, 2022 07:26

agnieszka-m requested changes Aug 9, 2022

bogdankostic and others added 2 commits August 9, 2022 09:50

Apply suggestions from code review

b8cce1f

Co-authored-by: Agnieszka Marzec <[email protected]>

Update API Documentation

cfac64e

bogdankostic changed the title ~~Add page number to Documents coming from PDFConverters and PreProcessor~~ feat: Add page number to Documents coming from PDFConverters and PreProcessor Aug 9, 2022

bogdankostic requested a review from agnieszka-m August 9, 2022 10:04

masci approved these changes Aug 9, 2022

test/nodes/test_preprocessor.py (outdated, resolved)

haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)

Don't create instance for testing staticmethod

6795ce2

agnieszka-m reviewed Aug 9, 2022

haystack/nodes/preprocessor/preprocessor.py (outdated, resolved)

Update haystack/nodes/preprocessor/preprocessor.py

0d87ab8

Co-authored-by: Agnieszka Marzec <[email protected]>

bogdankostic requested a review from agnieszka-m August 9, 2022 10:54

agnieszka-m approved these changes Aug 9, 2022

bogdankostic merged commit 5c3bfad into master Aug 9, 2022

bogdankostic deleted the page_info_pdfs branch August 9, 2022 13:55

brunnurs mentioned this pull request Aug 26, 2022

Page number in document meta data not correct #3106

Closed

1 task

brunnurs mentioned this pull request Oct 5, 2022

fix: Fix the error of wrong page numbers when documents contain empty pages. #3330

Merged

6 tasks

anakin87 mentioned this pull request Nov 18, 2022

fix: ParsrConverter fails on pages without text #3605

Merged

6 tasks

ZanSara mentioned this pull request Nov 24, 2022

Improve document conversion #3308

Closed

4 tasks
