Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add page number to Documents coming from PDFConverters and PreProcessor #2932

Merged
merged 17 commits into from
Aug 9, 2022

Conversation

bogdankostic
Copy link
Contributor

Related Issue(s):
Closes: #2374

Proposed changes:
This PR adds the parameter add_page_number to ParsrConverter, AzureConverter and PreProcessor. In ParsrConverter and AzureConverter, setting this parameter to True has the effect of adding a meta field "page" to Documents of content_type table containing the page number the table occurs in.
In PreProcessor, setting this parameter to True has the adding a meta field "page" to Documents containing the page number the Document starts at. Page breaks are determined by "\f" character, which is added by PDFToTextConverter, ParsrConverter and AzureConverter in between pages.

Also, I noticed that we don't display API Documentation on our documentation website for ParsrConverter and AzureConverter, so I added those.

Pre-flight checklist

@bogdankostic bogdankostic marked this pull request as ready for review July 30, 2022 08:18
@bogdankostic bogdankostic requested review from masci and ju-gu July 30, 2022 09:05
@sjrl sjrl added the type:feature New feature or request label Aug 1, 2022
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a frist pass

test/nodes/test_preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
@bogdankostic bogdankostic requested review from a team as code owners August 8, 2022 23:41
@bogdankostic bogdankostic requested a review from masci August 9, 2022 07:26
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
docs/_src/api/api/file_converter.md Show resolved Hide resolved
haystack/nodes/file_converter/azure.py Outdated Show resolved Hide resolved
haystack/nodes/file_converter/parsr.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
@bogdankostic bogdankostic changed the title Add page number to Documents coming from PDFConverters and PreProcessor feat: Add page number to Documents coming from PDFConverters and PreProcessor Aug 9, 2022
Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small nit but feel free to merge as it is

test/nodes/test_preprocessor.py Outdated Show resolved Hide resolved
haystack/nodes/preprocessor/preprocessor.py Outdated Show resolved Hide resolved
@bogdankostic bogdankostic merged commit 5c3bfad into master Aug 9, 2022
@bogdankostic bogdankostic deleted the page_info_pdfs branch August 9, 2022 13:55
@brunnurs
Copy link
Contributor

Thanks for implementing that feature @bogdankostic, it saved me from doing it myself :-) Unfortunately I found some wrong page numbers when using the PDFToTextConverter in combination with the PreProcessor and the new add_page_number feature. To be more specific, the pages seem to be incorrect for the files attached. I use the page number of the Document when delivering the search results and found that it's sometimes wrong by +- 3 pages. As an example search for "financial statement" in the tesla-document and you will find a Document on page 81, while the Document is actually on page 84.

tesla_annual_report.pdf
2020_deepmind_anual_report.pdf

I'm not a 100% sure if the bug is easy to fix or if the real solution is to store the page as meta data already when extracting the PDF, instead of doing it during pre processing. Should I open a bug ticket?

@bogdankostic
Copy link
Contributor Author

HI @brunnurs, thanks for bringing this up! Please open an issue about this and provide some details about the configurations you made for the PDFToTextConverter and the PreProcessor, this would help me immensely to debug this and hopefully provide a fix :)

@brunnurs
Copy link
Contributor

brunnurs commented Aug 26, 2022

@bogdankostic I tried to make the bug replicable as easy as possible in the ticket above. let me know if you need more information! Many thanks for your effort

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add page information to Document metadata when converting PDF files
5 participants