
Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 #2420

Merged: 26 commits, May 4, 2022

Conversation

@ZanSara (Contributor) commented Apr 13, 2022

Problem
The default encoding for pdftotext is Latin 1, which we used as the default for PDFToTextConverter as well. However, that encoding cannot represent many special characters, such as accented letters outside the Western European set, Cyrillic, and other non-Latin scripts.

Further, the encoding parameter has been moved from PDFToTextConverter.convert to PDFToTextConverter.__init__ so that the choice can be made in a pipeline YAML.

Solution
We change the default encoding to UTF-8.
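
For illustration, a minimal sketch of how the new default and the constructor-level parameter can be used (this assumes the parameter keeps the name encoding and the string values "Latin1" / "UTF-8"; the file path is only a placeholder taken from the test samples):

    from haystack.nodes.file_converter import PDFToTextConverter

    # UTF-8 is now the default, so no argument is needed in the common case.
    converter = PDFToTextConverter()

    # The old behavior can still be requested explicitly at construction time,
    # which also makes the choice available in a pipeline YAML definition.
    latin1_converter = PDFToTextConverter(encoding="Latin1")

    documents = converter.convert("test/samples/pdf/sample_pdf_2.pdf")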

Additional context
The documentation mentioned some special situations in which UTF-8, unlike Latin 1, would fail to interpret some glyphs correctly. It also mentioned that that specific case is covered by the tests, so if the tests pass, we can consider the issue solved and ignore the warning.

@ZanSara requested a review from mkkuemmel on April 13, 2022 17:21
@ZanSara marked this pull request as ready for review on April 13, 2022 17:21
@mkkuemmel (Member)

lgtm :)

@julian-risch (Member)

I don't think that our tests contain a check for ligatures or any other potential issues with UTF-8 / Latin 1 encoding, so the docstring is misleading. The challenge is that Latin 1 encodes the ligature ﬁ and the letter pair fi the same way, while UTF-8 doesn't. We usually would like to have the former behavior.
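
To make that concrete, here is a small standalone Python sketch (not part of the Haystack test suite) showing that the ligature is a distinct code point in Unicode, and that NFKC normalization folds it back to the plain letter pair, roughly what Latin 1 output gives for free:

    import unicodedata

    ligature = "\ufb01"    # single code point: LATIN SMALL LIGATURE FI
    letter_pair = "fi"     # two separate letters

    # As Unicode strings the two are different, which is what UTF-8 output preserves.
    assert ligature != letter_pair

    # NFKC normalization maps the ligature back to the plain letter pair,
    # roughly the behavior that Latin 1 output produces by default.
    assert unicodedata.normalize("NFKC", ligature) == letter_pair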

@ZanSara (Contributor, Author) commented Apr 19, 2022

I interpreted the docstring as saying "the PDF used in the test used to contain letter pairs that were interpreted as ligatures", so I assumed that if the PDF is now parsed correctly, the issue no longer holds. However, you make me wonder: do we ever want to detect ligatures? Shall I write a specific test for it?

In addition, I'm really puzzled about the failing preprocessor test. I don't see how it could ever have worked. My best guess is that the PDF has always been parsed incorrectly, which made "footer" never appear in the text at all, while now that the whole text is parsed correctly, the footer is found and the test fails. I have yet to test this hypothesis, so if you have a better guess, let me know.

@julian-risch (Member) commented Apr 19, 2022

I just checked on the master branch what happens if clean_header_footer is turned off:

    from haystack.nodes import PreProcessor

    # `document` is the output of converting the sample PDF used by the preprocessor test
    preprocessor = PreProcessor(clean_header_footer=False, split_by=None)
    documents = preprocessor.process(document)

    assert len(documents) == 1

    assert "This is a header." in documents[0].content
    assert "footer" in documents[0].content

Passes. So the footer is contained in the converted document.

@julian-risch (Member)

Only the final footer is not removed when the encoding is set to UTF-8. It is removed with the encoding set to Latin1 (as are all the headers in both cases).

@julian-risch (Member)

As the _find_and_remove_header_footer function searches for exact matches, maybe, with UTF-8 encoding but not with Latin1, the footer on the final page ends up with a different line ending than the footers on the other pages. That would explain the problem.
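
A hypothetical sketch of that failure mode (the strings below are made up; this is not the actual _find_and_remove_header_footer implementation, only an illustration of how an exact-match search misses a footer whose line ending differs on one page):

    # Hypothetical footer strings: identical text, different trailing character on the last page.
    common_footer = "footer: www.sample.com\n"
    last_page_footer = "footer: www.sample.com\x0c"  # ends with a form feed instead of a newline

    pages = [
        "page 1 text\n" + common_footer,
        "page 2 text\n" + common_footer,
        "last page text\n" + last_page_footer,
    ]

    # An exact substring match finds the footer on every page except the last one.
    assert [common_footer in page for page in pages] == [True, True, False]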

@ZanSara (Contributor, Author) commented Apr 19, 2022

I wrote this test to have pytest render the difference in text conversion:

from haystack.nodes.file_converter import PDFToTextConverter

def test_difference_from_encoding():
    converter = PDFToTextConverter()
    document_latin = converter.convert("test/samples/pdf/sample_pdf_2.pdf", encoding="Latin1")[0]
    document_utf = converter.convert("test/samples/pdf/sample_pdf_2.pdf")[0]

    assert document_latin.content == document_utf.content

Running this code with pytest test.py -vvv will render the diff. I can see that there are no differences in the footer, so the failure is even more mysterious. However, I see that ligatures are still an issue. I will see if I can disable or at least discourage ligature detection.

@julian-risch (Member)

There are two learnings I can add. First, setting n_last_pages_to_ignore in _find_and_remove_header_footer to 0 instead of the current value 1 doesn't work: there is a bug that leads to no headers or footers being found, and we should fix it in a separate PR. Second, adding a blank page to the PDF sample_pdf_2.pdf makes the test pass as expected, presumably because the second-to-last page of the PDF (containing a footer) is then taken into account when searching for a common footer string in all pages (except for the last, blank page).
So my hypothesis is still that the footer on the last page of the PDF is slightly different. I'll continue to check later.

@julian-risch (Member) commented Apr 29, 2022

@ZanSara Based on feedback from @Timoeller, we should make the choice of Latin1 vs. UTF-8 a parameter of the PDFToTextConverter constructor and not only of the PDFToTextConverter.convert function, so that the choice can be made in a pipeline YAML. I made that change in the commit below.

@Timoeller (Contributor) left a comment

❤️

@brandenchan (Contributor) commented May 3, 2022

In Tutorial 8, we suggest changing the encoding to UTF-8 if the results aren't good. With this PR, we should be able to remove that suggestion. Could you make this change?

@julian-risch (Member)

@brandenchan thanks for the hint. I will do that.

