
Change default encoding for PDFToTextConverter from Latin 1 to UTF-8 #2420

Merged: 26 commits, May 4, 2022

Conversation

@ZanSara (Contributor) commented Apr 13, 2022

Problem
The default encoding for pdftotext is Latin 1, which we used as the default for PDFToTextConverter as well. However, that encoding cannot represent many special characters, such as accented letters outside the Western European set, Cyrillic, and other non-Latin scripts.

Further, the encoding parameter has been moved from PDFToTextConverter.convert to PDFToTextConverter.__init__ so that the choice can be made in a pipeline YAML.

Solution
We change the default encoding to UTF-8.
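
For illustration, a minimal sketch of how the new default and the constructor-level parameter can be used (this assumes the parameter keeps the name encoding and the string values "Latin1" / "UTF-8"; the file path is only a placeholder taken from the test samples):

    from haystack.nodes.file_converter import PDFToTextConverter

    # UTF-8 is now the default, so no argument is needed in the common case.
    converter = PDFToTextConverter()

    # The old behavior can still be requested explicitly at construction time,
    # which also makes the choice available in a pipeline YAML definition.
    latin1_converter = PDFToTextConverter(encoding="Latin1")

    documents = converter.convert("test/samples/pdf/sample_pdf_2.pdf")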

Additional context
The documentation mentioned some special situations in which UTF-8, unlike Latin 1, would fail to interpret some glyphs correctly. It also mentioned that that specific case is covered by the tests, so if the tests pass, we can consider the issue solved and ignore the warning.

@ZanSara requested a review from mkkuemmel on April 13, 2022 17:21
@ZanSara marked this pull request as ready for review on April 13, 2022 17:21
@mkkuemmel (Member)

lgtm :)

@julian-risch (Member)

I don't think that our tests contain a check for ligatures or any other potential issues with UTF-8 / Latin 1 encoding, so the docstring is misleading. The challenge is that Latin 1 encodes the ligature ﬁ and the letter pair fi the same way, while UTF-8 doesn't. We usually would like to have the former behavior.
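
To make that concrete, here is a small standalone Python sketch (not part of the Haystack test suite) showing that the ligature is a distinct code point in Unicode, and that NFKC normalization folds it back to the plain letter pair, roughly what Latin 1 output gives for free:

    import unicodedata

    ligature = "\ufb01"    # single code point: LATIN SMALL LIGATURE FI
    letter_pair = "fi"     # two separate letters

    # As Unicode strings the two are different, which is what UTF-8 output preserves.
    assert ligature != letter_pair

    # NFKC normalization maps the ligature back to the plain letter pair,
    # roughly the behavior that Latin 1 output produces by default.
    assert unicodedata.normalize("NFKC", ligature) == letter_pair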

@ZanSara (Contributor, Author) commented Apr 19, 2022

I interpreted the docstring as saying "the PDF used in the test used to contain letter pairs that were interpreted as ligatures", so I assumed that if the PDF is now parsed correctly, the issue no longer holds. However, you make me wonder: do we ever want to detect ligatures? Shall I write a specific test for it?

In addition, I'm really puzzled about the failing preprocessor test. I don't see how it could ever have worked. My best guess is that the PDF has always been parsed incorrectly, which made "footer" never appear in the text at all, while now that the whole text is parsed correctly, the footer is found and the test fails. I have yet to test this hypothesis, so if you have a better guess, let me know.

@julian-risch (Member) commented Apr 19, 2022

I just checked on the master branch what happens if clean_header_footer is turned off:

    from haystack.nodes import PreProcessor

    # `document` is the output of converting the sample PDF used by the preprocessor test
    preprocessor = PreProcessor(clean_header_footer=False, split_by=None)
    documents = preprocessor.process(document)

    assert len(documents) == 1

    assert "This is a header." in documents[0].content
    assert "footer" in documents[0].content

Passes. So the footer is contained in the converted document.

@julian-risch (Member)

Only the final footer is not removed when the encoding is set to UTF-8. It is removed with the encoding set to Latin1 (as are all the headers in both cases).

@julian-risch (Member)

As the _find_and_remove_header_footer function searches for exact matches, maybe, with UTF-8 encoding but not with Latin1, the footer on the final page ends up with a different line ending than the footers on the other pages. That would explain the problem.
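
A hypothetical sketch of that failure mode (the strings below are made up; this is not the actual _find_and_remove_header_footer implementation, only an illustration of how an exact-match search misses a footer whose line ending differs on one page):

    # Hypothetical footer strings: identical text, different trailing character on the last page.
    common_footer = "footer: www.sample.com\n"
    last_page_footer = "footer: www.sample.com\x0c"  # ends with a form feed instead of a newline

    pages = [
        "page 1 text\n" + common_footer,
        "page 2 text\n" + common_footer,
        "last page text\n" + last_page_footer,
    ]

    # An exact substring match finds the footer on every page except the last one.
    assert [common_footer in page for page in pages] == [True, True, False]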

@ZanSara (Contributor, Author) commented Apr 19, 2022

I wrote this test to have pytest render the difference in text conversion:

from haystack.nodes.file_converter import PDFToTextConverter

def test_difference_from_encoding():
    converter = PDFToTextConverter()
    document_latin = converter.convert("test/samples/pdf/sample_pdf_2.pdf", encoding="Latin1")[0]
    document_utf = converter.convert("test/samples/pdf/sample_pdf_2.pdf")[0]

    assert document_latin.content == document_utf.content

Running this code with pytest test.py -vvv will render the diff. I can see that there are no differences in the footer, so the failure is even more mysterious. However, I see that ligatures are still an issue. I will see if I can disable or at least discourage ligature detection.

@julian-risch (Member)

There are two learnings I can add. First, setting n_last_pages_to_ignore in _find_and_remove_header_footer to 0 instead of the current value 1 doesn't work: there is a bug that leads to no headers or footers being found, and we should fix it in a separate PR. Second, adding a blank page to the PDF sample_pdf_2.pdf makes the test pass as expected, presumably because the second-to-last page of the PDF (containing a footer) is then taken into account when searching for a common footer string in all pages (except for the last, blank page).
So my hypothesis is still that the footer on the last page of the PDF is slightly different. I'll continue to check later.

@julian-risch (Member) commented Apr 29, 2022

@ZanSara Based on feedback from @Timoeller, we should make the choice of Latin1 vs. UTF-8 a parameter of the PDFToTextConverter constructor and not only of the PDFToTextConverter.convert function, so that the choice can be made in a pipeline YAML. I made that change in the commit below.

@Timoeller (Contributor) left a comment

❤️

@brandenchan (Contributor) commented May 3, 2022

In Tutorial 8, we suggest changing the encoding to UTF-8 if the results aren't good. With this PR, we should be able to remove that suggestion. Could you make this change?

@julian-risch (Member)

@brandenchan thanks for the hint. I will do that.

