-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change default encoding for PDFToTextConverter
from Latin 1
to UTF-8
#2420
Conversation
lgtm :) |
I interpreted the docstring as saying "the PDF used in the test used to contain letter pairs that were interpreted as ligatures", so I assumed that if the PDF would be parsed correctly, the issue won't hold anymore. However you make me question: do we ever want to detect ligatures? Shall I write a specific test for it? In addition, I'm really puzzled about the failing preprocessor test. I don't see how could it ever work. My best guess is that the PDF has always been parsed incorrectly, which made "footer" never appear in the text at all, while now that the whole text is parsed correctly, the footer is found and the test fails. I have yet to test this hypothesis, so if you have a better guess let me know. |
I just checked on master branch what happens if
Passes. So the footer is contained in the converted document. |
Only the final footer is not removed when the encoding is set to UTF-8. It is removed with the encoding set to Latin1 (as are all the headers in both cases). |
As the |
I wrote this test to have pytest render the difference in text conversion: from haystack.nodes.file_converter import PDFToTextConverter
def test_difference_from_encoding():
converter = PDFToTextConverter()
document_latin = converter.convert("test/samples/pdf/sample_pdf_2.pdf", encoding="Latin1")[0]
document_utf = converter.convert("test/samples/pdf/sample_pdf_2.pdf")[0]
assert document_latin.content == document_utf.content Running this code with |
There are two learnings I can add. First, setting |
…stack into pdf2text_default_encoding
…stack into pdf2text_default_encoding
@ZanSara Based on feedback from @Timoeller we should make the choice of Latin1 vs. UTF-8 as a parameter of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
In tutorial 8, we make the suggestion to change the encoding to UTF-8 if the results aren't good. With this PR, we should be able to remove that. Could you make this change? |
@brandenchan thanks for the hint. I will do that. |
…stack into pdf2text_default_encoding
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Problem
The default encoding for pdf2text is Latin1 which we used as a default for
PDFToTextConverter
as well. However, such encoding does not support most special characters, like accented letters, umlauts, Cyrillic, etc.Further, the encoding parameter moved from
PDFToTextConverter.convert
toPDFToTextConverter.init
so that the choice can be made in a pipeline yaml.Solution
We change the default to UTF-8
Additional context
The documentation mentioned some special situations in which UTF-8 would fail to interpret some glyphs correctly, unlike Latin 1. It also mentioned that that specific case is covered by the tests. So, if tests pass, we can consider the issue solved and ignore the warning.