New lines no longer included in extract_text() on 4.3 for a specific PDF file #2777

Closed
supertassu opened this issue Jul 27, 2024 · 2 comments
Labels
is-regression: Regression introduced as a side-effect of another change
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow

Comments

@supertassu

Hi! The ATP rankings are published as a PDF that I'm trying to parse, but since pypdf 4.3 the text returned by extract_text() no longer contains newline characters.

This worked fine on pypdf 4.2, so I ran a git bisect, which points to commit 23a81ba as the change that introduced the issue.

Environment

This is with Python 3.12.4 in a venv on Debian testing.

$ venv/bin/python3 -m platform
Linux-6.9.9-amd64-x86_64-with-glibc2.39

$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0

Code + PDF

The following PDF is the first page of the published results for Jul 22, 2024:
singles_entry_numerical_2024_07_22_firstpage.pdf

import pypdf

parser = pypdf.PdfReader("singles_entry_numerical_2024_07_22_firstpage.pdf")
page = parser.pages[0]
text = page.extract_text()

print(text)
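
For anyone who wants to turn the script above into an automated check, here is a minimal sketch. The helper names `count_text_lines` and `first_page_text` are hypothetical and not part of pypdf; only `PdfReader`, `pages` and `extract_text()` come from the script above.

```python
def count_text_lines(text: str) -> int:
    """Return the number of non-empty lines in a block of extracted text."""
    return sum(1 for line in text.splitlines() if line.strip())


def first_page_text(path: str) -> str:
    """Extract the text of the first page, using the same calls as the script above."""
    import pypdf  # imported lazily so count_text_lines() works without pypdf installed

    reader = pypdf.PdfReader(path)
    return reader.pages[0].extract_text()
```

Calling `count_text_lines(first_page_text("singles_entry_numerical_2024_07_22_firstpage.pdf"))` should give a count well above 1 when line breaks are preserved. A count of 1 would match the collapsed single-line output described in this report.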

When running this with pypdf 4.2, the extracted text contains new line characters just fine:

$ venv/bin/pip3 install pypdf==4.2.0
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py 
Rankings Date: 
Rank # Player Jul 22, 2024 Grand Slam 
Natl. Points 
Dropping Next 
Best Tourns. 
Played Points Masters 
1000 Points Other 
Points Points Total 
1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000 
[...]
Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024 

But on 4.3, new lines are no longer included:

$ venv/bin/pip3 install pypdf==4.3.1
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py 
Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points Dropping Next Best Tourns. Played Points Masters 1000 Points Other Points Points Total 1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000 [...] Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024 
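
To make the difference between the two outputs concrete, the sketch below compares excerpts copied from the 4.2.0 and 4.3.1 output shown here while ignoring all whitespace. The helper name `strip_whitespace` is made up for this illustration. It suggests that the characters themselves are unchanged and only the separators between them differ.

```python
def strip_whitespace(text: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines) from text."""
    return "".join(text.split())


# Header excerpt, copied from the 4.2.0 and 4.3.1 outputs in this report.
header_42 = "Rankings Date: \nRank # Player Jul 22, 2024 Grand Slam \nNatl. Points \n"
header_43 = "Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points "

# First ranking row; note "0 0 18" on 4.2.0 versus "0 018" on 4.3.1.
row_42 = "1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000 "
row_43 = "1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000 "

# The non-whitespace characters are identical in both versions.
same_header = strip_whitespace(header_42) == strip_whitespace(header_43)
same_row = strip_whitespace(row_42) == strip_whitespace(row_43)

# Only the 4.2.0 excerpt contains newline characters.
newlines_42 = header_42.count("\n")
newlines_43 = header_43.count("\n")

print(same_header, same_row, newlines_42, newlines_43)
```

Because the row also loses a space ("0 018"), splitting the 4.3.1 output on whitespace does not recover the original columns reliably.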

Traceback

N/A

@stefan6419846
Collaborator

Thanks for the report. This does indeed look like a regression. It should only affect content streams encoded with ASCII85, because of the change in 23a81ba#diff-185702ddcfbf2e4a9ef7106622bb77505eacae032966bba39c65ffb9cd0f9bc7R504-R505
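
As background on what an ASCII85-encoded content stream is, here is a small sketch using only Python's standard library `base64` module. The content stream bytes are invented for illustration and are not taken from the PDF in this issue, and the sketch does not reproduce pypdf's own decoder or the bug itself. It only shows the expected round trip: decoding should return the original bytes, including the newline bytes between operators.

```python
import base64

# A made-up PDF content stream (assumption, not from the attached file).
content_stream = b"BT\n/F1 12 Tf\n(Rankings Date:) Tj\nET\n"

# adobe=True wraps the encoded data in the <~ ... ~> markers used by Adobe's ASCII85 variant.
encoded = base64.a85encode(content_stream, adobe=True)

# A correct decoder returns exactly the original bytes.
decoded = base64.a85decode(encoded, adobe=True)

print(encoded)
print(decoded == content_stream)
```

If a decoder returns different bytes for such a stream, everything that parses the stream afterwards, including text extraction, works on altered input.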

@stefan6419846 added the workflow-text-extraction and is-regression labels on Jul 27, 2024
@stefan6419846
Collaborator

According to #2882 (comment), this has just been fixed.
