New lines no longer included in extract_text() on 4.3 for a specific PDF file #2777

supertassu · 2024-07-27T09:38:11Z

Hi! The ATP rankings are published as a PDF that I'm trying to parse, but since pypdf 4.3, the text returned by extract_text() no longer includes newline characters.

This worked fine on pypdf 4.2, so I did a git bisect. That suggests that this issue was introduced in commit 23a81ba.

Environment

This is with Python 3.12.4 in a venv on Debian testing.

$ venv/bin/python3 -m platform
Linux-6.9.9-amd64-x86_64-with-glibc2.39

$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0

Code + PDF

The following PDF is the first page of the published results for Jul 22, 2024:
singles_entry_numerical_2024_07_22_firstpage.pdf

import pypdf

parser = pypdf.PdfReader("singles_entry_numerical_2024_07_22_firstpage.pdf")
page = parser.pages[0]
text = page.extract_text()

print(text)

When running this with pypdf 4.2, the extracted text contains new line characters just fine:

$ venv/bin/pip3 install pypdf==4.2.0
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.2.0, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py
Rankings Date: 
Rank # Player Jul 22, 2024 Grand Slam 
Natl. Points 
Dropping Next 
Best Tourns. 
Played Points Masters 
1000 Points Other 
Points Points Total 
1 Sinner, Jannik (ITA) 9570 3380 0 0 18 3190 3000 
[...]
Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024

But on 4.3, new lines are no longer included:

$ venv/bin/pip3 install pypdf==4.3.1
$ venv/bin/python3 -c "import pypdf;print(pypdf._debug_versions)"
pypdf==4.3.1, crypt_provider=('local_crypt_fallback', '0.0.0'), PIL=10.4.0
$ venv/bin/python3 test.py 
Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points Dropping Next Best Tourns. Played Points Masters 1000 Points Other Points Points Total 1 Sinner, Jannik (ITA) 9570 3380 0 018 3190 3000 [...] Page 1 of 42 Rankings/ Numerical Order/ Complete/ Singles Report as of Jul 22, 2024
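
To tell the two behaviours apart without comparing the output by eye, you can count the lines in the extracted text. The following is only a minimal sketch: `count_text_lines` and `extract_first_page` are hypothetical helper names, not part of pypdf, and the two sample strings are shortened copies of the outputs shown above.

```python
def count_text_lines(text: str) -> int:
    """Return the number of non-empty lines in the extracted text."""
    return sum(1 for line in text.splitlines() if line.strip())


def extract_first_page(path: str) -> str:
    """Extract the text of the first page (requires pypdf to be installed)."""
    import pypdf  # imported lazily so count_text_lines works without pypdf

    reader = pypdf.PdfReader(path)
    return reader.pages[0].extract_text()


# Shortened samples of the 4.2.0 and 4.3.1 outputs from this report.
text_4_2 = "Rankings Date: \nRank # Player Jul 22, 2024 Grand Slam \nNatl. Points \n"
text_4_3 = "Rankings Date: Rank # Player Jul 22, 2024 Grand Slam Natl. Points "

print(count_text_lines(text_4_2))  # 3
print(count_text_lines(text_4_3))  # 1
```

With the real file, `count_text_lines(extract_first_page("singles_entry_numerical_2024_07_22_firstpage.pdf"))` should be well above 1 on 4.2.0 and collapse to a single line on 4.3.1, assuming the regression applies in your environment.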

Traceback

N/A

stefan6419846 · 2024-07-27T15:14:35Z

Thanks for the report. This does indeed look like a regression. It should only affect content streams encoded with Ascii85, and it was introduced by 23a81ba#diff-185702ddcfbf2e4a9ef7106622bb77505eacae032966bba39c65ffb9cd0f9bc7R504-R505
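
For readers unfamiliar with Ascii85: it is a text encoding for binary data, and a content stream stored this way has to be decoded before text can be extracted from it. The sketch below uses only Python's standard library to show a round trip on a tiny, made-up content stream. It illustrates the encoding only; it does not go through pypdf's decoder and does not reproduce the regression.

```python
import base64

# A tiny, made-up content stream; real ones are much longer.
raw = b"BT\n/F1 12 Tf\n(Rankings Date:) Tj\nET\n"

# adobe=True adds the <~ ... ~> framing of the Adobe variant of Ascii85.
encoded = base64.a85encode(raw, adobe=True)
decoded = base64.a85decode(encoded, adobe=True)

# The round trip should return the original bytes, newlines included.
print(decoded == raw)
```

The point of the sketch is that the newlines inside the stream survive a correct encode/decode round trip, so any difference in the extracted text has to come from how the decoded data is handled afterwards.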

stefan6419846 · 2024-10-03T13:15:44Z

According to #2882 (comment), this has just been fixed.

stefan6419846 added the workflow-text-extraction and is-regression labels Jul 27, 2024

ssjkamei mentioned this issue Oct 2, 2024

BUG: Issue in text extraction (spaces) (#1153) #2882

Merged

stefan6419846 closed this as completed Oct 3, 2024
