Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Handle indirect objects in font width calculations #2967

Merged
merged 8 commits into from
Nov 29, 2024

Conversation

nsw42
Copy link
Contributor

@nsw42 nsw42 commented Nov 25, 2024

Closes #2966

This handles indirect objects when calculating font widths

@nsw42 nsw42 changed the title Handle indirect objects in font width calculations BUG: Handle indirect objects in font width calculations Nov 25, 2024
@stefan6419846
Copy link
Collaborator

Thanks for the report and PR. Could you please add a corresponding test as well?

Regarding your other remark:

_get_acutual_font_widths (sic)

Feel free to fix this name. There should be no API implications of doing so due to being internal anyway.

@nsw42
Copy link
Contributor Author

nsw42 commented Nov 26, 2024

Thanks for the report and PR. Could you please add a corresponding test as well?

Good point. I'll spend some time familiarising myself with the tests in the repo over the next couple of days and update the PR.

Regarding your other remark:

_get_acutual_font_widths (sic)

Feel free to fix this name. There should be no API implications of doing so due to being internal anyway.

Will do.

Copy link

codecov bot commented Nov 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.36%. Comparing base (5b80cbb) to head (56ca08d).
Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2967   +/-   ##
=======================================
  Coverage   96.36%   96.36%           
=======================================
  Files          52       52           
  Lines        8746     8747    +1     
  Branches     1590     1591    +1     
=======================================
+ Hits         8428     8429    +1     
  Misses        190      190           
  Partials      128      128           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefan6419846
Copy link
Collaborator

Mmh, the exact issue reported in #2966 apparently is not covered by the tests. Shouldn't it reproduce it and thus find the IndirectObject there as you used your example document - or am I missing something?

@nsw42
Copy link
Contributor Author

nsw42 commented Nov 27, 2024

Mmh, the exact issue reported in #2966 apparently is not covered by the tests. Shouldn't it reproduce it and thus find the IndirectObject there as you used your example document - or am I missing something?

That's what should happen - and the code coverage shows that that is happening, except for one edge case (see below). I also confirmed that the test reproduces the problem by reverting _cmap.py to the version on main, and the test failed with the exception I reported in #2966.

Re the edge case: I added three checks against IndirectObject, one in compute_font_width and two in compute_space_width. pycov is highlighting the second code path in compute_space_width, which is the exception handler for when there's no entry for space_char in font_width_map. The code used to return font_width_map["default"] / 2 for that path, and I assumed that font_width_map["default"] could be an IndirectObject, in the same way that the exception demonstrated that font_width_map[space_char] could be. I therefore added a test against IndirectObject within the exception handler. I don't have a way of constructing a PDF that would trigger that code path, hence the lack of coverage for that line. The other two tests against IndirectObject were covered by the new test.

Hope that makes sense, but let me know if I'm wrong about the possibility of font_width_map["default"] being an IndirectObject.

@stefan6419846
Copy link
Collaborator

Skimming through the code, the only place where font_width_map["default"] is written and could be an IndirectObject seems to be https://github.com/nsw42/pypdf/blob/795be0a421cb57b3a6883573f46330d1f812fe04/pypdf/_cmap.py#L409 My recommendation would be to rewrite this section/exception handling instead and revert the currently not covered line:

        if "/DW" in ft1:
            font_width_map["default"] = cast(float, ft1["/DW"].get_object())
        else:
            font_width_map["default"] = default_font_width

Please note that I have not tested this myself, but IMHO Exception can only be KeyError here which we can prevent directly.

@nsw42
Copy link
Contributor Author

nsw42 commented Nov 28, 2024

... the only place where font_width_map["default"] is written and could be an IndirectObject seems to be https://github.com/nsw42/pypdf/blob/795be0a421cb57b3a6883573f46330d1f812fe04/pypdf/_cmap.py#L409
My recommendation would be to rewrite this section/exception handling instead and revert the currently not covered line
...

Thanks for the extra insight. Happy to revert the not covered line.

Indeed, by pointing at build_font_width_map, you've made me realise that the problem is actually in there, not in the places I changed: font_width_map (accessed in the functions changed in my original PR) has been assigned the return value from build_font_width_map, and that return value is not complying with its declared type hint of Dict[Any, float], because some of the values in the dict are IndirectObjects. So, the better approach is probably to rework the PR to avoid indirect objects in the dictionary values.

Another thing I noticed looking at build_font_width_map: the assignment to default_font_width is missing the parentheses to call get_object, meaning that the lookup in _default_fonts_space_width will always result in a KeyError. So, I will fix that.

Since raising the bug report and opening the PR, I've since seen #2286, so a variant of this problem has been seen and fixed before; so it looks like a recent regression. But, nonetheless, it seems challenging to provoke PDF libraries to create files that cover all code paths through that function. I'm therefore going to change only the code path(s) that I can see encounter an IndirectObject.

@stefan6419846
Copy link
Collaborator

Since raising the bug report and opening the PR, I've since seen #2286, so a variant of this problem has been seen and fixed before; so it looks like a recent regression. But, nonetheless, it seems challenging to provoke PDF libraries to create files that cover all code paths through that function.

Unresolved IndirectObjects indeed tend to be a recurring theme and I just stumbled upon another location this week as well (#2965). We can only solve this gradually once we see such cases in the wild and add corresponding tests to avoid these in the future. You can find lots of PDF files with use special constructs or partially violate the standard as well. Thus covering all cases without exposing confidential data is a common topic and sometimes requires extended knowledge of the specification and/or manually editing files with a text editor and similar.

About the regression part, I am not sure - this would require verification with older pypdf versions. It might just be one of the aforementioned cases we never had test coverage for (despite the corresponding line possibly be marked as covered for other reasons).

@nsw42
Copy link
Contributor Author

nsw42 commented Nov 29, 2024

Unresolved IndirectObjects indeed tend to be a recurring theme and I just stumbled upon another location this week as well (#2965). We can only solve this gradually once we see such cases in the wild and add corresponding tests to avoid these in the future.

Makes sense.

I've compromised slightly on my previous stance of only adding checks for IndirectObject that I could demonstrate with a test. The updated PR has now got two checks for IndirectObject; the isinstance(second, int) code path is covered by the test. Although I can't provoke it, it seemed sensible to add the corresponding handling of IndirectObject in the isinstance(second, list) branch a few lines later in the same loop.

@stefan6419846
Copy link
Collaborator

Thanks. I have committed some further simplifications of the code and will probably merge it later on.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the report and fix.

@stefan6419846 stefan6419846 merged commit 1383234 into py-pdf:main Nov 29, 2024
16 checks passed
@nsw42
Copy link
Contributor Author

nsw42 commented Nov 29, 2024

Thanks for the report and fix.

Thanks for your patience as I learned about PDF document internal structure and the library implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Exception on indirect object during text extraction
2 participants