Wrong letters in pdf #250

Open
ThomasCartier opened this issue Oct 29, 2023 · 2 comments
Comments

@ThomasCartier

Hi,

The following code reports the wrong letters (each one is shifted by 3 for some reason).

The culprit:
https://www.dropbox.com/scl/fi/6a8zuy70s05pntvxm0vae/test.pdf?rlkey=ylju1wbavr8rff10jp621u6bo&dl=0

It reports DEF instead of ABC.

    use lopdf::Document;

    fn main() {
        // Same result with either the "pom_parser" or the "nom_parser" feature.
        let mut doc = match Document::load("/path/to/test.pdf") {
            Ok(doc) => doc,
            Err(e) => {
                eprintln!("Failed to load PDF: {}", e);
                return;
            }
        };

        doc.decompress();

        // get_pages() maps page numbers to object IDs; extract_text() takes page numbers.
        for page_number in doc.get_pages().keys() {
            match doc.extract_text(&[*page_number]) {
                Ok(text) => println!("{}", text),
                Err(e) => println!("Nope {}", e),
            }
        }
    }

Any idea why? Is anything wrong with my code?
Thanks

@Angr1st

Angr1st commented Jun 19, 2024

Your example PDF file contains DEF as its text content but shows ABC when opened in a PDF viewer. This seems to be done by using the rg/RG operators to draw ABC manually rather than using text streams or something like that (I am no PDF expert).

@Heinenen
Collaborator

Heinenen commented Aug 9, 2024

This behavior is caused by a bug/missing feature of lopdf.

I am relatively sure that the rg/RG operators have nothing to do with this, as they seem only to set the color for whatever comes next in the content stream.
(Image: excerpt of the operator table, taken from the PDF 2.0 spec, Annex A: Operator Summary)

The relevant part of the PDF that is responsible for rendering the A is only

/Fo0S0 12.00000 Tf
<44> Tj

The first line sets the font (which is defined in the resource dictionary of the page), <44> defines the glyph that is to be rendered, and Tj tells the reader to render that glyph.
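To illustrate why the raw operand is not yet text, here is a minimal sketch (not lopdf code) that decodes a PDF hex string operand such as <44> into byte codes. The function name `decode_hex_string` is hypothetical. Reading the resulting code 0x44 directly as ASCII gives 'D', which would be consistent with the DEF output reported above, although that is an inference rather than something confirmed in lopdf's source.

```rust
/// Decode a PDF hex string operand such as "<44>" into raw byte codes.
/// Whitespace inside the brackets is ignored, and an odd number of digits
/// is padded with a trailing 0, as the PDF specification describes.
/// `decode_hex_string` is a hypothetical helper, not part of lopdf.
fn decode_hex_string(operand: &str) -> Option<Vec<u8>> {
    let inner = operand.trim().strip_prefix('<')?.strip_suffix('>')?;

    let mut digits: Vec<u8> = Vec::new();
    for c in inner.chars() {
        if c.is_ascii_whitespace() {
            continue;
        }
        // Any non-hex character makes the whole operand invalid.
        digits.push(c.to_digit(16)? as u8);
    }
    if digits.len() % 2 == 1 {
        digits.push(0);
    }

    Some(digits.chunks(2).map(|pair| (pair[0] << 4) | pair[1]).collect())
}

fn main() {
    let codes = decode_hex_string("<44>").expect("valid hex string");
    for code in &codes {
        // The code only becomes a character after going through the font's mapping.
        println!("code 0x{:02X}, read naively as ASCII: {:?}", code, *code as char);
    }
}
```

The byte code is only an index into the font; turning it into a character requires the font's Encoding or ToUnicode data.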

What lopdf would now need to do, which isn't implemented yet, is:

  • look up where the font is defined (in this case 9 0 R)
  • either parse the Encoding (at 7 0 R) or the ToUnicode cmap (at 8 0 R)

At least in this case, both contain the information needed to map <44> to the correct character.
As we can see in line 20516 of the PDF, the glyph <44> is indeed mapped to the Unicode <0041>, which is an A.
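As a rough sketch of the ToUnicode route, the snippet below parses the `beginbfchar` ... `endbfchar` section of a CMap and uses it to translate glyph codes. It is not lopdf's implementation. The CMap text is a hand-written stand-in: only the `<44> <0041>` entry comes from the PDF discussed here, and the entries for <45> and <46> are assumptions based on the DEF/ABC symptom. The helper names `parse_bfchar` and `parse_hex` are hypothetical, and `bfrange` sections and multi-unit targets are deliberately ignored.

```rust
use std::collections::HashMap;

/// Parse a token like "<0041>" into its numeric value.
/// Hypothetical helper for this sketch.
fn parse_hex(token: &str) -> Option<u32> {
    let inner = token.strip_prefix('<')?.strip_suffix('>')?;
    u32::from_str_radix(inner, 16).ok()
}

/// Collect the code -> character pairs found in `beginbfchar` sections.
/// Simplified: one pair per line, single code point targets only.
fn parse_bfchar(cmap: &str) -> HashMap<u32, char> {
    let mut map = HashMap::new();
    let mut in_section = false;

    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_section = true;
            continue;
        }
        if line == "endbfchar" {
            in_section = false;
            continue;
        }
        if !in_section {
            continue;
        }

        let mut parts = line.split_whitespace();
        if let (Some(src), Some(dst)) = (parts.next(), parts.next()) {
            let code = parse_hex(src);
            let ch = parse_hex(dst).and_then(char::from_u32);
            if let (Some(code), Some(ch)) = (code, ch) {
                map.insert(code, ch);
            }
        }
    }
    map
}

fn main() {
    // Assumed CMap excerpt; only the first mapping is taken from the issue.
    let cmap = "3 beginbfchar\n<44> <0041>\n<45> <0042>\n<46> <0043>\nendbfchar\n";
    let map = parse_bfchar(cmap);

    // Glyph codes as they would appear in the content stream.
    let codes = [0x44u32, 0x45, 0x46];
    let text: String = codes
        .iter()
        .map(|code| map.get(code).copied().unwrap_or('\u{FFFD}'))
        .collect();

    println!("{}", text); // prints "ABC"
}
```

Codes without a mapping fall back to U+FFFD here so that missing entries stay visible instead of being silently reinterpreted as ASCII.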

A solution to #125 may lay some groundwork for this issue to be solved.
