Difficulties with Multi-line headers. Rows shifted down. #470

Open
poetaster opened this issue Dec 14, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@poetaster

poetaster commented Dec 14, 2023

Describe the bug
This PDF, https://poetaster.de/misc/118.pdf (which I'm not uploading here since it may be a copyright issue), is read well overall, but camelot shifts the rows under the multi-line 'controllability' header down.

Steps to reproduce the bug

Load the above file and try both stream and lattice reading. I tried a lot of variations:

stream with different row tolerances:
dfs = camelot.read_pdf('118.pdf', flavor='stream', row_tol=20, flag_size=True)

and lattice with many scale and shift variations.

dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)

[Screenshot: dataframe]

Lattice appears to get it right:

camelot.plot(dfs[0], kind='grid').show()

[Screenshot: lattice]

The detected grid looks correct, but camelot always shifts the rows in the controllability part down.

Expected behavior

Rows should not be shifted.
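
To make the shift measurable rather than eyeballed, here is a minimal sketch that counts the blank cells at the top of each column below the header rows. It works on plain lists of rows, which can be obtained from a camelot table with dfs[0].df.values.tolist(). The helper name leading_blanks, the header_rows parameter, and the sample data are all made up for illustration; the sample does not come from 118.pdf.

```python
def leading_blanks(rows, header_rows=0):
    """Count blank cells at the top of each column, below the header rows.

    `rows` is a list of equal-length lists of cell strings.
    `header_rows` is an assumption the caller has to supply.
    """
    body = rows[header_rows:]
    if not body:
        return []
    counts = []
    for col in range(len(body[0])):
        n = 0
        for row in body:
            if str(row[col]).strip():
                break  # first non-blank cell in this column
            n += 1
        counts.append(n)
    return counts


# Made-up rows standing in for dfs[0].df.values.tolist()
rows = [
    ["Item", "Ctrl A", "Ctrl B"],
    ["A", "", ""],
    ["B", "", ""],
    ["C", "a1", "b1"],
    ["D", "a2", "b2"],
    ["", "a3", "b3"],
    ["", "a4", "b4"],
]
offsets = leading_blanks(rows, header_rows=1)
print(offsets)
```

Columns whose count is larger than their neighbours' are candidates for having been shifted down. Columns that legitimately start with empty cells would also show up, so this is only a hint.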

Code

Began with:

import camelot
dfs = camelot.read_pdf('118.pdf') 

And tried many variations, the most recent lattice one being:
dfs = camelot.read_pdf('118.pdf', flavor='lattice', shift_text=['r','t', 'r', 't'], line_scale=20)

PDF
See above.

Screenshots
See above.

Environment

  • OS: ubuntu 22
  • Python version: 3.10.12
  • Numpy version: 1.24.0
  • OpenCV version: 4.8.1.78
  • Ghostscript version: 0.7
  • Camelot version: 0.11.0

Additional context

@poetaster poetaster added the bug Something isn't working label Dec 14, 2023
@wolfassi123

@poetaster Did you manage to find a way to fix the issues with multi-row headers?

@paulobarrera14

@poetaster Pinging this again, were you able to find a fix for the multi-row headers?

@poetaster
Author

I haven't worked on this since then (I ended up reading Excel files directly for that project). I've looked at it again now, but thought I should probably update camelot first. What version would be best to test with?

@poetaster
Author

poetaster commented Jul 11, 2024

I wasn't sure if I had done the original on my PC or on my jupyter lab server. On this pc, camelot is at 0.9.0 and the results are the same.
[Screenshot: camelot]

[Screenshot: table]

@poetaster
Author

OK, updated to 0.11.0 and the results are the same. I'm not sure if it's just that I haven't understood the shifting 'foo', but even without it, camelot gets the grid correct yet shifts the content of the 'controllability' columns 2 rows down.
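
As a stopgap while the cause is unknown, the shifted columns can be moved back up after extraction. This is only a post-processing sketch, not a fix inside camelot: it assumes the offset is constant (2 here) and that the affected column indices are known. The helper shift_columns_up and the sample rows are hypothetical; real rows could come from dfs[0].df.values.tolist().

```python
def shift_columns_up(rows, columns, offset, header_rows=0):
    """Return a copy of `rows` with the given columns moved up by `offset` rows.

    Only the body (rows below `header_rows`) is touched. The top `offset`
    cells of each shifted column are discarded, and the vacated cells at
    the bottom are filled with empty strings.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    header = [list(r) for r in rows[:header_rows]]
    body = [list(r) for r in rows[header_rows:]]
    for col in columns:
        values = [r[col] for r in body]
        shifted = values[offset:] + [""] * min(offset, len(values))
        for r, v in zip(body, shifted):
            r[col] = v
    return header + body


# Made-up rows: columns 1 and 2 sit two rows too low.
rows = [
    ["Item", "Ctrl A", "Ctrl B"],
    ["A", "", ""],
    ["B", "", ""],
    ["C", "a1", "b1"],
    ["D", "a2", "b2"],
    ["", "a3", "b3"],
    ["", "a4", "b4"],
]
fixed = shift_columns_up(rows, columns=[1, 2], offset=2, header_rows=1)
for row in fixed:
    print(row)
```

The function copies its input, so the extracted rows stay available for comparison. Trailing rows that end up entirely blank are left in place and can be dropped separately if needed.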
