
How does Camelot extract tables from a PDF? Does it convert the pages to images and then run OCR?



Hey! Camelot maintainer here. You can check out this doc for details on how Camelot extracts tables from PDFs: https://camelot-py.readthedocs.io/en/master/user/how-it-work...

As pointed out in this thread, right now it only works with text-based PDFs. But there's a PR[1] that will add OCR support (using EasyOCR) for image-based PDFs at some point.

[1] https://github.com/camelot-dev/camelot/pull/209
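
For anyone who wants to try the text-based path, here is a minimal sketch, not an official recipe. It assumes camelot-py is installed and that the input is a text-based PDF. The helper name `extract_tables` and the file name `report.pdf` are made up for illustration; `camelot.read_pdf`, the `pages` argument, the `lattice`/`stream` flavors, and the `.df` attribute are the library's own.

```python
# Sketch only: assumes camelot-py is installed and the PDF is text-based
# (not a scan). `extract_tables` is a hypothetical helper, not part of Camelot.
VALID_FLAVORS = ("lattice", "stream")


def extract_tables(path, pages="1", flavor="lattice"):
    """Return a list of pandas DataFrames, one per table Camelot finds."""
    if flavor not in VALID_FLAVORS:
        raise ValueError(f"flavor must be one of {VALID_FLAVORS}, got {flavor!r}")
    # Imported lazily so the helper can be defined without Camelot present.
    import camelot

    # "lattice" targets tables drawn with ruling lines;
    # "stream" targets tables separated by whitespace.
    tables = camelot.read_pdf(path, pages=pages, flavor=flavor)
    return [tables[i].df for i in range(len(tables))]


# Example call (hypothetical file name):
#   for df in extract_tables("report.pdf", pages="1", flavor="stream"):
#       print(df.head())
```

If `lattice` finds nothing on a page, it may be worth retrying with `stream`, since the two flavors look for different visual cues.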


From the link: "Camelot only works with text-based PDFs and not scanned documents." If you have character data, using it is almost always going to be more accurate than OCR.

I don't know how OP uses it with images converted to PDFs though, as that would be just like a scan, and ImageMagick doesn't do OCR as far as I can tell.


It uses pytesseract and OpenCV, so there is image processing.


Looks like it's still in progress: https://github.com/camelot-dev/camelot/pull/209

"Update docs" isn't checked, and that's what I was going on.


Yes, I need to work on that PR; I haven't had much free time these days. It adds OCR support using EasyOCR, which I found on HN some time ago!



