🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Topics: python, pdf, machine-learning, ocr, pipeline, text-extraction, pdf-to-text, language-model, extract-text, parsr, pd3f
Updated Oct 13, 2023 - HTML
📜 Dehyphenation of broken text (mainly German), e.g., text extracted from a PDF
📑 Python package to reconstruct the original continuous text from PDFs with language models