RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
Updated Nov 1, 2024 - Python
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Table structure recognition dataset for the paper "Complicated Table Structure Recognition"
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
A Multi Purpose PDF Toolkit
Sample code for the Datalogics C++, Java, and .NET interfaces of the Adobe PDF Library
PDF text data extraction web app with OCR for scanned documents
OCR library to extract text & tables from PDF files and images. Convert any image or PDF to CSV / TXT / JSON / Searchable PDF.
CLI for extracting text from PDF files (and possibly tables)
C# and VB.NET samples for Docotic.Pdf library
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, SemanticScholar, and PDF files with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.
Standalone .NET converter library that requires neither Adobe Acrobat components nor Microsoft Office Interop assemblies to convert PDF, DOCX, XLSX, HTML, image, CSV, RTF, and TXT files in .NET Framework
The code base of the front-end of nocodefunctions.com
Batch-convert PDF to text and extract data from PDF files in Python
Simple PDF-to-text conversion in Python using PDFtk and PyPDF2
API for obtaining daily tide table data, using web scraping with PHP.
Graphlit Platform
Converts a PDF file into a text file while keeping the layout of the original PDF.
Build a RAG preprocessing pipeline