mylxsw/extractor

Extractor

Extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG implementations, where external documents must be read as text before vectorization.

Demo: https://extractor.gulu.ai

The demo may respond slowly. It is for trial use only and should not be used in production.

Installation

Start the application server using Docker

docker run -d --restart=always --name extractor \
     -p 8080:80 \
     mylxsw/extractor:1.0.0

API

Convert PDF document to plain text

curl -s -X POST http://127.0.0.1:8080/v1/extractor/file -F file=@'test.pdf'
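
For callers that prefer a script over curl, the same upload can be sketched in Python with only the standard library. This is a minimal sketch, not part of the project: the helper names `build_multipart` and `extract_file` are hypothetical, the base URL assumes the port mapping from the Docker command above, and the response format is not documented here, so the body is returned as raw text.

```python
import uuid
import urllib.request
from pathlib import Path

# Assumption: the service is reachable on the host port mapped by the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_file(path: str, base_url: str = BASE_URL) -> str:
    """POST a local file to /v1/extractor/file and return the raw response body."""
    p = Path(path)
    # The form field is named "file", matching `-F file=@...` in the curl command.
    body, content_type = build_multipart("file", p.name, p.read_bytes())
    req = urllib.request.Request(
        f"{base_url}/v1/extractor/file",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        # Assumption: the response is UTF-8 text; its exact structure is not specified here.
        return resp.read().decode("utf-8", errors="replace")
```

With the container running, `print(extract_file("test.pdf"))` would send the same request as the curl command.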

Download the document at a given URL and convert it to plain text

curl -s -X POST http://127.0.0.1:8080/v1/extractor/url -d 'url=https://example.com/test.pdf'
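
The URL endpoint takes a form-encoded `url` field, which is what `curl -d` sends. A minimal Python sketch of the same call follows. The helper names `build_url_request` and `extract_url` are hypothetical, the base URL again assumes the Docker port mapping above, and the response is treated as raw text because its format is not documented here.

```python
import urllib.parse
import urllib.request

# Assumption: the service is reachable on the host port mapped by the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_url_request(doc_url: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request for /v1/extractor/url with a form-encoded body."""
    # urlencode percent-escapes the document URL so it survives as a single form value.
    body = urllib.parse.urlencode({"url": doc_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/url",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_url(doc_url: str, base_url: str = BASE_URL) -> str:
    """Ask the service to fetch doc_url and return the raw response body."""
    req = build_url_request(doc_url, base_url)
    # A generous timeout, since the service has to download the document first.
    with urllib.request.urlopen(req, timeout=120) as resp:
        # Assumption: the response is UTF-8 text; its exact structure is not specified here.
        return resp.read().decode("utf-8", errors="replace")
```

Building the request separately from sending it makes the encoding easy to inspect before any network call is made.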

License

MIT

Copyright (c) 2024, mylxsw
