Extractor

Extractor is an HTTP service that converts PDF, Markdown, HTML, Docx, Xlsx, CSV, and other files into plain text. It is intended for RAG (retrieval-augmented generation) implementations, where external documents must be read as text before vectorization.

Demo: https://extractor.gulu.ai

The demo may respond slowly. It is for trial use only and should not be used in production.

Installation

Start the application server using Docker. The command below publishes the container's port 80 on host port 8080:

docker run -d --restart=always --name extractor \
     -p 8080:80 \
     mylxsw/extractor:1.0.0

API

Convert a PDF document to plain text:

curl -s -X POST http://127.0.0.1:8080/v1/extractor/file -F file=@'test.pdf'
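The same upload can be sketched in Python using only the standard library. This is a minimal sketch, not part of the project: the helper names `build_file_request` and `extract_file` are hypothetical, the base URL assumes the Docker port mapping shown in the Installation section, and because the response format is not documented here, the body is simply returned as text.

```python
import os
import uuid
import urllib.request

# Assumption: the server runs locally with the port mapping from the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_file_request(filename: str, content: bytes,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST with a single form field named 'file'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/file",
        data=head + content + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def extract_file(path: str, base_url: str = BASE_URL) -> str:
    """Upload a local file and return the raw response body as text."""
    with open(path, "rb") as f:
        request = build_file_request(os.path.basename(path), f.read(), base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a running server, `print(extract_file("test.pdf"))` would mirror the curl command above.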

Download the document at a URL and convert it to plain text:

curl -s -X POST http://127.0.0.1:8080/v1/extractor/url -d 'url=https://example.com/test.pdf'
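A Python equivalent of the URL request might look like the following. It is a sketch under assumptions: the helper names `build_url_request` and `extract_url` are hypothetical, the base URL assumes the Docker port mapping from the Installation section, and the body is form-encoded as `curl -d` would send it. The response format is not documented here, so the body is returned unparsed.

```python
import urllib.parse
import urllib.request

# Assumption: the server runs locally with the port mapping from the Docker command.
BASE_URL = "http://127.0.0.1:8080"


def build_url_request(document_url: str,
                      base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a form-encoded POST carrying the document URL in the 'url' field."""
    data = urllib.parse.urlencode({"url": document_url}).encode("ascii")
    return urllib.request.Request(
        f"{base_url}/v1/extractor/url",
        data=data,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def extract_url(document_url: str, base_url: str = BASE_URL) -> str:
    """Ask the service to fetch and convert a remote document; return the raw body."""
    request = build_url_request(document_url, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a running server, `print(extract_url("https://example.com/test.pdf"))` would mirror the curl command above.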

License

MIT
