GitHub - admariner/unstructured-api

General Pre-Processing Pipeline for Documents

This repo implements a pre-processing pipeline for the document types listed below. The pipeline currently recognizes the file type and chooses the relevant partition function to process the file.

Plaintext: .txt, .eml, .html, .md, .json, .rtf
Images: .jpeg, .png
Documents: .doc, .docx, .ppt, .pptx, .pdf, .odt, .epub, .csv
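
As a rough illustration of choosing a handler by file type, the sketch below maps a filename to one of the three categories listed above. It only mirrors that list by file extension; how the pipeline itself detects file types is not shown in this README, and the names SUPPORTED and category_for are hypothetical.

```python
from pathlib import Path

# Extensions copied from the list above; the grouping is for illustration only.
SUPPORTED = {
    "Plaintext": {".txt", ".eml", ".html", ".md", ".json", ".rtf"},
    "Images": {".jpeg", ".png"},
    "Documents": {".doc", ".docx", ".ppt", ".pptx", ".pdf", ".odt", ".epub", ".csv"},
}


def category_for(filename):
    """Return the category whose extension list contains the file's suffix, or None."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED.items():
        if suffix in extensions:
            return category
    return None


print(category_for("sample-docs/family-day.eml"))  # Plaintext
```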

🚀 Unstructured API

Try our hosted API! It's freely available to use with any of the filetypes listed above. This is the easiest way to get started. If you'd like to host your own version of the API, jump down to the Developer Quickstart Guide.

 curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/family-day.eml' \
  | jq -C . | less -R
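
For callers who prefer Python to curl, the request above could be built roughly as follows using only the standard library. This is a minimal sketch, not an official client: the helper name build_partition_request is hypothetical, the application/octet-stream part type is an assumption, and the example only constructs the request without sending it.

```python
import urllib.request
import uuid

API_URL = "https://api.unstructured.io/general/v0/general"  # from the curl example


def build_partition_request(url, filename, file_bytes, fields=None):
    """Build (but do not send) a multipart/form-data POST like the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields, e.g. {"strategy": "hi_res"}.
    for name, value in (fields or {}).items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The document goes in the "files" field, as in -F 'files=@...'.
    # The part's content type is an assumption made for this sketch.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={
            "accept": "application/json",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_partition_request(
    API_URL, "family-day.eml", b"Subject: Family Day\r\n\r\nHi All,"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch runs without network access.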

Parameters

PDF Strategies

Three strategies are available for processing PDF files: hi_res, fast, and auto. fast is the default strategy and works well for documents that do not have text embedded in images.

On the other hand, hi_res is the better choice for PDFs that may have text within embedded images, or when you need more precise element types in the response JSON. Be aware that, as of writing, hi_res requests may take 20 times longer to process than the fast option. See the example below for making a hi_res request.

For the best of both worlds, auto determines when a page can be extracted using fast mode and otherwise falls back to hi_res.

 curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper.pdf' \
  -F 'strategy=hi_res' \
  | jq -C . | less -R
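
The per-page fallback that auto performs can be pictured with the conceptual sketch below. It is not the service's implementation: the README does not state the exact criterion, so the boolean flag (whether a page has text embedded in images) and the function names are assumptions made for illustration.

```python
def choose_page_strategy(has_text_in_images):
    """Pick fast when a page has no text embedded in images, else hi_res."""
    return "hi_res" if has_text_in_images else "fast"


def plan_auto(pages_have_text_in_images):
    """Return the strategy that would be used for each page, in page order."""
    return [choose_page_strategy(flag) for flag in pages_have_text_in_images]


# Hypothetical three-page PDF where only page 2 contains a scanned image of text.
print(plan_auto([False, True, False]))  # ['fast', 'hi_res', 'fast']
```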

Coordinates

When elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the coordinates parameter to true to add this field to the elements in the response.

 curl -X 'POST' \
  'https://api.unstructured.io/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/layout-parser-paper.pdf' \
  -F 'coordinates=true' \
  | jq -C . | less -R
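
When building requests programmatically, it can help to validate the two parameters described above before sending anything. The following is a small sketch; the helper name form_fields is hypothetical, and it only emits the values this README documents (strategy set to hi_res, fast, or auto, and coordinates set to true).

```python
VALID_STRATEGIES = {"hi_res", "fast", "auto"}


def form_fields(strategy=None, coordinates=False):
    """Return the extra form fields to send alongside the uploaded file."""
    fields = {}
    if strategy is not None:
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"strategy must be one of {sorted(VALID_STRATEGIES)}")
        fields["strategy"] = strategy
    if coordinates:
        # Only the documented value is sent; omitting the field leaves the default.
        fields["coordinates"] = "true"
    return fields


print(form_fields(strategy="hi_res", coordinates=True))
```

Each key/value pair corresponds to one -F 'name=value' argument in the curl examples.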

Developer Quick Start

Using pyenv to manage virtualenvs is recommended.
- Mac install instructions. See here for more detailed instructions.
  - brew install pyenv-virtualenv
  - pyenv install 3.8.15
- Linux instructions are available here.
- Create a virtualenv to work in and activate it, e.g. for one named document-processing:
  
  pyenv virtualenv 3.8.15 document-processing
  pyenv activate document-processing

See the Unstructured Quick Start for the many OS dependencies that are required, if the ability to process all file types is desired.

Run make install.
If image and high-resolution PDF extraction are required, also run make install-high.
Start a local Jupyter notebook server with make run-jupyter
OR
just start the FastAPI app locally with make run-web-app.

Using the API locally

After running make run-web-app (or make docker-start-api to run in the container), you can now hit the API locally at port 8000. The sample-docs directory has a number of example file types that are currently supported.

For example:

 curl -X 'POST' \
  'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/family-day.eml' \
  | jq -C . | less -R

The response will be a list of the extracted elements:

[
  {
    "element_id": "db1ca22813f01feda8759ff04a844e56",
    "coordinates": null,
    "text": "Hi All,",
    "type": "UncategorizedText",
    "metadata": {
      "date": "2022-12-21T10:28:53-06:00",
      "sent_from": [
        "Mallori Harrell <[email protected]>"
      ],
      "sent_to": [
        "Mallori Harrell <[email protected]>"
      ],
      "subject": "Family Day",
      "filename": "family-day.eml"
    }
  },
...
...
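
Once the JSON has been parsed, the element list is straightforward to work with. The sketch below uses a trimmed copy of the element shown above (metadata reduced to two fields) and shows one possible way to collect the text, count element types, and pick out elements that carry coordinates. The function name summarize is hypothetical.

```python
import json
from collections import Counter

# Trimmed copy of the sample element from the response above.
sample_response = json.loads("""
[
  {
    "element_id": "db1ca22813f01feda8759ff04a844e56",
    "coordinates": null,
    "text": "Hi All,",
    "type": "UncategorizedText",
    "metadata": {"subject": "Family Day", "filename": "family-day.eml"}
  }
]
""")


def summarize(elements):
    """Return (texts, type counts, elements that have coordinates)."""
    texts = [element["text"] for element in elements]
    type_counts = Counter(element["type"] for element in elements)
    # "coordinates" is null unless bounding boxes were requested and available.
    with_boxes = [e for e in elements if e.get("coordinates") is not None]
    return texts, type_counts, with_boxes


texts, type_counts, with_boxes = summarize(sample_response)
print(texts)             # ['Hi All,']
print(dict(type_counts)) # {'UncategorizedText': 1}
print(len(with_boxes))   # 0
```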

Parallel Mode for PDFs

As mentioned above, processing a PDF using hi_res is currently slow. One workaround is to split the PDF into smaller files, process these asynchronously, and merge the results. You can enable parallel processing mode with the following environment variables:

UNSTRUCTURED_PARALLEL_MODE_ENABLED - set to true to process individual PDF pages remotely
UNSTRUCTURED_PARALLEL_MODE_URL - the URL to which individual PDF pages are sent asynchronously
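
The split, process, and merge idea can be sketched as below. This is a conceptual illustration rather than the server's implementation: it assumes the PDF has already been split into per-page byte strings, process_page is a hypothetical stand-in for sending one page to the API, and the page_number key added during the merge is an assumption made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_page(page_bytes):
    """Hypothetical stand-in for a remote call; returns a list of elements."""
    return [{"type": "UncategorizedText", "text": page_bytes.decode()}]


def process_in_parallel(pages, worker=process_page, max_workers=4):
    """Process pages concurrently and merge the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so pages stay sorted.
        per_page = list(pool.map(worker, pages))
    merged = []
    for page_number, elements in enumerate(per_page, start=1):
        for element in elements:
            merged.append({**element, "page_number": page_number})
    return merged


pages = [b"page one", b"page two", b"page three"]
merged = process_in_parallel(pages)
print([element["text"] for element in merged])
```

Threads suit this pattern because each worker would spend most of its time waiting on a network response.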

Generating Python files from the pipeline notebooks

You can generate the FastAPI APIs from your pipeline notebooks by running make generate-api.

💫 Instructions for using the Docker image

The following instructions are intended to help you get up and running using Docker to interact with unstructured-api. See here if you don't already have Docker installed on your machine.

NOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. docker pull should download the corresponding image for your architecture, but you can specify one with --platform (e.g. --platform linux/amd64) if needed.

We build Docker images for all pushes to main. We tag each image with the corresponding short commit hash (e.g. fbc7a69) and the application version (e.g. 0.5.5-dev1). We also tag the most recent image with latest. To leverage this, docker pull from our image repository.

docker pull quay.io/unstructured-io/unstructured-api:latest

Once pulled, you can launch the container as a web app on localhost:8000.

docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

Security Policy

See our security policy for information on how to report security vulnerabilities.

Learn more

Section	Description
Unstructured Community Github	Information about Unstructured.io community projects
Unstructured Github	Unstructured.io open source repositories
Company Website	Unstructured.io product and company info
