[pull] main from Unstructured-IO:main #22

pull · 2023-06-15T03:11:00Z

See Commits and Changes for more details.

* changelog and version
* version bump
* staged
* shorter test and lint
* pip compile
* remove detectron
* convert fast to hi_res for image
* image works locally
* convert startegy to auto if fast is passed in
* add in changelog
* yay bump unstructured
* remove convert
* remove tensorboard as well
* ci tesseract install kor
* had to sync for tesseract
* readmo done
* bump version, rm detectron2 from Makefile
* version sync

---------

Co-authored-by: Crag Wolfe <[email protected]>

* markdown
* markdown smoketest
* ppt
* pptx + sort unnittest
* changelog sort smoke test
* clean up
* clean up & version bump
* make note -> 4 support and 1 error
* don't support empty file
* add content to xml testfile ->notempty
* install pandoc in ci
* add more types
* lint
* no need to comment out csv
* docnit
* unsturctured bump
* Revert "unsturctured bump"
  This reverts commit 439db0c.
* linux only check for install pandoc
* should use shebang
* make a table for readme
* doc nit on table name

* add smoke test to compare parallel and single mode
* chore: update partition_file_via_api to use requests.post
  Replace partition_via_api with a raw request, and pass on any relevant headers using the FastAPI request.
* Make sure coordinates match in smoke test
* bump version
* Fixes to smoke test curl calls
* Add a fix for partition_image not liking strategy=fast
* generate api again
* Remove partition_image workaround
* Pull parallel smoke test into a separate script
* Remove jpeg from tests until strategy bug is fixed (CORE-1294)
* Refactor and update docstrings
* make tidy
* shellcheck fixes
* sync version again
* Remove unneeded parallel mode startup
* tidy notebooks
* Note on use_parallel_mode in the readme
* Remove use_parallel_mode parameter
* Update parallel test to run in two containers
* tidy-notebooks and shellcheck
* Remove xfailed jpeg test
* regenerate to get this line that keeps disappearing
* Remove bandaid for partition_via_api metadata
* Update test shebang
* Fix test upon change to elements_from_json
* Bump release version

* added encoding parameter and test for invalid input
* fix: param should start with m_ . ran make generate-api
* corrected passing the encoding parameter into partition calls
* added tests for both valid and invalid encodings
* merged and updated changelog and readme parameters section
* reflecting comment feedback: bumped version, used api in readme, added test for different encodings
* readme nit
* lint
* update yaml version
* bump unstructured version and lints
* added encoding parameter to parallel mode smoke test

* change and tidy notebook
* changelog version bump
* gernate api
* friendly input
* adds test
* stick with valid input
* move param location
* Revert "adds test"
  This reverts commit 5d8296a.
* Revert "change and tidy notebook"
  This reverts commit 56e2bf1.
* move to unittest
* add readme
* Revert "move to unittest"
  This reverts commit 4319718.
* Revert "Revert "adds test""
  This reverts commit 54b55ab.
* stage
* bump ust 0.7.10->11
* ah readme
* tidy
* remove content type param
* Revert "remove content type param"
  This reverts commit 74d9040.
* note to content type

* Add in latest unstructed dependency as a requirement
* Add update to api code to expose the model name as an optional parameter
* Update docker to use multistage builds and initialize the chipper model
* Remove unneeded logger configs in notebook
* Fix smoketest
* regenerate api
* tidy notebooks
* Bump api tools version
* regenerate api
* fix unit test
* update response code to 400 from 403

* pip compile expect don't bump to 0.8.0
* Revert "pip compile expect don't bump to 0.8.0"
  This reverts commit 7a62df5.
* only do pycryptodome install
* disable test
* note
* stage for 400
* password protect 400 error
* add test file
* lint
* lint..
* put back coordinate test
* name nit: pdf_page_splits -> pdf_pages
* new pip-compile
* changelog

### Summary Version bumps for regular maintenance and to address moderate CVEs from security scans. Also updates the `unstructured` extra from `local-inference` to `all-docs` to keep up with latest best practices for the `unstructured` library. Includes an update for appropriately setting `pdf_infer_table_structure` depending on the value of `skip_infer_table_types` and adds a test.
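
The dependency between the two settings could look something like the sketch below. It is an assumption rather than the API's actual code: the helper name `should_infer_pdf_tables` is hypothetical, and the rule it encodes (infer table structure for PDFs unless `pdf` is listed in `skip_infer_table_types`) is a guess at the intended behavior.

```python
from typing import Iterable, Optional


def should_infer_pdf_tables(skip_infer_table_types: Optional[Iterable[str]]) -> bool:
    """Hypothetical helper: decide a value for pdf_infer_table_structure.

    Assumed rule: infer table structure unless "pdf" is in the skip list.
    """
    # Normalize entries so " PDF" and "pdf" are treated the same.
    skipped = {t.strip().lower() for t in (skip_infer_table_types or [])}
    return "pdf" not in skipped


print(should_infer_pdf_tables(["pdf", "jpg"]))
print(should_infer_pdf_tables(["png"]))
```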

### Summary

Updates the Dockerfile to use the Chainguard wolfi-base image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing.

### Testing

Run `make docker-build` and `make docker-start-api`, then try:

```
from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_url="https://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)
print("\n\n".join([str(el) for el in elements]))
```

…dx build` command (#425)

I noticed that images on the main branch are failing to build (and push) because the `-f` parameter is missing from `docker buildx build`. By default it expects a `Dockerfile` to exist, but we only have `Dockerfile-amd64` and `Dockerfile-arm64`.

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub <[email protected]>

…images` repo update (#427)

The build and publish CI steps are failing because the base images (their SHAs) have changed in quay.

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03)

![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5)

# Changes

**Fix for docx and other office files returning `{"detail":"File type None is not supported."}`**

After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support.

**Update smoke test coverage**

This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing.

# Testing

The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above.

```
curl 'https://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```
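
A minimal sketch of the explicit-mapping idea, using the standard library's `mimetypes.add_type`. The dictionary name `EXTENSION_TO_MIMETYPE`, the helper `guess_mimetype`, and the handful of extensions shown are illustrative assumptions; the actual list in `filetypes.py` is not reproduced here.

```python
import mimetypes
from typing import Optional

# Hypothetical subset of supported extensions, for illustration only.
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".msg": "application/vnd.ms-outlook",
}

# Register the mapping at import time so lookups no longer depend on
# whatever mime database the base image happens to ship.
for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
    mimetypes.add_type(mimetype, extension)


def guess_mimetype(filename: str) -> Optional[str]:
    """Return the mimetype registered for the filename's extension, or None."""
    return mimetypes.guess_type(filename)[0]


print(guess_mimetype("fake.docx"))
```

Registering the types explicitly makes the behavior the same on any base image, which is the failure mode the fix above describes.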

### Summary

Updates the `arm64` image to use `wolfi-base` instead of `rockylinux` and consolidates the `amd64` and `arm64` images into the same Dockerfile. As of this PR, the `amd64` and `arm64` images for the API are at parity.

### Testing

Successful docker build on the feature branch can be seen in [this job](https://github.com/Unstructured-IO/unstructured-api/actions/runs/9875409234/job/27272072089).

# Use the library for filetype detection

The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that `Filetype None is not supported`. The library has a detect_filetype that actually inspects the file bytes, so let's reuse this.

# Add a `content_type` param to override filetype detection

Add an optional `content_type` param that allows the user to override the filetype detection. We'll use this value if it's set, or take the `file.content_type` which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header.

# Testing

The important thing is that `test_happy_path_all_types` passes in the docker smoke test - this contains all filetypes that we want the API to support.

To test manually, you can try sending files to the server with and without the filename/content_type defined. Check out this branch and run `make run-web-app`.

Example sending with no extension in filename. This correctly processes a pdf.

```
import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "https://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)

print(response.text)
```

For the new param, you can try modifying the content type for a text based file. Verify that you can change the `metadata.filetype` of the response using the new param:

```
curl --location 'https://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
  {
    "type": "UncategorizedText",
    "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
    "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
    "metadata": {
      "filename": "family-day.eml",
      "languages": [
        "eng"
      ],
      "filetype": "text/plain"
    }
  },
  ...
]
```
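
The precedence described above can be sketched roughly as follows. This is not the API's code: `resolve_content_type` and its argument names are hypothetical, and falling back to byte-level detection when neither value is set is an assumption about how the pieces fit together.

```python
from typing import Callable, Optional


def resolve_content_type(
    content_type_param: Optional[str],
    header_content_type: Optional[str],
    detect: Callable[[], Optional[str]],
) -> Optional[str]:
    """Hypothetical precedence: explicit param, then multipart header, then detection."""
    if content_type_param:
        # The caller explicitly overrode the filetype.
        return content_type_param
    if header_content_type:
        # Value taken from the multipart Content-Type header.
        return header_content_type
    # Assumed fallback: inspect the file bytes (stand-in for detect_filetype).
    return detect()


# The explicit param wins over the header, as in the curl example above.
print(resolve_content_type("text/plain", "message/rfc822", lambda: None))
```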

## Description

* Added `include_slide_notes` parameter, default is `True`. Works for `.ppt` and `.pptx` file extensions.
* Added two new files in `sample-docs`: `sample-docs/notes.ppt`, `sample-docs/notes.pptx` that include notes on their slides. This is to easily test the functionality, as there are no existing PowerPoint files that include slide notes.

## Testing

```
# using default value (True) returns additional NarrativeText element that contains notes
curl -X 'POST' 'https://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"'

# explicit include_slide_notes=True returns additional NarrativeText element that contains notes
curl -X 'POST' 'https://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"' -F 'include_slide_notes=True'

# explicit include_slide_notes=False returns no NarrativeText element
curl -X 'POST' 'https://localhost:8000/general/v0/general' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'files=@sample-docs/notes.pptx' -F 'output_format="text/csv"' -F 'include_slide_notes=False'
```

Same with file `notes.ppt`

…457)

# Improve input validation to allow quotes for strategy parameter

Updated the strategy parameter to allow `"` or `'` as input wrapped around valid input to reduce 4xx errors.

# Testing

Invoked CURL and REST requests passing values such as `'fast'`, `"fast"`, and `fast` for parameter values for the strategy.

```
curl -X POST "PATH_TO_API" \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-H 'unstructured-api-key: KEY' \
-F "files=@/path_to_file.pdf" \
-F "strategy='fast'" \
-F "split-pdf-page=True" \
-F "split-pdf-allow-failed=True" \
-F "split-pdf-concurrency-level=15"
```

Added unit tests (passing).
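
A rough sketch of what tolerating quoted strategy values might look like. The function name `normalize_strategy` and the `VALID_STRATEGIES` set are assumptions for illustration (only strategies mentioned elsewhere in this log are listed); the real validation code may differ.

```python
# Assumed set of accepted values; the real API may allow others.
VALID_STRATEGIES = {"fast", "hi_res", "auto"}


def normalize_strategy(value: str) -> str:
    """Strip surrounding whitespace and quote characters, then validate."""
    # str.strip with a character set removes any leading/trailing " or '.
    cleaned = value.strip().strip("\"'")
    if cleaned not in VALID_STRATEGIES:
        raise ValueError(f"Invalid strategy: {value!r}")
    return cleaned


for raw in ("'fast'", '"fast"', "fast"):
    print(normalize_strategy(raw))
```

Note that this sketch also accepts mismatched quotes such as `'fast"`; stricter matching would need an explicit check that both ends use the same character.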

Fixes misspellings identified by the [check-spelling action](https://github.com/marketplace/actions/check-spelling). The misspellings have been reported at https://github.com/jsoref/unstructured-api/actions/runs/10822287753#summary-30025895524 The action will report that the changes in this PR would make it happy: https://github.com/jsoref/unstructured-api/actions/runs/10822287924#summary-30025895935 --------- Signed-off-by: Josh Soref <[email protected]>

Allow for CSV response (#125)

eb4439c

* Update the api code in the notebook
* Successfully generate new FastApi code
* Add smoke and unit tests
* Update CHANGELOG
* bump version in yaml
* Update README
* Update unit tests

pull bot added the ⤵️ pull label Jun 15, 2023

cragwolfe and others added 28 commits June 16, 2023 11:12

chore: bump python to 3.8.17 (#126)

022ca4e

deprecated pypdf2 (#128)

22ea541

chore: update all bash scripts to use shebang: /usr/bin/env bash (#129)

8570872

chore: bump unstructured to 0.7.8 (#131)

66fa35c

Chore: support msg files again (#133)

ae473dd

* result is empty for .msg
* changelog
* covert in notebook
* version bump

chore: updating readme api-keys (#130)

e8c94dd

* api key announcement
* updating copy
* Update README.md

---------

Co-authored-by: Amanda Cameron <[email protected]>

fix: update get API key header (#134)

54c034d

Update README.md (#135)

95f65fd

Remove outdate make target in readme

chore: Add more env variables to tune parallel mode (#137)

e70e00d

* chore: Add variable to set the number of threads in parallel mode
* chore: add variable to set the pdf split size

Chore: add support for xml_keep_tags param (#136)

5a851a0

* merge with encoding param
* wrote xml keep tags param test
* update changelog and readme
* bump requirements
* remove spaces in readme curl sample

base image update with security patches (#140)

8659d90

build(image): docker build tweak for arm64 (#141)

611f3ba

Fixes issue where arm64 docker builds were failing and preventing images from being published.

Fix(ci): docker build driver (#143)

0a15f71

* revert setup-buildx-action version
* test driver
* revert build on pr

Chore: add retry mechanism to fanout request (#142)

ef40417

* version
* retry parameters
* retry update and test
* readme
* mocker install
* pip compile
* remove enable env
* non retry able error code + test
* specify default setting in readme
* forgot to reuse variable

Omit coordinates unless requested (#149)

74680fc

refactor logging from multiline to formatted json (#154)

2cd5ff9

* refactor logging from multiline to formatted json
* regenerate api
* regenerate api

Chore: add support for include_page_breaks param (#153)

d6cff33

* add page break param to api
* test inclusion of page breaks
* add test file
* fix: include page breaks test
* rebase
* update changelog and readme
* added include page breaks param to smoke test

Update README with Chipper model announcement (#155)

d869365

Update README with Chipper model beta version announcement.

Chore: update require api key in readme (#159)

58284e7

fix: Change strategy back to auto from fast (#152)

60333f9

* Change strategy back to auto

updating readme link (#160)

b23908d

awalker4 and others added 29 commits May 10, 2024 22:10

Skip a pdf in make docker-test with linux/arm64 (#413)

c504295

This test is hanging and blocking the docker publish.

fix/skip another test pdf on linux/arm64 (#414)

f9aa5a4

Follow up from #413. These tests hang in CI and are blocking the image build.

fix/skip docker parallel test on emulated hardware (#415)

20238c2

One more off of #414 and #413...

chore: Remove the create_issue workflow (#419)

65a344d

We no longer need to mirror gh issues in the internal jira.

build(deps): version bumps for maintenance (#424)

fbdc6af

### Summary

Version bumps for regular maintenance and to address moderate CVEs from security scans.

- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`

fix: revert to rockylinux SHA that works (arm64) (#428)

d3564b6

unnecessary SHA update introduced in #427 that needs to be reverted

fix: re-add DOCKER_IMAGE env var in Test image step (#429)

5b604b2

shell syntax error occurs in docker-publish.yml workflow

fix: invalid env var setting in docker-publish workflow (#430)

2f482e8

bug introduced in previous PR causing build failure on main

fix: docker-publish workflow failing on main due to inexisting `ARCH` env var (#431)

d7acffe

build(deps): bump dependency versions (#434)

d5a878f

### Summary Bumps dependency versions for the API. Closes #432.

build(deps): bump to unstructured==0.14.10 (#438)

35d5b37

### Summary Bumps to `unstructured==0.14.10`.

build: bump to 0.0.74; bump dependencies (#442)

119e9bd

### Summary Bumps dependencies and prepares files for the `0.0.74` release.

build(deps): remove dependency constraint on safetensors (#443)

d5502d0

### Summary Removes a constraint on `safetensors` from version `0.0.38` that was preventing us from resolving a low CVE in `transformers`.

build: 0.0.77 release; bump to unstructured==0.15.5 (#450)

f91ce3a

### Summary Bumps to `unstructured==0.15.5`. Also pulls in the latest version of the `wolfi` base image.

build: 0.0.78 release; bump to unstructured==0.15.6 (#453)

e510f26

### Summary Bumps to `unstructured==0.15.6`. Resolve CVE related to the `nltk` library.

version 0.0.79; bump to unstructured 0.0.79 (#454)

843d68a

### Summary Bumps to `unstructured==0.15.7`.

version 0.0.80; bump to unstructured 0.15.10 (#458)

c52a2d1

### Summary Bumps to `unstructured==0.15.10`.

feat - set async workers (#452)

a189c87

## Changes - set uvicorn workers, default 1

version 0.0.81; bump to unstructured==0.15.13 (#463)

d42a6cf

### Summary Bumps to `unstructured==0.15.13` to apply security patches.

pull bot merged commit d42a6cf into admariner:main Oct 24, 2024
