forked from Unstructured-IO/unstructured-api
[pull] main from Unstructured-IO:main #22
Merged
* Update the api code in the notebook
* Successfully generate new FastApi code
* Add smoke and unit tests
* Update CHANGELOG
* bump version in yaml
* Update README
* Update unit tests
* changelog and version
* version bump
* staged
* shorter test and lint
* pip compile
* remove detectron
* convert fast to hi_res for image
* image works locally
* convert strategy to auto if fast is passed in
* add in changelog
* yay bump unstructured
* remove convert
* remove tensorboard as well
* ci tesseract install kor
* had to sync for tesseract
* readme done
* bump version, rm detectron2 from Makefile
* version sync

---------

Co-authored-by: Crag Wolfe <[email protected]>
* markdown
* markdown smoketest
* ppt
* pptx + sort unittest
* changelog sort smoke test
* clean up
* clean up & version bump
* make note -> 4 support and 1 error
* don't support empty file
* add content to xml testfile -> not empty
* install pandoc in ci
* add more types
* lint
* no need to comment out csv
* doc nit
* unsturctured bump
* Revert "unsturctured bump" This reverts commit 439db0c.
* linux only check for install pandoc
* should use shebang
* make a table for readme
* doc nit on table name
* add smoke test to compare parallel and single mode
* chore: update partition_file_via_api to use requests.post

  Replace partition_via_api with a raw request, and pass on any relevant headers using the FastAPI request.
* Make sure coordinates match in smoke test
* bump version
* Fixes to smoke test curl calls
* Add a fix for partition_image not liking strategy=fast
* generate api again
* Remove partition_image workaround
* Pull parallel smoke test into a separate script
* Remove jpeg from tests until strategy bug is fixed (CORE-1294)
* Refactor and update docstrings
* make tidy
* shellcheck fixes
* sync version again
* Remove unneeded parallel mode startup
* tidy notebooks
* Note on use_parallel_mode in the readme
* Remove use_parallel_mode parameter
* Update parallel test to run in two containers
* tidy-notebooks and shellcheck
* Remove xfailed jpeg test
* regenerate to get this line that keeps disappearing
* Remove bandaid for partition_via_api metadata
* Update test shebang
* Fix test upon change to elements_from_json
* Bump release version
* result is empty for .msg
* changelog
* convert in notebook
* version bump
* api key announcement
* updating copy
* Update README.md

---------

Co-authored-by: Amanda Cameron <[email protected]>
Remove outdated make target in readme
* added encoding parameter and test for invalid input
* fix: param should start with m_ . ran make generate-api
* corrected passing the encoding parameter into partition calls
* added tests for both valid and invalid encodings
* merged and updated changelog and readme parameters section
* reflecting comment feedback: bumped version, used api in readme, added test for different encodings
* readme nit
* lint
* update yaml version
* bump unstructured version and lints
* added encoding parameter to parallel mode smoke test
* chore: Add variable to set the number of threads in parallel mode
* chore: add variable to set the pdf split size
* merge with encoding param
* wrote xml keep tags param test
* update changelog and readme
* bump requirements
* remove spaces in readme curl sample
Fixes issue where arm64 docker builds were failing and preventing images from being published.
* change and tidy notebook
* changelog version bump
* generate api
* friendly input
* adds test
* stick with valid input
* move param location
* Revert "adds test" This reverts commit 5d8296a.
* Revert "change and tidy notebook" This reverts commit 56e2bf1.
* move to unittest
* add readme
* Revert "move to unittest" This reverts commit 4319718.
* Revert "Revert "adds test"" This reverts commit 54b55ab.
* stage
* bump ust 0.7.10->11
* ah readme
* tidy
* remove content type param
* Revert "remove content type param" This reverts commit 74d9040.
* note to content type
* revert setup-buildx-action version
* test driver
* revert build on pr
* version
* retry parameters
* retry update and test
* readme
* mocker install
* pip compile
* remove enable env
* non-retryable error code + test
* specify default setting in readme
* forgot to reuse variable
* Add in latest unstructured dependency as a requirement
* Add update to api code to expose the model name as an optional parameter
* Update docker to use multistage builds and initialize the chipper model
* Remove unneeded logger configs in notebook
* Fix smoketest
* regenerate api
* tidy notebooks
* Bump api tools version
* regenerate api
* fix unit test
* update response code to 400 from 403
* refactor logging from multiline to formatted json
* regenerate api
* regenerate api
* add page break param to api
* test inclusion of page breaks
* add test file
* fix: include page breaks test
* rebase
* update changelog and readme
* added include page breaks param to smoke test
Update README with Chipper model beta version announcement.
* Change strategy back to auto
* pip compile expect don't bump to 0.8.0
* Revert "pip compile expect don't bump to 0.8.0" This reverts commit 7a62df5.
* only do pycryptodome install
* disable test
* note
* stage for 400
* password protect 400 error
* add test file
* lint
* lint..
* put back coordinate test
* name nit: pdf_page_splits -> pdf_pages
* new pip-compile
* changelog
This test is hanging and blocking the docker publish.
Follow up from #413. These tests hang in CI and are blocking the image build.
We no longer need to mirror gh issues in the internal jira.
### Summary
Version bumps for regular maintenance and to address moderate CVEs from security scans. Also updates the `unstructured` extra from `local-inference` to `all-docs` to keep up with latest best practices for the `unstructured` library. Includes an update for appropriately setting `pdf_infer_table_structure` depending on the value of `skip_infer_table_types` and adds a test.
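The exact rule this commit uses is not shown here. As a minimal sketch of one plausible reading, table structure inference for PDFs could be switched off whenever `pdf` appears in `skip_infer_table_types`. The helper name `should_infer_pdf_tables` is hypothetical and is not taken from the API code.

```python
from typing import Optional, Sequence


def should_infer_pdf_tables(skip_infer_table_types: Optional[Sequence[str]]) -> bool:
    """Hypothetical helper: infer PDF table structure unless "pdf" is skipped.

    This is an assumed reading of the commit summary, not the project's code.
    """
    # Normalize entries such as "PDF" or ".pdf" to a bare lowercase extension.
    skipped = {t.lower().lstrip(".") for t in (skip_infer_table_types or [])}
    return "pdf" not in skipped


# The result would be passed on as pdf_infer_table_structure.
pdf_infer_table_structure = should_infer_pdf_tables(["jpg", "pdf"])
```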
### Summary
Version bumps for regular maintenance and to address moderate CVEs from security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`
### Summary
Updates the Dockerfile to use the Chainguard wolfi-base image to reduce CVEs. Also adds a step in the docker publish job that scans the images and checks for CVEs before publishing.
### Testing
Run `make docker-build` and `make docker-start-api`, then try:
```
from unstructured.partition.api import partition_via_api

# Any local document works here; this path is only an example.
filename = "sample-docs/layout-parser-paper-fast.pdf"

elements = partition_via_api(
    filename=filename,
    api_url="http://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)
print("\n\n".join([str(el) for el in elements]))
```
…dx build` command (#425)
I noticed that images on the main branch are failing to build (and push) due to a missing `-f` parameter in `docker buildx build`. By default it expects `Dockerfile` to exist, but we only have `Dockerfile-amd64` and `Dockerfile-arm64`.
![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub <[email protected]>
…images` repo update (#427)
The build and publish CI steps are failing because the base images have changed in quay (their SHAs).
![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fc4e9aac-0820-4c90-9ad9-68cc6d9aad03)
![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/fafe2ca4-dab2-4610-a26b-a7a4d56723a5)
unnecessary SHA update introduced in #427 that needs to be reverted
shell syntax error occurs in docker-publish.yml workflow
bug introduced in previous PR causing build failure on main
### Summary
Bumps dependency versions for the API. Closes #432.
# Changes
**Fix for docx and other office files returning `{"detail":"File type None is not supported."}`**

After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support.

**Update smoke test coverage**

This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing.
# Testing
The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above.
```
curl 'http://localhost:8000/general/v0/general' \
  --form 'files=@"fake.docx"'
```
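The explicit-mapping approach described above can be sketched with the standard library alone. The dictionary below is a small illustrative subset chosen for this example, and `register_mimetypes` and `guess_mimetype` are hypothetical names. The project's real list lives in its `filetypes.py`, which is not reproduced here.

```python
import mimetypes
from typing import Optional

# Illustrative subset only (an assumption, not the project's actual mapping).
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".md": "text/markdown",
}


def register_mimetypes() -> None:
    """Teach the stdlib registry about each extension, whatever the base image ships."""
    for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
        mimetypes.add_type(mimetype, extension)


# Registering at import time mirrors the behavior described in the commit message.
register_mimetypes()


def guess_mimetype(filename: str) -> Optional[str]:
    """Return the MIME type guessed from the file extension, or None if unknown."""
    return mimetypes.guess_type(filename)[0]
```

Because the mapping is registered explicitly, the lookup no longer depends on which system MIME database the container image provides.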
### Summary
Bumps to `unstructured==0.14.10`.
### Summary
Updates the `arm64` image to use `wolfi-base` instead of `rockylinux` and consolidates the `amd64` and `arm64` images into the same Dockerfile. As of this PR, the `amd64` and `arm64` images for the API are at parity.
### Testing
Successful docker build on the feature branch can be seen in [this job](https://github.com/Unstructured-IO/unstructured-api/actions/runs/9875409234/job/27272072089).
### Summary
Bumps dependencies and prepares files for the `0.0.74` release.
### Summary
Removes a constraint on `safetensors` from version `0.0.38` that was preventing us from resolving a low CVE in `transformers`.
# Use the library for filetype detection
The mimetype detection has always been very naive in the API: we rely on the file extension. If the user doesn't include a filename, we return an error that `Filetype None is not supported`. The library has a `detect_filetype` that actually inspects the file bytes, so let's reuse this.
# Add a `content_type` param to override filetype detection
Add an optional `content_type` param that allows the user to override the filetype detection. We'll use this value if it's set, or take the `file.content_type`, which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header.
# Testing
The important thing is that `test_happy_path_all_types` passes in the docker smoke test. This contains all filetypes that we want the API to support.

To test manually, you can try sending files to the server with and without the filename/content_type defined. Check out this branch and run `make run-web-app`.

Example sending with no extension in the filename. This correctly processes a pdf.
```
import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)

print(response.text)
```
For the new param, you can try modifying the content type for a text-based file. Verify that you can change the `metadata.filetype` of the response using the new param:
```
curl --location 'http://localhost:8000/general/v0/general' \
  --form 'files=@"sample-docs/family-day.eml"' \
  --form 'content_type="text/plain"'

[
  {
    "type": "UncategorizedText",
    "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
    "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
    "metadata": {
      "filename": "family-day.eml",
      "languages": [
        "eng"
      ],
      "filetype": "text/plain"
    }
  },
  ...
]
```
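The precedence described in this commit (explicit `content_type` param first, then the upload's `Content-Type`, then byte-based detection) might be sketched as follows. Everything here is illustrative: `resolve_content_type` is a hypothetical name, `sniff_filetype` is a toy stand-in rather than the library's `detect_filetype`, and treating `application/octet-stream` as "unset" is an assumption of this sketch.

```python
from typing import Callable, Optional


def sniff_filetype(data: bytes) -> Optional[str]:
    """Toy byte-based detector (stand-in only; not the library's detect_filetype)."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    return None


def resolve_content_type(
    content_type_param: Optional[str],
    header_content_type: Optional[str],
    data: bytes,
    sniff: Callable[[bytes], Optional[str]] = sniff_filetype,
) -> Optional[str]:
    """Pick the content type: explicit param, then multipart header, then the bytes."""
    if content_type_param:
        return content_type_param
    # Assumption: a generic octet-stream header carries no useful information.
    if header_content_type and header_content_type != "application/octet-stream":
        return header_content_type
    return sniff(data)
```

With this ordering, a client that cannot control the multipart header can still override detection through the form parameter, which matches the `family-day.eml` example above.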
### Summary
Bumps to `unstructured==0.15.5`. Also pulls in the latest version of the `wolfi` base image.
### Summary
Bumps to `unstructured==0.15.6`. Resolves a CVE related to the `nltk` library.
### Summary
Bumps to `unstructured==0.15.7`.
## Description
* Added `include_slide_notes` parameter, default is `True`. Works for `.ppt` and `.pptx` file extensions.
* Added two new files in `sample-docs`: `sample-docs/notes.ppt` and `sample-docs/notes.pptx`, which include notes on their slides. This makes the functionality easy to test, as no existing PowerPoint sample files include slide notes.
## Testing
```
# using default value (True) returns additional NarrativeText element that contains notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"'

# explicit include_slide_notes=True returns additional NarrativeText element that contains notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"' \
  -F 'include_slide_notes=True'

# explicit include_slide_notes=False returns no NarrativeText element
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"' \
  -F 'include_slide_notes=False'
```
Same with file `notes.ppt`.
### Summary
Bumps to `unstructured==0.15.10`.
…457)
# Improve input validation to allow quotes for strategy parameter
Updated the strategy parameter to accept valid input wrapped in `"` or `'`, to reduce 4xx errors.
# Testing
Invoked CURL and REST requests passing values such as `'fast'`, `"fast"`, and `fast` for the strategy parameter.
```
curl -X POST "PATH_TO_API" \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H 'unstructured-api-key: KEY' \
  -F "files=@/path_to_file.pdf" \
  -F "strategy='fast'" \
  -F "split-pdf-page=True" \
  -F "split-pdf-allow-failed=True" \
  -F "split-pdf-concurrency-level=15"
```
Added unit tests (passing).
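A minimal sketch of the quote-tolerant validation described above, assuming a small set of accepted strategy names. `normalize_strategy` and `VALID_STRATEGIES` are hypothetical names for illustration, and the set only lists strategies mentioned elsewhere in this history. The API's actual validator is not shown here.

```python
# Assumed set of accepted values, for illustration only.
VALID_STRATEGIES = {"fast", "hi_res", "auto"}


def normalize_strategy(raw: str) -> str:
    """Strip one pair of matching quotes around the value, then validate it."""
    value = raw.strip()
    # Only remove quotes when the same quote character wraps the whole value.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        value = value[1:-1]
    if value not in VALID_STRATEGIES:
        raise ValueError(f"Invalid strategy: {raw!r}")
    return value
```

Mismatched quotes (for example a leading `'` with a trailing `"`) are left untouched and therefore still rejected, so the relaxation applies only to cleanly wrapped values.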
Fixes misspellings identified by the [check-spelling action](https://github.com/marketplace/actions/check-spelling).

The misspellings have been reported at https://github.com/jsoref/unstructured-api/actions/runs/10822287753#summary-30025895524

The action will report that the changes in this PR would make it happy: https://github.com/jsoref/unstructured-api/actions/runs/10822287924#summary-30025895935

---------

Signed-off-by: Josh Soref <[email protected]>
## Changes
- set uvicorn workers, default 1
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )