[pull] main from Unstructured-IO:main #22

Merged
merged 214 commits into from
Oct 24, 2024


pull[bot]

@pull pull bot commented Jun 15, 2023

See Commits and Changes for more details.


Created by pull[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

* Update the api code in the notebook

* Successfully generate new FastApi code

* Add smoke and unit tests

* Update CHANGELOG

* bump version in yaml

* Update README

* Update unit tests
@pull pull bot added the ⤵️ pull label Jun 15, 2023
cragwolfe and others added 28 commits June 16, 2023 11:12
* changelog and version

* version bump

* staged

* shorter test and lint

* pip compile

* remove detectron

* convert fast to hi_res for image

* image works locally

* convert strategy to auto if fast is passed in

* add in changelog

* yay bump unstructured

* remove convert

* remove tensorboard as well

* ci tesseract install kor

* had to sync for tesseract

* readme done

* bump version, rm detectron2 from Makefile

* version sync

---------

Co-authored-by: Crag Wolfe <[email protected]>
* markdown

* markdown smoketest

* ppt

* pptx + sort unittest

* changelog sort smoke test

* clean up

* clean up & version bump

* make note -> 4 support and 1 error

* don't support empty file

* add content to xml testfile ->notempty

* install pandoc in ci

* add more types

* lint

* no need to comment out csv

* doc nit

* unstructured bump

* Revert "unstructured bump"

This reverts commit 439db0c.

* linux only check for install pandoc

* should use shebang

* make a table for readme

* doc nit on table name
* add smoke test to compare parallel and single mode

* chore: update partition_file_via_api to use requests.post

Replace partition_via_api with a raw request, and pass on any relevant headers using the FastAPI
request.

* Make sure coordinates match in smoke test

* bump version

* Fixes to smoke test curl calls

* Add a fix for partition_image not liking strategy=fast

* generate api again

* Remove partition_image workaround

* Pull parallel smoke test into a separate script

* Remove jpeg from tests until strategy bug is fixed (CORE-1294)

* Refactor and update docstrings

* make tidy

* shellcheck fixes

* sync version again

* Remove unneeded parallel mode startup

* tidy notebooks

* Note on use_parallel_mode in the readme

* Remove use_parallel_mode parameter

* Update parallel test to run in two containers

* tidy-notebooks and shellcheck

* Remove xfailed jpeg test

* regenerate to get this line that keeps disappearing

* Remove bandaid for partition_via_api metadata

* Update test shebang

* Fix test upon change to elements_from_json

* Bump release version
* result is empty for .msg

* changelog

* convert in notebook

* version bump
* api key announcement

* updating copy

* Update README.md

---------

Co-authored-by: Amanda Cameron <[email protected]>
Remove outdated make target in readme
* added encoding parameter and test for invalid input

* fix: param should start with m_ . ran make generate-api

* corrected passing the encoding parameter into partition calls

* added tests for both valid and invalid encodings

* merged and updated changelog and readme parameters section

* reflecting comment feedback: bumped version, used api in readme, added test for different encodings

* readme nit

* lint

* update yaml version

* bump unstructured version and lints

* added encoding parameter to parallel mode smoke test
* chore: Add variable to set the number of threads in parallel mode

* chore: add variable to set the pdf split size
* merge with encoding param

* wrote xml keep tags param test

* update changelog and readme

* bump requirements

* remove spaces in readme curl sample
Fixes issue where arm64 docker builds were failing and preventing images from being published.
* change and tidy notebook

* changelog version bump

* generate api

* friendly input

* adds test

* stick with valid input

* move param location

* Revert "adds test"

This reverts commit 5d8296a.

* Revert "change and tidy notebook"

This reverts commit 56e2bf1.

* move to unittest

* add readme

* Revert "move to unittest"

This reverts commit 4319718.

* Revert "Revert "adds test""

This reverts commit 54b55ab.

* stage

* bump ust 0.7.10->11

* ah readme

* tidy

* remove content type param

* Revert "remove content type param"

This reverts commit 74d9040.

* note to content type
* revert setup-buildx-action version

* test driver

* revert build on pr
* version

* retry parameters

* retry update and test

* readme

* mocker install

* pip compile

* remove enable env

* non retry able error code + test

* specify default setting in readme

* forgot to reuse variable
* Add in latest unstructured dependency as a requirement

* Add update to api code to expose the model name as an optional parameter

* Update docker to use multistage builds and initialize the chipper model

* Remove unneeded logger configs in notebook

* Fix smoketest

* regenerate api

* tidy notebooks

* Bump api tools version

* regenerate api

* fix unit test

* update response code to 400 from 403
* refactor logging from multiline to formatted json

* regenerate api

* regenerate api
* add page break param to api

* test inclusion of page breaks

* add test file

* fix: include page breaks test

* rebase

* update changelog and readme

* added include page breaks param to smoke test
Update README with Chipper model beta version announcement.
* Change strategy back to auto
* pip compile expect don't bump to 0.8.0

* Revert "pip compile expect don't bump to 0.8.0"

This reverts commit 7a62df5.

* only do pycryptodome install

* disable test

* note

* stage for 400

* password protect  400 error

* add test file

* lint

* lint..

* put back coordinate test

* name nit:  pdf_page_splits -> pdf_pages

* new pip-compile

* changelog
awalker4 and others added 29 commits May 10, 2024 22:10
This test is hanging and blocking the docker publish.
Follow up from #413. These tests hang in CI and are blocking the image
build.
We no longer need to mirror gh issues in the internal jira.
### Summary

Version bumps for regular maintenance and to address moderate CVEs from
security scans. Also updates the `unstructured` extra from
`local-inference` to `all-docs` to keep up with latest best practices
for the `unstructured` library.

Includes an update for appropriately setting `pdf_infer_table_structure`
depending on the value of `skip_infer_table_types` and adds a test.
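
One way to picture that dependency is the sketch below. It assumes the rule is simply "infer PDF tables unless `pdf` is listed in `skip_infer_table_types`"; the helper name `should_infer_pdf_tables` is hypothetical and the actual condition used in the API may be stricter.

```python
from typing import List


def should_infer_pdf_tables(skip_infer_table_types: List[str]) -> bool:
    """Return True unless the caller asked to skip table inference for PDFs.

    Assumption: the only input that matters is whether "pdf" appears in the
    skip list. This is an illustration, not the API's actual logic.
    """
    normalized = {file_type.strip().lower() for file_type in skip_infer_table_types}
    return "pdf" not in normalized


# Hypothetical keyword arguments that would be handed on to a partition call.
partition_kwargs = {"skip_infer_table_types": ["pdf", "jpg"]}
partition_kwargs["pdf_infer_table_structure"] = should_infer_pdf_tables(
    partition_kwargs["skip_infer_table_types"]
)
print(partition_kwargs["pdf_infer_table_structure"])
```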
### Summary
Version bumps for regular maintenance and to address moderate CVEs from
security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`
### Summary
Updates the Dockerfile to use the Chainguard wolfi-base image to reduce
CVEs. Also adds a step in the docker publish job that scans the images
and checks for CVEs before publishing.

### Testing
Run `make docker-build` and  `make docker-start-api`, then try:
```
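from unstructured.partition.api import partition_via_api

# Any local document works; this path is the sample used elsewhere in this repo.
filename = "sample-docs/layout-parser-paper-fast.pdf"

elements = partition_via_api(
    filename=filename,
    # The local container serves plain HTTP, not HTTPS.
    api_url="http://localhost:8000/general/v0/general",
    api_key="<API-KEY>",
    strategy="hi_res",
)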

print("\n\n".join([str(el) for el in elements]))
```
…dx build` command (#425)

I noticed that images on main branch are failing to build (and push) due
to missing `-f` parameter in `docker buildx build`. By default it
expects `Dockerfile` to exist, but we only have `Dockerfile-amd64` and
`Dockerfile-arm64`


![image](https://github.com/Unstructured-IO/unstructured-api/assets/64484917/4527165a-909e-498d-b0ee-8bba4b1a13e4)

---------

Co-authored-by: christinestraub <[email protected]>
unnecessary SHA update introduced in #427 that needs to be reverted
shell syntax error occurs in docker-publish.yml workflow
bug introduced in previous PR causing build failure on main
### Summary

Bumps dependency versions for the API. Closes #432.
# Changes
**Fix for docx and other office files returning `{"detail":"File type
None is not supported."}`**
After moving to the wolfi base image, the `mimetypes` lib no longer
knows about these file extensions. To avoid issues like this, let's add
an explicit mapping for all the file extensions we care about. I added a
`filetypes.py` and moved `get_validated_mimetype` over. When this file
is imported, we'll call `mimetypes.add_type` for all file extensions we
support.
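
A minimal sketch of that explicit mapping, using only the standard library. The dictionary name `EXTENSION_TO_MIMETYPE`, the helper `register_filetypes`, and the short list of extensions are illustrative assumptions, not the actual contents of `filetypes.py`.

```python
import mimetypes

# Illustrative subset only; the real list of supported extensions is longer.
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".msg": "application/vnd.ms-outlook",
}


def register_filetypes() -> None:
    """Register every known extension so lookups do not depend on the base image."""
    for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
        mimetypes.add_type(mimetype, extension)


# Run at import time, mirroring the "when this file is imported" behavior described above.
register_filetypes()

print(mimetypes.guess_type("fake.docx")[0])
```

Registering at import time means the lookup no longer depends on which system mime database the container happens to ship.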

**Update smoke test coverage**
This bug snuck past because we were already providing the mimetype in
the docker smoke test. I updated `test_happy_path` to test against the
container with and without passing `content_type`. I added some missing
filetypes, and sorted the test params by extension so we can see when
new types are missing.

# Testing
The new smoke test will verify that all filetypes are working. You can
also `make docker-build && make docker-start-api`, and test out the docx
in the sample docs dir. On `main`, this file will give you the error
above.
```
curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```
### Summary

Bumps to `unstructured==0.14.10`.
### Summary

Updates the `arm64` image to use `wolfi-base` instead of `rockylinux`
and consolidates the `amd64` and `arm64` images into the same
Dockerfile. As of this PR, the `amd64` and `arm64` images for the API
are at parity.

### Testing

Successful docker build on the feature branch can be seen in [this
job](https://github.com/Unstructured-IO/unstructured-api/actions/runs/9875409234/job/27272072089).
### Summary

Bumps dependencies and prepares files for the `0.0.74` release.
### Summary

Removes a constraint on `safetensors` from version `0.0.38` that was
preventing us from resolving a low CVE in `transformers`.
# Use the library for filetype detection 

The mimetype detection has always been very naive in the API - we rely
on the file extension. If the user doesn't include a filename, we return
an error that `Filetype None is not supported`. The library has a
`detect_filetype` function that actually inspects the file bytes, so
let's reuse this.

# Add a `content_type` param to override filetype detection

Add an optional `content_type` param that allows the user to override
the filetype detection. We'll use this value if it's set, or take the
`file.content_type` which is based on the multipart `Content-Type`
header. This provides an alternative when clients are unable to modify
the header.
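
The precedence described above can be sketched as a small pure function. The function name `resolve_content_type` and its argument names are hypothetical, and falling back to byte-based detection last is an assumption about how the two changes in this PR combine.

```python
from typing import Optional


def resolve_content_type(
    content_type_param: Optional[str],
    header_content_type: Optional[str],
    detected_type: Optional[str],
) -> Optional[str]:
    """Pick the content type to use for a file.

    Order (an assumption based on the description above):
    1. the explicit `content_type` form parameter, if set
    2. the multipart `Content-Type` header of the uploaded file
    3. whatever byte-based detection reported
    """
    for candidate in (content_type_param, header_content_type, detected_type):
        if candidate:
            return candidate
    return None


# The explicit parameter wins over the header sent by the client.
print(resolve_content_type("text/plain", "message/rfc822", None))
```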

# Testing

The important thing is that `test_happy_path_all_types` passes in the
docker smoke test - this contains all filetypes that we want the API to
support.

To test manually, you can try sending files to the server with and
without the filename/content_type defined.

Check out this branch and run `make run-web-app`.

Example sending with no extension in filename. This correctly processes
a pdf.
```
import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)
    print(response.text)
```

For the new param, you can try modifying the content type for a text
based file.

Verify that you can change the `metadata.filetype` of the response using
the new param:

```
curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
    {
        "type": "UncategorizedText",
        "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
        "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
        "metadata": {
            "filename": "family-day.eml",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    ...
]
```
### Summary

Bumps to `unstructured==0.15.5`. Also pulls in the latest version of the
`wolfi` base image.
### Summary

Bumps to `unstructured==0.15.6`. Resolve CVE related to the `nltk`
library.
### Summary

Bumps to `unstructured==0.15.7`.
## Description
* Added `include_slide_notes` parameter, default is `True`. Works for
`.ppt` and `.pptx` file extensions.
* Added two new files in `sample-docs`: `sample-docs/notes.ppt`,
`sample-docs/notes.pptx` that include notes on their slides. This is to
easily test the functionality, as there are no existing PowerPoint files
that include slide notes.

## Testing
```
# using default value (True) returns additional NarrativeText element that contains notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"'

# explicit include_slide_notes=True returns additional NarrativeText element that contains notes
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"' \
  -F 'include_slide_notes=True'

# explicit include_slide_notes=False returns no NarrativeText element
curl -X 'POST' 'http://localhost:8000/general/v0/general' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'files=@sample-docs/notes.pptx' \
  -F 'output_format="text/csv"' \
  -F 'include_slide_notes=False'
```

Same with file `notes.ppt`
### Summary

Bumps to `unstructured==0.15.10`.
…457)

# Improve input validation to allow quotes for strategy parameter
Updated the `strategy` parameter to accept valid values wrapped in `"`
or `'`, which reduces 4xx errors.
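
A rough sketch of this kind of normalization follows. The helper name `normalize_strategy`, the `VALID_STRATEGIES` set, and the choice to raise `ValueError` are assumptions for illustration; the API's own validation may be structured differently.

```python
# Assumed set of accepted values, for illustration only.
VALID_STRATEGIES = {"fast", "hi_res", "auto", "ocr_only"}


def normalize_strategy(value: str) -> str:
    """Strip one pair of matching surrounding quotes, then validate the value."""
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        value = value[1:-1]
    if value not in VALID_STRATEGIES:
        raise ValueError(f"Invalid strategy: {value!r}")
    return value


for raw in ("'fast'", '"fast"', "fast"):
    print(normalize_strategy(raw))
```

Only a matching pair of quotes is removed, so a value such as `'fast"` is still rejected rather than silently accepted.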

# Testing

Invoked curl and REST requests passing `'fast'`, `"fast"`, and `fast`
as values for the `strategy` parameter.
```
curl -X POST "PATH_TO_API" \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -H 'unstructured-api-key: KEY' \
  -F "files=@/path_to_file.pdf" \
  -F "strategy='fast'" \
  -F "split-pdf-page=True" \
  -F "split-pdf-allow-failed=True" \
  -F "split-pdf-concurrency-level=15"
  ```
  
  Added unit tests (passing).
Fixes misspellings identified by the [check-spelling
action](https://github.com/marketplace/actions/check-spelling).

The misspellings have been reported at
https://github.com/jsoref/unstructured-api/actions/runs/10822287753#summary-30025895524

The action will report that the changes in this PR would make it happy:
https://github.com/jsoref/unstructured-api/actions/runs/10822287924#summary-30025895935

---------

Signed-off-by: Josh Soref <[email protected]>
## Changes
- set uvicorn workers, default 1
### Summary

Bumps to `unstructured==0.15.13` to apply security patches.
@pull pull bot merged commit d42a6cf into admariner:main Oct 24, 2024