Tags: Unstructured-IO/unstructured-api

0.0.81

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
version `0.0.81`; bump to `unstructured==0.15.13` (#463)

### Summary

Bumps to `unstructured==0.15.13` to apply security patches.

0.0.80

version 0.0.80; bump to unstructured 0.15.10 (#458)

### Summary

Bumps to `unstructured==0.15.10`.

0.0.79

version 0.0.79; bump to unstructured 0.15.7 (#454)

### Summary

Bumps to `unstructured==0.15.7`.

0.0.76

feat: enhance API filetype detection (#445)

# Use the library for filetype detection 

Mimetype detection in the API has always been naive: we rely on the
file extension. If the user doesn't include a filename, we return the
error `Filetype None is not supported`. The library has a
`detect_filetype` function that actually inspects the file bytes, so
let's reuse it.
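
To illustrate the difference between the two approaches, here is a rough stdlib-only sketch. It is not the library's `detect_filetype` implementation; the `MAGIC_NUMBERS` table and both helper functions are hypothetical names made up for this example, and the real function covers far more formats.

```python
import mimetypes
from typing import Optional

# Hypothetical signature table, for illustration only.
MAGIC_NUMBERS = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",  # also the container for docx/xlsx/pptx
}


def guess_by_extension(filename: Optional[str]) -> Optional[str]:
    """Extension-only lookup: returns None when the name gives no hint."""
    if not filename:
        return None
    mime, _encoding = mimetypes.guess_type(filename)
    return mime


def guess_by_content(data: bytes) -> Optional[str]:
    """Content-based lookup: compare the leading bytes to known signatures."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return None


pdf_bytes = b"%PDF-1.7\n"
print(guess_by_extension("sample-doc"))  # no extension, so nothing to go on
print(guess_by_content(pdf_bytes))       # the bytes still identify a PDF
```

The extension lookup has nothing to work with when the upload is named `sample-doc`, while the byte check still recognizes the PDF header.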

# Add a `content_type` param to override filetype detection

Add an optional `content_type` param that allows the user to override
the filetype detection. If it is set, we use this value; otherwise we
fall back to `file.content_type`, which comes from the multipart
`Content-Type` header. This gives clients that cannot modify the header
an alternative.
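
The precedence described above could be sketched roughly as follows. The function name `resolve_content_type` and its argument names are hypothetical and only illustrate the ordering; they are not taken from the API code.

```python
from typing import Optional


def resolve_content_type(
    content_type_param: Optional[str],
    header_content_type: Optional[str],
) -> Optional[str]:
    """Prefer the explicit form param, then the multipart Content-Type header."""
    # An empty string is treated the same as "not set" (an assumption of this sketch).
    return content_type_param or header_content_type


# The explicit param wins over the header value.
print(resolve_content_type("text/plain", "message/rfc822"))
# Without the param, the header value is used.
print(resolve_content_type(None, "message/rfc822"))
```

If both values are missing, the sketch returns `None`, at which point detection would have to fall back to inspecting the file itself.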

# Testing

The important thing is that `test_happy_path_all_types` passes in the
docker smoke test - this contains all filetypes that we want the API to
support.

To test manually, you can try sending files to the server with and
without the filename/content_type defined.

Check out this branch and run `make run-web-app`.

Example: sending a file whose filename has no extension. The PDF is
still processed correctly.
```
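import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
# The local dev server from `make run-web-app` is assumed to serve plain HTTP.
url = "http://localhost:8000/general/v0/general"

with open(filename, "rb") as f:
    # Upload under a name with no extension so detection must use the bytes.
    files = {"files": ("sample-doc", f)}
    response = requests.post(url, files=files)
    print(response.text)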
```

For the new param, you can try modifying the content type for a
text-based file.

Verify that you can change the `metadata.filetype` of the response using
the new param:

```
curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
    {
        "type": "UncategorizedText",
        "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
        "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
        "metadata": {
            "filename": "family-day.eml",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    ...
]
```

0.0.75

build(deps): remove dependency constraint on `safetensors` (#443)

### Summary

Removes the constraint on `safetensors` dating from version `0.0.38`,
which was preventing us from resolving a low-severity CVE in
`transformers`.

0.0.74

build: bump to `0.0.74`; bump dependencies (#442)

### Summary

Bumps dependencies and prepares files for the `0.0.74` release.

0.0.73

build(deps): bump to `unstructured==0.14.10` (#438)

### Summary

Bumps to `unstructured==0.14.10`.

0.0.72

fix/Fix MS Office filetype errors and harden docker smoketest (#436)

# Changes
**Fix for docx and other office files returning `{"detail":"File type
None is not supported."}`**
After moving to the wolfi base image, the `mimetypes` lib no longer
knows about these file extensions. To avoid issues like this, let's add
an explicit mapping for all the file extensions we care about. I added a
`filetypes.py` and moved `get_validated_mimetype` over. When this file
is imported, we'll call `mimetypes.add_type` for all file extensions we
support.
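
A minimal sketch of the import-time registration described above, using the standard `mimetypes.add_type`. The `EXTENSION_TO_MIMETYPE` mapping and `register_mimetypes` name are hypothetical, and the three entries are an illustrative subset rather than the actual contents of `filetypes.py`.

```python
import mimetypes

# Illustrative subset only; the real mapping is assumed to be longer.
EXTENSION_TO_MIMETYPE = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def register_mimetypes() -> None:
    """Teach `mimetypes` about each extension, whatever the base image ships."""
    for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
        mimetypes.add_type(mimetype, extension)


# Run at import time so every later lookup sees the explicit mapping.
register_mimetypes()

print(mimetypes.guess_type("fake.docx")[0])
```

Registering the types explicitly means the lookup no longer depends on which system mimetype database the container image happens to include.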

**Update smoke test coverage**
This bug snuck past because we were already providing the mimetype in
the docker smoke test. I updated `test_happy_path` to test against the
container with and without passing `content_type`. I added some missing
filetypes, and sorted the test params by extension so we can see when
new types are missing.

# Testing
The new smoke test will verify that all filetypes are working. You can
also `make docker-build && make docker-start-api`, and test out the docx
in the sample docs dir. On `main`, this file will give you the error
above.
```
curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```

0.0.71

build(deps): bump dependency versions (#434)

### Summary

Bumps dependency versions for the API. Closes #432.

0.0.70

build(deps): version bumps for maintenance (#424)

### Summary
Version bumps for regular maintenance and to address moderate CVEs from
security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`