Tags · Unstructured-IO/unstructured-api

0.0.81

version `0.0.81`; bump to `unstructured==0.15.13` (#463)

### Summary

Bumps to `unstructured==0.15.13` to apply security patches.

Sep 23, 2024
d42a6cf

0.0.80

version 0.0.80; bump to unstructured 0.15.10 (#458)

### Summary

Bumps to `unstructured==0.15.10`.

Sep 10, 2024
c52a2d1

0.0.79

version 0.0.79; bump to unstructured 0.15.7 (#454)

### Summary

Bumps to `unstructured==0.15.7`.

Aug 20, 2024
843d68a

0.0.76

feat: enhance API filetype detection (#445)

# Use the library for filetype detection 

Mimetype detection in the API has always been naive: we rely on the
file extension. If the user doesn't include a filename, we return the
error `Filetype None is not supported`. The library has a
`detect_filetype` function that actually inspects the file bytes, so
let's reuse it.
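
To illustrate the difference between extension-based and byte-based detection, here is a minimal sketch using only the standard library. It is not the library's `detect_filetype` implementation; the `sniff_mimetype` helper and its short signature table are hypothetical and cover only a few formats.

```python
import mimetypes
from typing import Optional

# Hypothetical signature table: a real detector covers many more formats.
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
]


def sniff_mimetype(data: bytes, filename: Optional[str] = None) -> Optional[str]:
    """Guess a MIME type from the leading bytes, then fall back to the extension."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    if filename:
        # Extension-based guess, which is all the API did before this change.
        guessed, _ = mimetypes.guess_type(filename)
        return guessed
    return None
```

With this approach a PDF uploaded as `sample-doc` (no extension) is still recognized from its `%PDF-` header, whereas an extension-only lookup would return `None`.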

# Add a `content_type` param to override filetype detection

Add an optional `content_type` param that allows the user to override
the filetype detection. We'll use this value if it's set, or take the
`file.content_type` which is based on the multipart `Content-Type`
header. This provides an alternative when clients are unable to modify
the header.
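
The precedence described above could be sketched roughly as follows. The function name and arguments are hypothetical, and the final fallback to a detected type is an assumption about how the pieces fit together, not the API's actual code.

```python
from typing import Optional


def resolve_content_type(
    content_type_param: Optional[str],
    header_content_type: Optional[str],
    detected: Optional[str],
) -> Optional[str]:
    """Return the first non-empty candidate, in order of precedence."""
    # 1. explicit `content_type` form param
    # 2. `file.content_type` from the multipart Content-Type header
    # 3. (assumed) type detected from the file itself
    for candidate in (content_type_param, header_content_type, detected):
        if candidate:
            return candidate
    return None
```

For example, `resolve_content_type("text/plain", "message/rfc822", None)` picks the explicit param, so the caller's override takes precedence over the header.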

# Testing

The important thing is that `test_happy_path_all_types` passes in the
docker smoke test - this contains all filetypes that we want the API to
support.

To test manually, you can try sending files to the server with and
without the filename/content_type defined.

Check out this branch and run `make run-web-app`.

Example: sending a file with no extension in the filename. The PDF is
still processed correctly.
```
import requests

filename = "sample-docs/layout-parser-paper-fast.pdf"
url = "http://localhost:8000/general/v0/general"

with open(filename, 'rb') as f:
    files = {'files': ("sample-doc", f)}
    response = requests.post(url, files=files)
    print(response.text)
```

For the new param, you can try modifying the content type for a
text-based file.

Verify that you can change the `metadata.filetype` of the response using
the new param:

```
curl --location 'http://localhost:8000/general/v0/general' \
--form 'files=@"sample-docs/family-day.eml"' \
--form 'content_type="text/plain"'

[
    {
        "type": "UncategorizedText",
        "element_id": "5cafe1ce2b0a96f8e3eba232e790db19",
        "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"",
        "metadata": {
            "filename": "family-day.eml",
            "languages": [
                "eng"
            ],
            "filetype": "text/plain"
        }
    },
    ...
]
```

Aug 6, 2024
7468938

0.0.75

build(deps): remove dependency constraint on `safetensors` (#443)

### Summary

Removes a constraint on `safetensors` from version `0.0.38` that was
preventing us from resolving a low-severity CVE in `transformers`.

Jul 24, 2024
d5502d0

0.0.74

build: bump to `0.0.74`; bump dependencies (#442)

### Summary

Bumps dependencies and prepares files for the `0.0.74` release.

Jul 22, 2024
119e9bd

0.0.73

build(deps): bump to `unstructured==0.14.10` (#438)

### Summary

Bumps to `unstructured==0.14.10`.

Jul 9, 2024
35d5b37

0.0.72

fix/Fix MS Office filetype errors and harden docker smoketest (#436)

# Changes
**Fix for docx and other office files returning `{"detail":"File type
None is not supported."}`**
After moving to the wolfi base image, the `mimetypes` lib no longer
knows about these file extensions. To avoid issues like this, let's add
an explicit mapping for all the file extensions we care about. I added a
`filetypes.py` and moved `get_validated_mimetype` over. When this file
is imported, we'll call `mimetypes.add_type` for all file extensions we
support.
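
A minimal sketch of the explicit-mapping idea, using the standard library's `mimetypes.add_type`. The mapping below is a hypothetical subset chosen for illustration; the actual contents of `filetypes.py` are not shown here.

```python
import mimetypes

# Hypothetical subset of the extensions the API cares about.
EXTENSION_TO_MIMETYPE = {
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

# Register at import time so lookups no longer depend on the base image's
# system mime.types files.
for extension, mimetype in EXTENSION_TO_MIMETYPE.items():
    mimetypes.add_type(mimetype, extension)


def guess_mimetype(filename: str):
    """Return the registered MIME type for a filename, or None if unknown."""
    mimetype, _encoding = mimetypes.guess_type(filename)
    return mimetype
```

Because the registration runs on import, any module that imports this file gets the same lookups regardless of which base image the container uses.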

**Update smoke test coverage**
This bug snuck past because we were already providing the mimetype in
the docker smoke test. I updated `test_happy_path` to test against the
container with and without passing `content_type`. I added some missing
filetypes, and sorted the test params by extension so we can see when
new types are missing.

# Testing
The new smoke test will verify that all filetypes are working. You can
also `make docker-build && make docker-start-api`, and test out the docx
in the sample docs dir. On `main`, this file will give you the error
above.
```
curl 'http://localhost:8000/general/v0/general' \
--form 'files=@"fake.docx"'
```

Jun 28, 2024
6710df0

0.0.71

build(deps): bump dependency versions (#434)

### Summary

Bumps dependency versions for the API. Closes #432.

Jun 24, 2024
d5a878f

0.0.70

build(deps): version bumps for maintenance (#424)

### Summary
Version bumps for regular maintenance and to address moderate CVEs from
security scans.
- bump `unstructured` to `0.14.6`
- bump `unstructured-inference` to `0.7.35`

Jun 14, 2024
fbdc6af
