Tags: Unstructured-IO/unstructured-api
Tags
feat: enhance API filetype detection (#445) # Use the library for filetype detection The mimetype detection has always been very naive in the API - we rely on the file extension. If the user doesn't include a filename, we return an error that `Filetype None is not supported`. The library has a detect_filetype that actually inspects the file bytes, so let's reuse this. # Add a `content_type` param to override filetype detection Add an optional `content_type` param that allows the user to override the filetype detection. We'll use this value if it's set, or take the `file.content_type` which is based on the multipart `Content-Type` header. This provides an alternative when clients are unable to modify the header. # Testing The important thing is that `test_happy_path_all_types` passes in the docker smoke test - this contains all filetypes that we want the API to support. To test manually, you can try sending files to the server with and without the filename/content_type defined. Check out this branch and run `make run-web-app`. Example sending with no extension in filename. This correctly processes a pdf. ``` import requests filename = "sample-docs/layout-parser-paper-fast.pdf" url = "https://localhost:8000/general/v0/general" with open(filename, 'rb') as f: files = {'files': ("sample-doc", f)} response = requests.post(url, files=files) print(response.text) ``` For the new param, you can try modifying the content type for a text based file. Verify that you can change the `metadata.filetype` of the response using the new param: ``` curl --location 'https://localhost:8000/general/v0/general' \ --form 'files=@"sample-docs/family-day.eml"' \ --form 'content_type="text/plain"' [ { "type": "UncategorizedText", "element_id": "5cafe1ce2b0a96f8e3eba232e790db19", "text": "MIME-Version: 1.0 Date: Wed, 21 Dec 2022 10:28:53 -0600 Message-ID: <CAPgNNXQKR=o6AsOTr74VMrsDNhUJW0Keou9n3vLa2UO_Nv+tZw@mail.gmail.com> Subject: Family Day From: Mallori Harrell <[email protected]> To: Mallori Harrell <[email protected]> Content-Type: multipart/alternative; boundary=\"0000000000005c115405f0590ce4\"", "metadata": { "filename": "family-day.eml", "languages": [ "eng" ], "filetype": "text/plain" } }, ... ] ```
fix/Fix MS Office filetype errors and harden docker smoketest (#436) # Changes **Fix for docx and other office files returning `{"detail":"File type None is not supported."}`** After moving to the wolfi base image, the `mimetypes` lib no longer knows about these file extensions. To avoid issues like this, let's add an explicit mapping for all the file extensions we care about. I added a `filetypes.py` and moved `get_validated_mimetype` over. When this file is imported, we'll call `mimetypes.add_type` for all file extensions we support. **Update smoke test coverage** This bug snuck past because we were already providing the mimetype in the docker smoke test. I updated `test_happy_path` to test against the container with and without passing `content_type`. I added some missing filetypes, and sorted the test params by extension so we can see when new types are missing. # Testing The new smoke test will verify that all filetypes are working. You can also `make docker-build && make docker-start-api`, and test out the docx in the sample docs dir. On `main`, this file will give you the error above. ``` curl 'https://localhost:8000/general/v0/general' \ --form 'files=@"fake.docx"' ```
PreviousNext