
API presign multipart upload experimental support for S3 #7246

Merged
merged 11 commits into from
Jan 8, 2024

Conversation

nopcoder
Contributor

@nopcoder nopcoder commented Jan 4, 2024

OpenAPI presign multipart upload, experimental support for S3, based on the following proposal:

https://github.com/treeverse/lakeFS/blob/master/design/accepted/openapi-multipart-upload.md

Closes #7253

@nopcoder nopcoder self-assigned this Jan 4, 2024

github-actions bot commented Jan 4, 2024


@nopcoder nopcoder requested a review from a team January 4, 2024 15:44
@nopcoder nopcoder changed the title API multipart upload experimenal support for s3 API presign multipart upload experimenal support for s3 Jan 4, 2024

github-actions bot commented Jan 4, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

10 passed

@nopcoder nopcoder added area/API Improvements or additions to the API include-changelog PR description should be included in next release changelog labels Jan 4, 2024
Contributor

@itaiad200 itaiad200 left a comment


Thank you for adding this so quickly! Please add tests; other than that, mostly minor comments.

Initiates a multipart upload and returns an upload ID, optionally with presigned URLs for each part.
Part numbers start at 1. Every part except the last has a minimum size that depends on the underlying blockstore implementation.
For example, with the S3 blockstore the minimum part size is 5MB (excluding the last part).
responses:
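To make the part-size rule in the description concrete, here is a small hedged sketch (not lakeFS code; the constant and helper names are invented for illustration) that computes how many parts an upload needs when every part except the last is held at the 5MB minimum:

```go
package main

import "fmt"

// minPartSize mirrors the 5MB S3 minimum mentioned in the API
// description (illustrative constant, not lakeFS code).
const minPartSize int64 = 5 * 1024 * 1024

// partCount returns how many parts an object of the given size needs
// when every part except the last is exactly minPartSize bytes.
func partCount(size int64) int64 {
	if size <= 0 {
		return 0
	}
	n := size / minPartSize
	if size%minPartSize != 0 {
		n++ // the last part is allowed to be smaller than the minimum
	}
	return n
}

func main() {
	fmt.Println(partCount(5 * 1024 * 1024))  // prints 1
	fmt.Println(partCount(12 * 1024 * 1024)) // prints 3 (two full parts + a 2MB tail)
}
```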
Contributor


We need a 400 response too. For example, when 'parts' is 0.

Contributor Author


Added the missing 400 status code to all routes.
The SDK will fail the request in the case of a bad request.
The number of parts for create is optional, in order to support an additional API that provides presigned part URL requests for streaming.

required: true
schema:
type: string
- in: query
Contributor


Why is the path needed when the uploadID is given?
Can two MPUs be available for the same part?
Isn't the uploadID unique between different paths?

Contributor Author


We require the path just like the S3 protocol requires a key. The same goes for the upload ID: even though it looks unique, the S3 protocol requires the bucket and key on each API call.

When we actually work with the physical address (which we also require), we use the path for the permission check on create and complete. For each part, the request is signed with all the information, including the upload ID and the path.

There can be two MPUs to the same logical address, which means two part uploads can run in parallel. In our case, because we provide different physical addresses, these parts will have different addresses. If you try to pass such a path to a different upload request, complete will not find that part. I hope I understood the concern.
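As an illustration of why each part request carries both identifiers: an S3 UploadPart request addresses the key (path), uploadId, and partNumber together. A hedged sketch (the helper and host layout are invented for illustration; a real presigned URL additionally carries SigV4 signature query parameters):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPartURL is a hypothetical helper showing which identifiers an
// S3 UploadPart request carries. It only assembles the addressing
// portion; actual presigning adds signature query parameters.
func buildPartURL(bucket, key, uploadID string, partNumber int) string {
	q := url.Values{}
	q.Set("uploadId", uploadID)
	q.Set("partNumber", fmt.Sprintf("%d", partNumber))
	u := url.URL{
		Scheme:   "https",
		Host:     bucket + ".s3.amazonaws.com",
		Path:     "/" + key,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	// Both the key (path) and the upload ID appear in every part request.
	fmt.Println(buildPartURL("my-bucket", "data/obj", "upload-123", 1))
}
```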

application/json:
schema:
$ref: "#/components/schemas/CompletePresignMultipartUpload"
responses:
Contributor


A 400 is needed, for example for a wrong parts order.

application/json:
schema:
$ref: "#/components/schemas/AbortPresignMultipartUpload"
responses:
Contributor


A 400 is needed. For example, when the uploadID mismatches the path.

@@ -609,6 +609,9 @@ func newCache(cfg CacheConfig) cache.Cache {
}

func (m *Manager) SetLinkAddress(ctx context.Context, repository *graveler.RepositoryRecord, physicalAddress string) error {
if physicalAddress == "" {
Contributor


Was this a bug? If so, can you add tests for it?

Contributor Author


Just input validation, to make sure we don't store a value for an empty path.

const DefaultUploadPartSize = MinUploadPartSize

// DefaultUploadConcurrency is the default number of goroutines to spin up when uploading a multipart upload
const DefaultUploadConcurrency = 5
Contributor


Is it enough? Seems too low to me.

Contributor Author


This is the default the AWS SDK uses - I'll make it configurable for lakectl fs upload.

Contributor


If it's AWS defaults then it's good enough

pkg/api/helpers/upload.go (outdated; resolved)
pkg/api/helpers/upload.go (outdated; resolved)
Comment on lines 284 to 288
if uploader != nil {
stats, err = uploader.Upload(ctx)
} else {
stats, err = clientUploadPreSignHelper(ctx, client, repoID, branchID, objPath, metadata, contentType, contents, contentLength)
}
Contributor


This is confusing: uploader is a PreSignUploader, and if it's nil we use clientUploadPreSignHelper instead. Can you rename the vars or add a comment to clarify?

Contributor Author


Sure, I'll update the name or integrate them together.

Contributor Author


Integrated presign upload of a single object into the uploader functionality.

const MinUploadPartSize int64 = 1024 * 1024 * 5

// DefaultUploadPartSize is the default part size to buffer chunks of a payload into
const DefaultUploadPartSize = MinUploadPartSize
Contributor


Is this optimal?

Contributor Author


These are AWS SDK defaults

@nopcoder nopcoder changed the title API presign multipart upload experimenal support for s3 API presign multipart upload experimenal support for S3 Jan 6, 2024
Contributor

@itaiad200 itaiad200 left a comment


LGTM.
My only concern is test coverage for the different upload-file flows. Given the urgency, it could be handled in a separate PR.

Member

@N-o-Z N-o-Z left a comment


Looks awesome!

Some comments; also, crucial tests for the uploader are missing.

@@ -253,6 +253,7 @@ type Config struct {
PreSignedExpiry time.Duration `mapstructure:"pre_signed_expiry"`
DisablePreSigned bool `mapstructure:"disable_pre_signed"`
DisablePreSignedUI bool `mapstructure:"disable_pre_signed_ui"`
DisablePreSignedMultipart bool `mapstructure:"disable_pre_signed_multipart"`
Member


Can we put this under "experimental" or "beta" please?

Contributor Author


Added a configuration reference doc line marked experimental - I didn't want to configure it outside the block adapter.
Let me know if you think there's a better place.

return
}

repo, err := c.Catalog.GetRepository(ctx, repository)
Member


You can skip this - BranchExists will already do that for you

Contributor Author


Glad it is cached; we need it for BlockAdapter.ResolveNamespace.

pkg/api/controller.go (outdated; resolved)
cmd/lakectl/cmd/fs_upload.go (outdated; resolved)
func (u *presignUpload) initMultipart(ctx context.Context) (*apigen.PresignMultipartUpload, error) {
// adjust part size
u.partSize = DefaultUploadPartSize
if u.size/u.partSize >= int64(MaxUploadParts) {
Member


This is kind of awkward. I think we should enforce at least that (MaxUploadParts + x) * u.partSize > u.size, where x is a constant size.
Otherwise, this can lead to bad performance.

Contributor Author


I've used the same logic the S3 SDK uses to calculate the part size the client will use: it starts with the default part size and keeps it as long as the upload fits within the maximum number of parts allowed. When the size can't fit in that many parts, it adjusts the part size upward.
A CLI usually offers a way to configure the part size, but for now I've kept it at the default, which is the minimum size.
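The calculation described here can be sketched roughly as follows (a simplified illustration of the AWS-SDK-style logic, not the PR's actual code; the constant names are invented, with values matching the defaults discussed above):

```go
package main

import "fmt"

// Constants mirroring the defaults discussed in the review
// (AWS SDK defaults); names are illustrative.
const (
	minUploadPartSize int64 = 5 * 1024 * 1024
	defaultPartSize         = minUploadPartSize
	maxUploadParts    int64 = 10000
)

// adjustPartSize keeps the default part size unless the object would
// need more than maxUploadParts parts, and grows the part size just
// enough to fit otherwise.
func adjustPartSize(size int64) int64 {
	partSize := int64(defaultPartSize)
	if size/partSize >= maxUploadParts {
		// round up so the upload fits within maxUploadParts parts
		partSize = (size / maxUploadParts) + 1
	}
	return partSize
}

func main() {
	fmt.Println(adjustPartSize(100 * 1024 * 1024))        // small upload: default 5MB parts
	fmt.Println(adjustPartSize(100 * 1024 * 1024 * 1024)) // 100GiB upload: part size grows
}
```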

@@ -422,6 +431,42 @@ func (a *Adapter) GetPreSignedURL(ctx context.Context, obj block.ObjectPointer,
return req.URL, expiry, nil
}

func (a *Adapter) GetPresignUploadPartURL(ctx context.Context, obj block.ObjectPointer, uploadID string, partNumber int) (string, error) {
if a.disablePreSigned {
Member


Suggested change:
- if a.disablePreSigned {
+ if a.disablePreSigned || a.disablePreSignedMultipart {

Contributor Author


I was thinking about this one - whether to block presigning because of the multipart capability or not. I kept it bound to the feature it is based on: if we support presign, you can have part presign, in case you need it for some other feature.

@nopcoder
Contributor Author

nopcoder commented Jan 7, 2024

> LGTM Only concern is with the upload file different flows test coverage. Given the urgency, it could be handled in a separate PR.

Added esti tests for presign multipart upload.
Will add lakectl tests in a separate PR.

@nopcoder nopcoder requested a review from N-o-Z January 7, 2024 20:56
Member

@N-o-Z N-o-Z left a comment


Some questions / requests regarding tests

require.NotNil(t, resp.JSON201)
require.NotEmpty(t, resp.JSON201.UploadId)
require.NotEmpty(t, resp.JSON201.PhysicalAddress)
if tt.parts != nil {
Member


Correct me if I'm wrong but this code is never reached

Contributor Author


Relevant for the no_parts case.

})
require.NoError(t, err, "CompletePresignMultipartUpload should succeed")
require.Equalf(t, http.StatusOK, resp.StatusCode(), "CompletePresignMultipartUpload status code mismatch: %s - %s", resp.Status(), resp.Body)
})
Member


Where do we validate that the object actually exists?

Contributor Author


adding

@@ -0,0 +1,246 @@
package esti
Member


Can we add a test for u.size / u.partSize >= maxParts?

Contributor Author

@nopcoder nopcoder Jan 8, 2024


:D Sorry - I can test this one manually, but it would need a very big file for an automated test.
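One hedged way to cover that branch without a multi-gigabyte fixture is to exercise the arithmetic directly with a synthetic size. The helper below is a stand-in for the uploader's internal part-size adjustment, not the PR's actual function:

```go
package main

import "fmt"

// Illustrative stand-ins for the uploader's constants.
const (
	minPartSize int64 = 5 * 1024 * 1024
	maxParts    int64 = 10000
)

// partSizeFor is a stand-in for the uploader's internal adjustment
// logic: grow the part size when the default would exceed maxParts.
func partSizeFor(size int64) int64 {
	ps := minPartSize
	if size/ps >= maxParts {
		ps = size/maxParts + 1
	}
	return ps
}

func main() {
	// Exercise the size/partSize >= maxParts branch with a synthetic
	// 1TiB size; no real file is needed to validate the arithmetic.
	const tib = int64(1) << 40
	ps := partSizeFor(tib)
	fmt.Println(ps > minPartSize)          // branch taken: part size grew
	fmt.Println((tib+ps-1)/ps <= maxParts) // upload fits in maxParts parts
}
```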

@nopcoder nopcoder requested a review from N-o-Z January 8, 2024 10:15
Member

@N-o-Z N-o-Z left a comment


LGTM!

@nopcoder nopcoder merged commit 11c07ff into master Jan 8, 2024
37 checks passed
@nopcoder nopcoder deleted the feature/api-presign-multipart-upload branch January 8, 2024 10:26
Successfully merging this pull request may close these issues.

API: experimental open api presign multipart upload support for s3