Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pikpak: implement custom hash to replace wrong sha1 #7905

Merged
merged 3 commits into from
Jun 19, 2024

Conversation

wiserain
Copy link
Contributor

@wiserain wiserain commented Jun 12, 2024

What is the purpose of this change?

This PR improves PikPak's file integrity verification by implementing a custom hash function named gcid and replacing the previously used SHA-1 hash.

Was the change discussed in an issue or in the forum before?

Fixes #7838

Checklist

  • I have read the contribution guidelines.
  • I have added tests for all changes in this PR if appropriate.
  • I have added documentation for the changes if appropriate.
  • All commit messages are in house style.
  • I'm done, this Pull Request is ready for review :-)

Copy link
Member

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the hash is quite right yet - or there is something I don't undersdand!

Please see inline.

Thank you :-)

@@ -128,6 +129,7 @@ func pikpakAuthorize(ctx context.Context, opt *Options, name string, m configmap

// Register with Fs
func init() {
pphashType = hash.RegisterHash("pikpak", "PikPakHash", 40, sha1.New)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just the same as hash.SHA1 isn't it?

In order for hash checking to work with I think you need to implement hash.Hash here and pass that in instead for sha1.New.

There are some examples in the rclone code

Either that or I don't understand what is going on here!

This is probably working becuase rclone will choose MD5 by default.

Check the hashes match properly with rclone hashsum

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick review! You're absolutely right, I'm having some difficulty registering a new hash type that requires a hash.New().

In the calcGcid() function, Pikpak's custom hash algorithm needs an additional variable, the file size, to calculate the size of each block within the file. While the Dropbox hash seems like the closest comparison in this case, it uses a fixed value of 4MB for bytesPerBlock.

My question is, how can I modify the custom hasher to be compatible with hash.RegisterHash()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hashes don't take a length function so short answer would be that you can't :-(

You could read all the data in your hash function before choosing the block size - that doesn't sound like a good solution to me though.

However I see that there are only 8 possible block sizes from 0x40000 to 0x200000 so you could register 8 hash functions but I don't think this will work well with the rclone core as it won't know which hash function to choose.

We've thought about size dependent hashes before when thinking about multipart Etags for S3. We decided not to use them though.

I don't think this is possible with rclone's hashing framework as it stands today :-(

Looking at the code for pikpak I see we have an MD5 which seems to work, so I'd say we don't need the support for the pikpak hash in the rclone core as MD5 is very good for detecting errors.

What is this hash used for? If it is just a content hash then I think we should probably remove the broken sha1 implementation from pikpak and just use MD5. It may have other uses I'm not aware of though.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the revised version and the comment.

@wiserain
Copy link
Contributor Author

As suggested, we have opted to make gcid for internal use only, primarily during the upload process. While MD5 may often be empty, we have maintained its functionality for consistency with the previous implementation.

Copy link
Member

@ncw ncw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine now :-)

Happy for you to squash to one commit and merge when it passes the CI and the integration tests.

@wiserain
Copy link
Contributor Author

Unfortunately, many fails in the integration tests. This might be a breaking change to PUT. Give me some time to investigate and finish the job.

@wiserain
Copy link
Contributor Author

2024/06/20 00:25:49 PASS: All tests finished OK in 6m14.431740163s

Merging it now!

@wiserain wiserain merged commit 300851e into rclone:master Jun 19, 2024
10 checks passed
@wiserain wiserain deleted the pikpak-gcid branch June 19, 2024 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problems with pikpak storing hash after uploads
2 participants