RFC Decode/Encode with multiple bytes per chunk #84
Conversation
The speedup is nice, but I wouldn't want to introduce any additional unsafe code to get it. I wonder what the speedup would be like using just …
- Vec::from_raw_parts is much stricter than anticipated and requires deallocation to happen with the same alignment
I removed the UB. My machine shows no practical difference in the speedup. I imagine that almost all allocation schemes ultimately hand out memory aligned to at least word size, but maybe the Rust runtime does something more specific…
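That hunch is easy to check empirically. A quick sketch (note this observes the allocator's behavior, not any contract: `Vec<u8>` itself only guarantees alignment 1):

```rust
fn main() {
    // Vec<u8> only guarantees alignment 1; in practice most allocators
    // return at-least-word-aligned blocks, which would explain the
    // unchanged speedup, but nothing in Rust promises this.
    let v = vec![0u8; 64];
    println!("address mod 8 = {}", v.as_ptr() as usize % 8); // usually 0
}
```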
- slightly more complex expansion from limb to output bytes but pretty solid perf gains
The reason I'd prefer to use bytemuck is so that there's one place proving the safety of the cast; I would like to eventually make this crate …
Sounds good, I made the change. Any thoughts on the API for exposing this?
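For reference, the limb-to-bytes direction of the cast is the easy one, since `u8` has alignment 1. A minimal sketch of its shape (not necessarily the exact code in this PR):

```rust
/// Reinterpret word-sized limbs as bytes. Sketch only: bytemuck's `Pod`
/// machinery proves the cast sound in one place, which is the point of
/// preferring it over ad-hoc `unsafe`.
fn limbs_as_bytes(limbs: &[u64]) -> &[u8] {
    // Always valid for u64 -> u8: alignment can only decrease.
    bytemuck::cast_slice(limbs)
}
```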
- unrolling apparently does not help with performance. this simplifies a lot of the indexing and is also seemingly faster
@Nemo157 any thoughts? I also removed the manual unrolling, which seemed to both improve perf and be simpler code-wise anyway.
My thought was that, by not relying on the alignment of the allocation, this algorithm could replace the existing single-byte algorithm and give speedups in all cases. One additional requirement for that would be adding a test with a misaligned buffer to ensure that doesn't get broken.
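Such a test could look roughly like this (hypothetical sketch; `decode_into` is a stand-in name for whatever shape the decode-into-slice API ends up taking):

```rust
// Stand-in for the crate's decode-into-slice API (name hypothetical).
fn decode_into(_input: &[u8], _output: &mut [u8]) -> Result<usize, ()> {
    unimplemented!("placeholder for the real decoder")
}

#[test]
fn decodes_into_misaligned_buffer() {
    // One byte of slack lets us offset the output slice so it is
    // (almost certainly) not word-aligned relative to the allocation.
    let mut backing = vec![0u8; 65];
    let misaligned = &mut backing[1..];
    decode_into(b"he11owor1d", misaligned).unwrap();
}
```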
I think the current API of allowing arbitrary slices as output buffers makes that tricky. Specifically, it's significantly faster for short buffers to decode everything in word-sized chunks, but we would need to switch to byte-sized chunks for the slop. We could add some extra checks and do a special-case dispatch there, but that also sounds awkward.
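For concreteness, the split being described, as a sketch (the `step_*` functions are placeholders for the real per-limb and per-byte work):

```rust
// Placeholders for the actual base-58 per-limb / per-byte operations.
fn step_limb(limb: u64) -> u64 { limb }
fn step_byte(byte: u8) -> u8 { byte }

fn process(out: &mut [u8]) {
    let mut chunks = out.chunks_exact_mut(8);
    for chunk in &mut chunks {
        // Fast path: one whole word at a time.
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        let limb = step_limb(u64::from_le_bytes(buf));
        chunk.copy_from_slice(&limb.to_le_bytes());
    }
    for byte in chunks.into_remainder() {
        // Slow path: byte-sized chunks for the slop at the end.
        *byte = step_byte(*byte);
    }
}
```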
I haven't looked at the code so this may be a dumb comment, but alignment could be solved by copying data to and from a properly-aligned buffer on the stack. Hopefully no one uses base58 for large inputs, so a fixed-size stack buffer should handle all sane cases.
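Roughly like this (a sketch; the 128-limb capacity is an arbitrary choice and the per-word processing is elided):

```rust
fn process_via_stack_buffer(input: &[u8], out: &mut [u8]) {
    // A word-aligned scratch buffer on the stack: 128 u64 limbs = 1 KiB,
    // which comfortably covers any sane base58 payload.
    let mut scratch = [0u64; 128];
    let bytes: &mut [u8] = bytemuck::cast_slice_mut(&mut scratch);
    assert!(input.len() <= bytes.len() && out.len() <= bytes.len());

    // Copy in, run the word-aligned fast path over `bytes`, copy out.
    bytes[..input.len()].copy_from_slice(input);
    // ... word-sized processing on a buffer of known alignment ...
    out.copy_from_slice(&bytes[..out.len()]);
}
```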
I didn't really spend any time thinking about an API, so I'm looking for guidance on where the library wants to go with that. This is still an O(n²) algorithm, just with a better constant factor (https://gmplib.org/manual/Radix-to-Binary; sketched below). We could potentially implement a better algorithm for very large numbers, but that's overkill for my use case. Benchmark on my M1 laptop is roughly …