
RFC Decode/Encode with multiple bytes per chunk #84

Open
lwus wants to merge 11 commits into main

Conversation

lwus (Author) commented May 11, 2022

I didn't really spend any time thinking about an API, so I'm looking for guidance on where the library wants to go with that. This is still an O(n^2) algorithm, but with a better constant factor (https://gmplib.org/manual/Radix-to-Binary). We could potentially implement a better algorithm for very large numbers, but that's overkill for my use case.
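For illustration, a minimal sketch of the limb-based idea (not the actual code in this PR; it assumes `digits` already holds mapped base-58 values rather than raw characters, and the byte-order fix-up and leading-zero handling are elided):

```rust
// Sketch only: accumulate base-58 digits into little-endian u32 limbs,
// so each digit costs roughly len/4 multiply-adds instead of len.
fn decode_limbs(digits: &[u8], limbs: &mut [u32]) -> Result<(), ()> {
    for &digit in digits {
        // value = value * 58 + digit, with the carry propagated limb by limb
        let mut carry = digit as u64;
        for limb in limbs.iter_mut() {
            carry += *limb as u64 * 58;
            *limb = carry as u32;
            carry >>= 32;
        }
        if carry != 0 {
            return Err(()); // output buffer too small
        }
    }
    Ok(())
}
```

This is the Radix-to-Binary scheme from the GMP link above: still O(n^2) in the number of digits, but with the inner loop folded into word-sized multiply-accumulates.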

Benchmarks on my M1 laptop are roughly:

  32_bytes/decode_bs58          time:   [414.08 ns 415.13 ns 416.33 ns]
  32_bytes/decode_bs58_unsafe   time:   [99.003 ns 99.456 ns 99.953 ns]
  256_bytes/decode_bs58         time:   [26.940 us 27.022 us 27.118 us]
  256_bytes/decode_bs58_unsafe  time:   [1.3279 us 1.3282 us 1.3286 us]
...
  256_bytes/encode_bs58_vec         time:   [86.649 us 86.698 us 86.763 us]
  256_bytes/encode_bs58_vec_unsafe  time:   [3.5615 us 3.5681 us 3.5762 us]
  32_bytes/encode_bs58_vec          time:   [1.1578 us 1.1612 us 1.1649 us]
  32_bytes/encode_bs58_vec_unsafe   time:   [236.90 ns 237.01 ns 237.13 ns]

Nemo157 (Member) commented May 11, 2022

The speedup is nice, but I wouldn't want to introduce any additional unsafe code to get it. I wonder what the speedup would be like using just bytemuck::pod_align_to_mut and not changing the allocation size. That would allow applying the faster algorithm for the inner portion of a longer run (or the whole thing if the allocation happens to be aligned) while maintaining the current API.
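For concreteness, a sketch of that suggestion (the function name here is made up, not this PR's code):

```rust
// bytemuck::pod_align_to_mut splits a buffer into an unaligned head,
// an aligned middle of u32 limbs, and an unaligned tail. The fast
// limb-based path would run on the middle and the existing
// byte-at-a-time path on the head and tail, with no new unsafe code
// in this crate.
fn split_output(output: &mut [u8]) -> (&mut [u8], &mut [u32], &mut [u8]) {
    bytemuck::pod_align_to_mut(output)
}
```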

lwus added 2 commits May 11, 2022 16:56
- Vec::from_raw_parts is much stricter than anticipated and requires
  deallocation to happen with the same alignment
lwus changed the title from "RFC Decode with multiple bytes per chunk" to "RFC Decode/Encode with multiple bytes per chunk" on May 11, 2022
lwus (Author) commented May 11, 2022

I removed the UB from_raw_parts in 1382308. bytemuck seems to just defer to unsafe { vals.align_to_mut... } internally anyway, so I'm not sure there's much advantage (since we're only going between u8 and u32).

My machine shows no practical difference in the speedup. I imagine that almost all allocation schemes ultimately hand out memory at least aligned to the word size, but maybe the Rust runtime does something more specific...
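For reference, a sketch of the standard-library call bytemuck defers to:

```rust
fn main() {
    let mut buf = vec![0u8; 64];
    // slice::align_to_mut is unsafe because the caller must ensure the
    // element-type cast is valid; for u8 -> u32 any bit pattern is fine.
    let (head, limbs, tail) = unsafe { buf.align_to_mut::<u32>() };
    assert_eq!(head.len() + 4 * limbs.len() + tail.len(), 64);
}
```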

- slightly more complex expansion from limb to output bytes but pretty
  solid perf gains
Nemo157 (Member) commented May 11, 2022

The reason I'd prefer to use bytemuck is so that there's one place proving the safety of the cast; I would like to eventually make this crate forbid(unsafe_code) by pulling the &mut str helper out into another crate.
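For context, the lint in question is a crate-level attribute; with bytemuck owning the cast and the &mut str helper moved out, src/lib.rs could eventually carry:

```rust
#![forbid(unsafe_code)]
```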

lwus (Author) commented May 12, 2022

Sounds good, I made the change. Any thoughts on the API for exposing this?

lwus added 3 commits May 17, 2022 09:24
- unrolling apparently does not help with performance. this simplifies a
  lot of the indexing and is also seemingly faster
lwus (Author) commented May 17, 2022

@Nemo157 any thoughts?

Also removed the manual unrolling; that seemed to both improve perf and be simpler code-wise anyway.

Nemo157 (Member) commented May 23, 2022

My thought was that by not relying on the alignment of the allocation, this algorithm could replace the existing single-byte algorithm and give speedups in all cases.

One additional requirement for that would be adding tests with a misaligned buffer to ensure it doesn't get broken.
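The key trick for such a test is constructing a buffer that is guaranteed misaligned; a self-contained sketch (names are illustrative):

```rust
// Force the output slice off u32 alignment so the byte-wise fallback
// is exercised regardless of where the backing allocation lands.
#[repr(align(4))]
struct Aligned([u8; 36]);

#[test]
fn misaligned_output_uses_byte_fallback() {
    let mut backing = Aligned([0u8; 36]);
    // one byte past a 4-aligned base: guaranteed misaligned for u32
    let output = &mut backing.0[1..33];
    let (head, limbs, tail) = bytemuck::pod_align_to_mut::<u8, u32>(output);
    // the split must leave a non-empty head, so this input genuinely
    // hits the fallback path
    assert!(!head.is_empty());
    assert_eq!(head.len() + 4 * limbs.len() + tail.len(), 32);
}
```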

lwus (Author) commented May 23, 2022

I think the current API of allowing arbitrary slices as output buffers makes that tricky. Specifically, it's significantly faster for short buffers to decode everything in word-sized chunks, but we would need to switch to byte-sized chunks for the slop. We could add some extra checks and do a special-case dispatch in that case, but that also sounds awkward.
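A sketch of that dispatch (hypothetical, not the crate's API; byte-order and leading-zero handling are elided as in the earlier sketch). Every digit has to run both loops, with the carry handed from the word-sized chunks to the byte-sized slop:

```rust
fn decode_mixed(digits: &[u8], output: &mut [u8]) -> Result<(), ()> {
    let limb_end = output.len() / 4 * 4;
    for &digit in digits {
        let mut carry = digit as u64;
        // fast path: word-sized chunks (unaligned loads via
        // from_le_bytes sidestep the alignment question, at some cost)
        for chunk in output[..limb_end].chunks_exact_mut(4) {
            let limb = u32::from_le_bytes(chunk[..4].try_into().unwrap());
            carry += limb as u64 * 58;
            chunk.copy_from_slice(&(carry as u32).to_le_bytes());
            carry >>= 32;
        }
        // slow path: byte-sized chunks for the slop
        for byte in &mut output[limb_end..] {
            carry += *byte as u64 * 58;
            *byte = carry as u8;
            carry >>= 8;
        }
        if carry != 0 {
            return Err(()); // output buffer too small
        }
    }
    Ok(())
}
```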

mina86 (Contributor) commented Sep 4, 2022

I haven't looked at the code so this may be a dumb comment, but alignment could be solved by copying data to and from a properly-aligned buffer on the stack. Hopefully no one uses base58 for large inputs, so a fixed-size stack buffer should handle all sane cases.
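A sketch of that idea (sizes and names are made up; overflow checking for a partial top limb and byte-order fix-up are elided):

```rust
// Sketch: decode into an aligned, fixed-size stack buffer, then copy
// into the caller's arbitrarily-aligned output slice. Inputs larger
// than the buffer would fall back to the existing byte-wise path.
const STACK_LIMBS: usize = 256; // 1 KiB of limbs; generous for base58

fn decode_via_stack(digits: &[u8], output: &mut [u8]) -> Result<(), ()> {
    let n_limbs = (output.len() + 3) / 4;
    if n_limbs > STACK_LIMBS {
        return Err(()); // punt to the slow path instead
    }
    let mut limbs = [0u32; STACK_LIMBS];
    decode_limbs(digits, &mut limbs[..n_limbs])?; // earlier sketch
    // copy little-endian limb bytes out into the unaligned destination
    for (i, byte) in output.iter_mut().enumerate() {
        *byte = limbs[i / 4].to_le_bytes()[i % 4];
    }
    Ok(())
}
```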
