Fix bloom size issues with Windows serialization and in general #38
Conversation
Before this commit, serialization used the array module, but the size of units in that module varies depending on the host OS. As a result, data serialized on Windows would use 32-bit units instead of 64-bit ones. This commit uses struct for serialization instead of array, eliminating the ambiguity. Additionally, it detects if the serialized data used the Windows size and loads it at that size if necessary, so old data will still work. (Even if the data was serialized into smaller containers on Windows, no data was lost, due to a separate issue that will be fixed in a later commit.) Note that this issue only affected the serialization code; in-memory containers were already consistent across platforms.
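A minimal sketch of the unit-size ambiguity described above (illustrative values, not preshed's actual serialization code): the array module's typecodes map to platform C types, while struct with an explicit format string pins the unit size on every OS.

```python
from array import array
import struct

values = [1, 2, 3]

# array's 'L' typecode maps to the platform's "unsigned long":
# 8 bytes on most Linux/macOS builds but 4 bytes on Windows, so the
# serialized byte length differs depending on the host OS.
ambiguous = array("L", values).tobytes()

# struct with an explicit format ('<Q' = little-endian unsigned 64-bit)
# always produces 8-byte units, regardless of platform.
fixed = struct.pack(f"<{len(values)}Q", *values)
assert len(fixed) == len(values) * 8
```

The explicit format string is what removes the ambiguity: the byte layout is fully specified by the format, not by the host's C type sizes.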
This fixes math issues with bloom filters so that they take up the proper amount of memory instead of 8x too much. This issue stemmed from confusion around which values were in bits and which in bytes. It also adds a header and version tag to serialized data, to differentiate data created with new math from data created with old math. Deserialization should load old data but the tests for this aren't written yet.
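A hypothetical sketch of the header-and-version-tag idea mentioned above; the magic bytes, version number, and layout here are assumptions for illustration, not preshed's actual format.

```python
import struct

MAGIC = b"BLM"   # assumed magic marker, for illustration only
VERSION = 1      # assumed version number

def serialize(payload: bytes) -> bytes:
    # Prefix the payload with a magic marker and a version byte so data
    # written with the new math can be told apart from legacy data.
    return MAGIC + struct.pack("<B", VERSION) + payload

def deserialize(data: bytes) -> bytes:
    if data[:3] == MAGIC:
        version = data[3]
        if version != VERSION:
            raise ValueError(f"Unsupported version: {version}")
        return data[4:]
    # No recognizable header: treat as legacy (pre-header) data.
    return data
```

The key property is that legacy data, which never starts with the magic marker, falls through to the old loading path, so both formats can be read by the same function.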
Windows one still needs a little work
This is pretty complicated! The tests seem OK though. The sample Windows bytes were generated on Linux; I need to check that they actually come out like that on Windows.
Includes a test to check this works. While working on the backwards compatibility, it came up that in current releases of preshed it's possible to specify 0 for the size or hash function count. If the hash count is 0, the add/contains functions do nothing, but do not fail. If size is 0 and the hash count is not, you'll get a core dump.
I believe Windows is working now, though I'm leaving this a draft while I verify the result of serializing data on Windows. Going through backwards compatibility, one other issue came up: you could create a BloomFilter with size/hash function count params of 0. If the hash count is 0, add/contains checks do nothing. If size is 0 and the hash count is not, then it will crash if you try to add anything (unsurprisingly). I added asserts to check for this. One question came up while doing this: what's the right way to declare a const value in Cython? For some of the arithmetic I need the size of a container type in bits, so I'm assigning that at the top of the file.
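A minimal sketch of the degenerate-parameter guard described above; preshed's real BloomFilter is implemented in Cython and its exact checks may differ.

```python
class BloomFilter:
    """Illustrative stand-in for preshed's Cython BloomFilter."""

    def __init__(self, size: int, hcount: int):
        # A zero-sized bitfield or zero hash functions makes the filter
        # silently useless at best and crash-prone at worst, so reject
        # both up front rather than letting add() fail later.
        assert size > 0, "size must be positive"
        assert hcount > 0, "hash function count must be positive"
        self.size = size
        self.hcount = hcount
```

With the guard in place, the size-0 case fails immediately at construction time instead of producing a core dump on the first add.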
The 8 here is "the size of a byte in bits", not the size of the key (which happens to be the same value). Math...
I verified the Windows data is correct on real Windows. The test failures here are unexpected - it looks like there is a segfault only on 3.6 on Linux. I'll look into that.
I am unable to reproduce the segfault locally, even with the exact Python version used in the tests (3.6.15). I am not sure what else could explain it; I'll see if I can figure something out.
Expect the tests to fail but without a segfault.
Expect the tests to pass but segfault, given that the non-Windows legacy test segfaulted before.
The issue was that there was no check that length was a multiple of KEY_BITS. Based on the init code, that should always be the case, but that shouldn't be relied on.
Objects made with the current init code will always have a bitfield size that's a multiple of container size. But if data is deserialized from legacy data the old size may not be a multiple, so that should be handled too.
math.ceil behaves a little unpredictably here because with Cython types normal division sometimes yields a float and sometimes doesn't. I'm not sure this is the best way, but relying strictly on integer division avoids any magic with floating-point numbers and keeps the math consistent.
buflen = bloom.length // KEY_BITS
if bloom.length % KEY_BITS > 0:
    buflen += 1
I use this pattern for ceiling division in several places, but I guess upside-down floor division would be better? That would look like this:
buflen = -(bloom.length // -KEY_BITS)
It's more terse, and perhaps faster, but also rather weird looking.
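Both integer-only ceiling-division patterns from the discussion, side by side, assuming KEY_BITS = 64 for illustration:

```python
KEY_BITS = 64  # assumed container size in bits, for illustration

def ceil_div_explicit(length: int) -> int:
    # Floor-divide, then bump by one if there is a remainder.
    buflen = length // KEY_BITS
    if length % KEY_BITS > 0:
        buflen += 1
    return buflen

def ceil_div_upside_down(length: int) -> int:
    # "Upside-down" floor division: negate, floor-divide, negate again.
    # Python's floor division rounds toward negative infinity, which on
    # the negated value gives rounding away from zero, i.e. a ceiling.
    return -(length // -KEY_BITS)

# The two forms agree on all non-negative inputs.
assert all(ceil_div_explicit(n) == ceil_div_upside_down(n)
           for n in range(1000))
```

Both stay entirely in integer arithmetic, so neither has the float-precision hazards of math.ceil on a true-division result.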
There are a few things I'd like feedback on, but this is functional and has no issues I'm aware of at this point.
Nice improvements! First round of comments.
- use constants where possible
- make decode_len calculation uniform
Tests were failing because the Linux image no longer has Python 3.6 available, so I have removed Python 3.6 from testing.
Thanks for the feedback, it's good to have this reviewed! I think I have addressed all the points that have come up at this point. The details of the repacking are definitely complicated, so let me know if that's still not clear, whether from my explanation or in code.
Two small nitpicks.
Co-authored-by: Daniël de Kok <[email protected]>
Thanks for the feedback! I think my most recent commits should (1) clarify what I meant by "significant" and (2) raise an exception instead of using an assert for version mismatch.
As Daniël has reviewed this in detail, it'd be good for me to merge if he's happy.
As I understand it, care has been taken to make sure old data can still be read. Nevertheless, I wonder whether we shouldn't have this PR in a minor release of preshed instead of a bugfix release?
I think it would definitely make sense to make this a minor release of preshed.
This is a draft until the further bloom size issues are addressed.