Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bloom size issues with Windows serialization and in general #38

Merged
merged 30 commits into from
Feb 23, 2023

Conversation

polm
Copy link
Contributor

@polm polm commented Nov 9, 2022

Before this commit, serialization used the array module, but the size of units in that varies depending on the host OS. As a result data serialized on Windows would use 32-bit units instead of 64-bit ones.

This commit uses struct for serialization instead of array, eliminating the ambiguity issue. Additionally, it detects if the serialized data used the Windows size, and loads it in that size if necessary, so old data will still work. (Even if the data was serialized into smaller containers on Windows no data was lost due to a separate issue, which will be fixed in a further commit.)

Note that this issue only affected the serialization code, and in memory containers were consistent on platforms already.

This is a draft until the further bloom size issues are addressed.

Before this commit, serialization used the array module, but the size of
units in that varies depending on the host OS. As a result data
serialized on Windows would use 32-bit units instead of 64-bit ones.

This commit uses struct for serialization instead of array, eliminating
the ambiguity issue. Additionally, it detects if the serialized data
used the Windows size, and loads it in that size if necessary, so old
data will still work. (Even if the data was serialized into smaller
containers on Windows no data was lost due to a separate issue, which
will be fixed in a further commit.)

Note that this issue only affected the serialization code, and in memory
containers were consistent on platforms already.
@polm polm added the bug label Nov 9, 2022
This fixes math issues with bloom filters so that they take up the
proper amount of memory instead of 8x too much. This issue stemmed from
confusion around which values were in bits and which in bytes.

It also adds a header and version tag to serialized data, to
differentiate data created with new math from data created with old
math. Deserialization should load old data but the tests for this aren't
written yet.
Windows one still needs a little work
This is pretty complicated! The tests seem OK though.

The sample Windows bytes were generated on Linux, I need to check that
they are actually like that on Windows.
Includes a test to check this works.

While working on the backwards compatability, it came up that in current
releases of preshed it's possible to specify 0 for the size or hash
function count. If the hcount is 0 the add/contains functions do
nothing, but do not fail. If size is 0 and hcount is not you'll get a
core dump.
@polm
Copy link
Contributor Author

polm commented Nov 11, 2022

I believe Windows is working now, though I'm leaving this a draft while I verify the result of serializing data on Windows.

Going through backwards compatability, one other issue came up - you could create a BloomFilter with size/hash function count params of 0. If the hash count is 0 add/contains checks do nothing. If size is 0 and hcount is not, then it will crash if you try to add anything (unsurprisingly). I added asserts to check for this.

I had one question come up while doing this - what's the right way to declare a const value in Cython? For some of the arithmetic I need the size of a container type in bits, so I'm assigning that at the top of the file.

The 8 here is "the size of a byte in bits", not the size of the key
(which happens to be the same value). Math...
@polm
Copy link
Contributor Author

polm commented Nov 14, 2022

I verified the Windows data is correct on real Windows.

The test failures here are unexpected - it looks like there is a segfault only on 3.6 on Linux. I'll look into that.

@polm
Copy link
Contributor Author

polm commented Nov 14, 2022

I am unable to reproduce the segfault locally, even with the exact Python version used in the tests (3.6.15). I am not sure what else could explain it, I'll see if I can figure something out.

Expect the tests to fail but without a segfault.
Expect the tests to pass but segfault, given the non-windows legacy test
segfaulted before.
The issue was that there was no check that length was a multiple of
KEY_BITS. Based on the init code, that should always be the case, but
that shouldn't be relied on.
Objects made with the current init code will always have a bitfield size
that's a multiple of container size. But if data is deserialized from
legacy data the old size may not be a multiple, so that should be
handled too.
math.ceil behaves a little unpredictably because sometimes normal
division can give a float and sometimes it doesn't with Cython types.
I'm not sure this is the best way, but relying strictly on integer
division avoids any magic with floating point numbers and lets the math
be consistent.
preshed/bloom.pyx Outdated Show resolved Hide resolved
Comment on lines +67 to +69
buflen = bloom.length // KEY_BITS
if bloom.length % KEY_BITS > 0:
buflen += 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use this pattern for ceiling division in several places, but I guess upside-down floor division would be better? That would look like this:

buflen = -(bloom.length // -KEY_BITS)

It's more terse, and perhaps faster, but also rather weird looking.

@polm
Copy link
Contributor Author

polm commented Nov 16, 2022

There's a few things I'd like feedback on, but this is functional and has no issues I'm aware of at this point.

@polm polm marked this pull request as ready for review November 16, 2022 06:14
Copy link
Contributor

@danieldk danieldk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvements! First round of comments.

preshed/bloom.pyx Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
@polm
Copy link
Contributor Author

polm commented Feb 7, 2023

Tests were failing because the Linux image no longer has Python 3.6 available, so I have removed Python 3.6 from testing.

@polm
Copy link
Contributor Author

polm commented Feb 7, 2023

Thanks for the feedback, it's good to have this reviewed! I think I have addressed all the points that have come up at this point. The details of the repacking are definitely complicated, so let me know if that's still not clear, whether from my explanation or in code.

Copy link
Contributor

@danieldk danieldk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small nitpicks.

preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Outdated Show resolved Hide resolved
preshed/bloom.pyx Show resolved Hide resolved
@polm
Copy link
Contributor Author

polm commented Feb 21, 2023

Thanks for the feedback! I think my most recent commits should 1. clarify what I meant by "signficant" 2. raise an exception instead of using an assert for version mismatch.

Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Daniël has reviewed this in detail, it'd be good for me to merge if he's happy.

As I understand it, care has been taken to make sure old data can still be read. Nevertheless, I wonder whether we shouldn't have this PR in a minor release of preshed instead of a bugfix release?

@polm
Copy link
Contributor Author

polm commented Feb 22, 2023

I think it would definitely make sense to make this a minor release of preshed.

@danieldk danieldk merged commit 3d0981b into explosion:master Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants