Fix bloom size issues with Windows serialization and in general #38
Conversation
Before this commit, serialization used the array module, but the size of units in that module varies depending on the host OS. As a result, data serialized on Windows would use 32-bit units instead of 64-bit ones. This commit uses struct for serialization instead of array, eliminating the ambiguity. Additionally, it detects if the serialized data used the Windows size and loads it at that size if necessary, so old data will still work. (Even if the data was serialized into smaller containers on Windows, no data was lost, due to a separate issue that will be fixed in a later commit.) Note that this issue only affected the serialization code; in-memory containers were already consistent across platforms.
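A minimal sketch of the unit-size ambiguity described above (illustrative values, not preshed's actual serialization code): the array module's typecodes map to platform C types, while struct with an explicit format string pins the unit size on every OS.

```python
from array import array
import struct

values = [1, 2, 3]

# array's 'L' typecode maps to the platform's "unsigned long":
# 8 bytes on most Linux/macOS builds but 4 bytes on Windows, so the
# serialized byte length differs depending on the host OS.
ambiguous = array("L", values).tobytes()

# struct with an explicit format ('<Q' = little-endian unsigned 64-bit)
# always produces 8-byte units, regardless of platform.
fixed = struct.pack(f"<{len(values)}Q", *values)
assert len(fixed) == len(values) * 8
```

The explicit format string is what removes the ambiguity: the byte layout is fully specified by the format, not by the host's C type sizes.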
This fixes math issues with bloom filters so that they take up the proper amount of memory instead of 8x too much. This issue stemmed from confusion around which values were in bits and which in bytes. It also adds a header and version tag to serialized data, to differentiate data created with new math from data created with old math. Deserialization should load old data but the tests for this aren't written yet.
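A hypothetical sketch of the header-and-version-tag idea mentioned above; the magic bytes, version number, and layout here are assumptions for illustration, not preshed's actual format.

```python
import struct

MAGIC = b"BLM"   # assumed magic marker, for illustration only
VERSION = 1      # assumed version number

def serialize(payload: bytes) -> bytes:
    # Prefix the payload with a magic marker and a version byte so data
    # written with the new math can be told apart from legacy data.
    return MAGIC + struct.pack("<B", VERSION) + payload

def deserialize(data: bytes) -> bytes:
    if data[:3] == MAGIC:
        version = data[3]
        if version != VERSION:
            raise ValueError(f"Unsupported version: {version}")
        return data[4:]
    # No recognizable header: treat as legacy (pre-header) data.
    return data
```

The key property is that legacy data, which never starts with the magic marker, falls through to the old loading path, so both formats can be read by the same function.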
Windows one still needs a little work
This is pretty complicated! The tests seem OK though. The sample Windows bytes were generated on Linux; I need to check that they actually come out like that on Windows.
Includes a test to check this works. While working on the backwards compatibility, it came up that in current releases of preshed it's possible to specify 0 for the size or hash function count. If the hash count is 0, the add/contains functions do nothing, but do not fail. If size is 0 and the hash count is not, you'll get a core dump.
I believe Windows is working now, though I'm leaving this a draft while I verify the result of serializing data on Windows. Going through backwards compatibility, one other issue came up: you could create a BloomFilter with size/hash function count params of 0. If the hash count is 0, add/contains checks do nothing. If size is 0 and the hash count is not, then it will crash if you try to add anything (unsurprisingly). I added asserts to check for this. One question came up while doing this: what's the right way to declare a const value in Cython? For some of the arithmetic I need the size of a container type in bits, so I'm assigning that at the top of the file.
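A minimal sketch of the degenerate-parameter guard described above; preshed's real BloomFilter is implemented in Cython and its exact checks may differ.

```python
class BloomFilter:
    """Illustrative stand-in for preshed's Cython BloomFilter."""

    def __init__(self, size: int, hcount: int):
        # A zero-sized bitfield or zero hash functions makes the filter
        # silently useless at best and crash-prone at worst, so reject
        # both up front rather than letting add() fail later.
        assert size > 0, "size must be positive"
        assert hcount > 0, "hash function count must be positive"
        self.size = size
        self.hcount = hcount
```

With the guard in place, the size-0 case fails immediately at construction time instead of producing a core dump on the first add.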
The 8 here is "the size of a byte in bits", not the size of the key (which happens to be the same value). Math...
I verified the Windows data is correct on real Windows. The test failures here are unexpected - it looks like there is a segfault only on 3.6 on Linux. I'll look into that.
I am unable to reproduce the segfault locally, even with the exact Python version used in the tests (3.6.15). I am not sure what else could explain it; I'll see if I can figure something out.
Expect the tests to fail but without a segfault.
Expect the tests to pass but segfault, given that the non-Windows legacy test segfaulted before.
The issue was that there was no check that length was a multiple of KEY_BITS. Based on the init code, that should always be the case, but that shouldn't be relied on.
Objects made with the current init code will always have a bitfield size that's a multiple of container size. But if data is deserialized from legacy data the old size may not be a multiple, so that should be handled too.
math.ceil behaves a little unpredictably here because with Cython types normal division sometimes yields a float and sometimes doesn't. I'm not sure this is the best way, but relying strictly on integer division avoids any magic with floating-point numbers and keeps the math consistent.
buflen = bloom.length // KEY_BITS
if bloom.length % KEY_BITS > 0:
    buflen += 1
I use this pattern for ceiling division in several places, but I guess upside-down floor division would be better? That would look like this:
buflen = -(bloom.length // -KEY_BITS)
It's more terse, and perhaps faster, but also rather weird looking.
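Both integer-only ceiling-division patterns from the discussion, side by side, assuming KEY_BITS = 64 for illustration:

```python
KEY_BITS = 64  # assumed container size in bits, for illustration

def ceil_div_explicit(length: int) -> int:
    # Floor-divide, then bump by one if there is a remainder.
    buflen = length // KEY_BITS
    if length % KEY_BITS > 0:
        buflen += 1
    return buflen

def ceil_div_upside_down(length: int) -> int:
    # "Upside-down" floor division: negate, floor-divide, negate again.
    # Python's floor division rounds toward negative infinity, which on
    # the negated value gives rounding away from zero, i.e. a ceiling.
    return -(length // -KEY_BITS)

# The two forms agree on all non-negative inputs.
assert all(ceil_div_explicit(n) == ceil_div_upside_down(n)
           for n in range(1000))
```

Both stay entirely in integer arithmetic, so neither has the float-precision hazards of math.ceil on a true-division result.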
There are a few things I'd like feedback on, but this is functional and has no issues I'm aware of at this point.
Nice improvements! First round of comments.
- use constants where possible
- make decode_len calculation uniform
Tests were failing because the Linux image no longer has Python 3.6 available, so I have removed Python 3.6 from testing.
Thanks for the feedback, it's good to have this reviewed! I think I have addressed all the points that have come up at this point. The details of the repacking are definitely complicated, so let me know if that's still not clear, whether from my explanation or in code.
Two small nitpicks.
Co-authored-by: Daniël de Kok <[email protected]>
Thanks for the feedback! I think my most recent commits should (1) clarify what I meant by "significant" and (2) raise an exception instead of using an assert for version mismatch.
As Daniël has reviewed this in detail, it'd be good for me to merge if he's happy.
As I understand it, care has been taken to make sure old data can still be read. Nevertheless, I wonder whether we shouldn't have this PR in a minor release of preshed instead of a bugfix release?
I think it would definitely make sense to make this a minor release of preshed.
This is a draft until the further bloom size issues are addressed.