reconsider UnsafeBitArray #34
Hi @eyalfa, thank you for raising the issue. I'm not sure I fully understood the use case. I'd appreciate it if you could share more info or a code example.
Right, it shouldn't. In fact it's even faster, but just one CPU cycle faster 😀. On the other hand, managed memory can put pressure on the GC. It's a trade-off. Probably the best idea would be to expose a …
Hi @alexandrnikitin, and thanks for your swift reply! Regarding my use case with Spark: I create an RDD with a single BloomFilter object per partition.
I'm actually using this pattern twice: in one place I also persist (cache) the bloom filters RDD in memory and reuse it multiple times; in the second place the bloom filters RDD is used as a 'one-off'. There are a few issues here related to off-heap memory allocations:
Sorry for the long write-up 😄.
I definitely think the library should support multiple implementations, but we must first determine the performance penalty (if any) of this approach.
Thank you for the thorough answer. Now it's clear.
Is there a way to help Spark find out the memory usage? A special interface? Re 2 and 3: can you iterate over the RDD with BFs and close them explicitly after you've finished working with them?
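The explicit-close pattern suggested here can be sketched in plain Java, without Spark. `CloseableFilter` is a hypothetical stand-in for a bloom filter holding off-heap memory, not the library's actual API; the idea is simply to release each filter deterministically after the last use instead of waiting for `finalize`:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a filter backed by off-heap memory.
class CloseableFilter implements AutoCloseable {
    private boolean closed = false;

    public boolean isClosed() { return closed; }

    @Override
    public void close() {
        // Real code would call Unsafe.freeMemory(address) here.
        closed = true;
    }
}

public class ExplicitClose {
    public static void main(String[] args) {
        List<CloseableFilter> filters = new ArrayList<>();
        for (int i = 0; i < 3; i++) filters.add(new CloseableFilter());

        // ... use the filters (e.g. probe them during a join) ...

        // Explicit cleanup once the work is done:
        for (CloseableFilter f : filters) f.close();

        boolean allClosed = true;
        for (CloseableFilter f : filters) allClosed &= f.isClosed();
        System.out.println("all closed: " + allClosed);
    }
}
```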
There's a memory access benchmark already in UnsafeBitArrayBenchmark.scala. It gives the following numbers on my laptop:
The difference is just a few CPU cycles. I don't have asm listings to show yet.
Hi @alexandrnikitin,
I'm using the library with Apache Spark; specifically, I'm optimizing a sparse join by creating a bloom filter per partition. This required wrapping the bloom filter instances with a class that implements `finalize`. If I hadn't done so I'd get a massive memory leak 😢. The wrapper solved part of my problems, but not all of them: Spark is unaware of the memory usage of the bloom filters, which may lead to situations where it 'misses' a memory spill (Spark's way of freeing up memory).
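A minimal sketch of such a wrapper, with hypothetical names (the real resource would be an address returned by `Unsafe.allocateMemory`; here a flag simulates it). `finalize()` acts as the last-resort safety net described above, while `close()` allows deterministic release:

```java
public class FilterWrapper implements AutoCloseable {
    // Stands in for the off-heap address; true once the memory is released.
    private boolean freed = false;

    public boolean isFreed() { return freed; }

    @Override
    public void close() { free(); }

    // GC-triggered fallback, as described in the issue.
    @Override
    @SuppressWarnings("deprecation")
    protected void finalize() { free(); }

    private synchronized void free() {
        if (!freed) {
            freed = true; // real code: Unsafe.freeMemory(address)
        }
    }

    public static void main(String[] args) {
        try (FilterWrapper w = new FilterWrapper()) {
            System.out.println("in use: freed=" + w.isFreed());
        }
        // try-with-resources already called close(); a later finalize() is a no-op.
    }
}
```

Note that `finalize` is deprecated in modern JDKs; `java.lang.ref.Cleaner` is the usual replacement for this safety-net role.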
My thought is that implementing UnsafeBitArray in terms of a plain Java long array (`long[]`) should not harm the capabilities or performance of this class, while still allowing these objects to be garbage-collected properly. I think a further possibility is using multiple arrays to avoid huge allocations; the JVM heap (like many other memory allocators) 'suffers' from both tiny and huge allocations, each problematic in its own way.
@alexandrnikitin, what do you think? Is it worth a benchmark? (Perhaps in a separate branch, to avoid the complexity of supporting both code paths side by side.)