Use our own logic for index management #5088

Merged: rnewson merged 1 commit into main on Jun 18, 2024

Conversation

@rnewson (Member) commented Jun 12, 2024

Overview

Use our own logic for index management

Caffeine complains when eviction is slow (because we might commit the index). Build our own logic for opening and closing indexes, which allows opening and closing to be very slow while ensuring we don't close an index that's in use.

Testing recommendations

covered by tests

Related Issues or Pull Requests

N/A

Checklist

  • Code is written and works correctly
  • Changes are covered by tests
  • Any new configurable parameters are documented in rel/overlay/etc/default.ini
  • Documentation changes were made in the src/docs folder
  • Documentation changes were backported (separated PR) to affected branches

holder.state = HolderState.LOADED;
}
} finally {
holder.lock.readLock().lock();
Contributor:
If we hold the write lock we're allowed to take a read lock and now we hold both?

I wonder if holder.lock.readLock().lock(); should be before the } finally { instead of after; in other words, finally should only handle unlocking the write lock we took above on line 141 with holder.lock.writeLock()?

@rnewson (Member Author) commented Jun 13, 2024:

Yes,

Lock downgrading

Reentrancy also allows downgrading from the write lock to a read lock, by acquiring the write lock, then the read lock and then releasing the write lock. However, upgrading from a read lock to the write lock is not possible.

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html
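
For reference, a condensed version of that downgrading pattern, roughly following the javadoc's example; the Holder class, loaded field, and use() method below are illustrative stand-ins rather than the PR's actual code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of write-to-read lock downgrading; not the PR's code.
class Holder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private boolean loaded;

    void use() {
        lock.readLock().lock();
        if (!loaded) {
            // Upgrading read -> write is not possible, so drop the read lock first.
            lock.readLock().unlock();
            lock.writeLock().lock();
            try {
                if (!loaded) {      // re-check: another thread may have loaded it
                    loaded = true;  // e.g. open the index here
                }
                // Downgrade: take the read lock while still holding the write lock.
                lock.readLock().lock();
            } finally {
                lock.writeLock().unlock(); // we continue under the read lock only
            }
        }
        try {
            // read the loaded state under the read lock
        } finally {
            lock.readLock().unlock();
        }
    }
}
```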

@rnewson (Member Author):

the purpose here is to ensure we have a read lock or write lock at all times.

Contributor:

I was thinking that if readLock().lock() could fail (does it throw or get interrupted?) then the next statement, holder.lock.writeLock().unlock();, won't run and we'd be stuck with the write lock held.

@rnewson (Member Author):

ah, I see what you mean. will ponder. the proper spell for this is in those javadocs, will double check

@rnewson (Member Author):

updated. agree that the read lock() should be before the finally for the reason stated. pattern now matches the javadoc example more closely.

Comment on lines +201 to +203
if (forceDelete) {
IOUtils.rm(indexRootPath(name));
}
Contributor:

Are there any assertions we can make about presence in the cache in these NOT_LOADED/UNLOADED states? For instance, would it make sense to assert we're not in the cache at this point?

@rnewson (Member Author):

hm, perhaps. I am also pondering whether we need three states. It is important not to block reading the cache while we load or unload (as these can take many seconds), hence I let you retrieve from the cache under just the cache lock, though it might return a Holder in the NOT_LOADED or UNLOADED state. The first thread to acquire the write lock gets the chore of opening the index.

Perhaps I really don't need a distinct UNLOADED state? It could go back to NOT_LOADED, and then we'd try to load it again, but only if someone has acquired the Holder before it is removed from the cache map itself... Hm, but no, because then we'd have to reinsert into the map. I think a boring lifecycle is better, NOT_LOADED -> LOADED -> UNLOADED (and removed from the cache), and repeat. What do you think?
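
For clarity, the lifecycle under discussion sketched as an enum; the names follow the conversation and may not match the PR's exact identifiers.

```java
// Sketch of the "boring" lifecycle discussed above; names follow the
// conversation, not necessarily the PR's code.
enum HolderState {
    NOT_LOADED, // holder sits in the cache map, index not yet opened; the first
                // thread to take the write lock gets the chore of opening it
    LOADED,     // index is open; readers proceed under the read lock
    UNLOADED    // index has been closed; the holder is removed from the cache
                // map and never transitions back
}
```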

@ben-manes commented Jun 14, 2024

Did you intend to have IndexEvictionListener run atomically within the map operation that removes the entry? In your new logic you defer that work to be asynchronous, so I believe you could use Caffeine.removalListener with cause.wasEvicted() or dispatch it yourself for the same effect. The naming is a little unobvious, unfortunately; that's a characteristic of feature evolution and of keeping familiarity with Guava Cache for straightforward migrations.

Edit:
oops, I missed that removeEldestEntry always returns false to evict the entry asynchronously.

I suppose another alternative would be to atomically evict into a victim cache that the AsyncIndexLoader consults. The AsyncIndexLoader could load from that first if the entry was found, either removing it or optionally having the IndexWeigher give the entry a weight of zero while unloading. When the IndexEvictionListener evicts it to the victim cache it could do your unloading procedure to discard it fully. I don't know whether that dance is any better than your own code that you can more easily understand and control, so I'm just providing options in case it's a helpful thought exercise.
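
For illustration, a minimal sketch of the Caffeine.removalListener route mentioned above; the Index type and the maximum size are assumptions, not part of the PR.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

// Hypothetical Index handle standing in for the real per-index state.
interface Index {
    void close();
}

class CaffeineListenerSketch {
    Cache<String, Index> build() {
        return Caffeine.newBuilder()
                .maximumSize(100)
                // The removal listener runs asynchronously, after the entry has
                // already left the map, so a slow close() does not block other
                // cache operations.
                .removalListener((String name, Index index, RemovalCause cause) -> {
                    if (cause.wasEvicted()) {
                        index.close(); // size/expiry eviction, not an explicit invalidate
                    }
                })
                .build();
    }
}
```

Caffeine's evictionListener, by contrast, runs atomically with the map operation that discards the entry, which is the behaviour the original IndexEvictionListener name suggests.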

@rnewson (Member Author) commented Jun 14, 2024

@ben-manes Hi, and thanks for jumping in (and for developing Caffeine). Yes, I need the eviction to happen atomically (or, at least, no client should try to use the item while it is unloading), and it can be very slow (potentially minutes). I concluded that I was using Caffeine inappropriately; caching stateful objects that have a mutex external to the JVM, a file-based lock, where loading and evicting can be very slow.

I need to ensure that any given cache entry is loaded and not unloading at the time of any interaction. I also can't load the same item twice (Lucene's write.lock is preventing that for very good reasons), which made the race between an eviction and a concurrent get() somewhat fun.

As you've seen I'm just using LinkedHashMap as a way to trigger awareness that we're at capacity, and I perform the unloading later. Happy to just accept that we might go a little over capacity, the trade-off in code simplicity is worth it to me.

Your suggestions are very helpful though, and I might be pulled back to Caffeine for this as I get more results from real world performance (I am not too thrilled about the synchronized(cache) bits, especially when reading the cache). What I like about this PR though is it pulls all of the concurrency logic into IndexManager directly, and there's not too much of it. Time will tell if it holds up, of course.
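
A rough sketch of the LinkedHashMap capacity trigger described above; the class, field, and method names here are hypothetical, and the real IndexManager differs in its details (state checks, locking, retries).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class IndexManagerSketch {

    static class Holder { /* HolderState + ReentrantReadWriteLock, as in the PR */ }

    private final int maxSize = 100;
    private final ExecutorService unloader = Executors.newSingleThreadExecutor();

    // Access-ordered so the eldest entry is the least recently used one.
    private final LinkedHashMap<String, Holder> cache =
            new LinkedHashMap<String, Holder>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Holder> eldest) {
                    if (size() > maxSize) {
                        // Unloading can take minutes, so hand it off rather than
                        // doing it inside the map operation.
                        unloader.execute(() -> unload(eldest.getKey(), eldest.getValue()));
                    }
                    // Always keep the entry for now; it is removed explicitly once
                    // the unload completes, so the map may briefly exceed maxSize.
                    return false;
                }
            };

    private void unload(String name, Holder holder) {
        // take holder's write lock, commit and close the index, mark it UNLOADED...
        synchronized (cache) {
            cache.remove(name);
        }
    }
}
```

The trade-off, as noted above, is that the map can temporarily exceed its capacity while unloads are in flight.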

@ben-manes:

Since the mapping is discarded asynchronously, I suppose removeEldestEntry could try to evict the same one multiple times during a slow unloading? In those cases then once that completes the cache would still exceed its capacity and be stuck there, since the call only happens once on insertion. I think you would need to prune the cache yourself after unloading to avoid a memory leak.

If you need a quick-and-dirty concurrent cache, the simple and classic and earliest approaches are to use a Clock (aka SecondChance) or a random sampling based strategy. These are popular, easy to write and maintain, can have good hit rates, are lock-free on reads, and the longer lock hold time on a write is typically okay for small caches. The exact workloads where they are good and bad varies, so it can require algorithm tuning to maximize hit rates, whereas Caffeine does that automatically.

It sounds like the key insight of your approach is to evict the entry from the eviction policy while asynchronously removing it from the data map. Caffeine works the other way by keeping the data map linearizable (as user-visible) and the policy eventually consistent. If you decided to keep using Caffeine then you could have it act as the policy but store the data in another hash table and use the loader + listener for coordination. You can think of that as an L1 / L2 inclusive cache where L1 pulls from L2 and L2 invalidations are sent to L1. That's not uncommon for local + remote cache layers, but here both would be on-heap.

However you go about it, it sounds like a fun but tricky little problem.
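
For comparison, a bare-bones Clock (second-chance) cache along the lines suggested above; all names are hypothetical, reads only flip a reference bit, a single lock covers writes, and it deliberately ignores the slow-unload problem that motivates this PR.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal Clock (second-chance) cache sketch; hypothetical, not the PR's code.
final class ClockCache<K, V> {

    private static final class Node<V> {
        final V value;
        final AtomicBoolean referenced = new AtomicBoolean(true);
        Node(V value) { this.value = value; }
    }

    private final int capacity;
    private final ConcurrentHashMap<K, Node<V>> map = new ConcurrentHashMap<>();
    private final Queue<K> ring = new ArrayDeque<>(); // the clock hand sweeps this

    ClockCache(int capacity) { this.capacity = capacity; }

    V get(K key) {
        Node<V> node = map.get(key);
        if (node == null) {
            return null;
        }
        node.referenced.set(true); // lock-free read: grant a second chance
        return node.value;
    }

    synchronized void put(K key, V value) {
        if (map.put(key, new Node<>(value)) == null) {
            ring.add(key); // new keys join the ring; updates keep their slot
        }
        while (map.size() > capacity) {
            K candidate = ring.poll();
            if (candidate == null) {
                break; // defensive: nothing left to sweep
            }
            Node<V> node = map.get(candidate);
            if (node == null) {
                continue; // stale ring entry, already evicted
            }
            if (node.referenced.compareAndSet(true, false)) {
                ring.add(candidate); // referenced since the last sweep: spare it
            } else {
                map.remove(candidate); // not referenced: evict
            }
        }
    }
}
```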

@rnewson (Member Author) commented Jun 14, 2024

hmm, maybe. concurrent attempts to cache.get() while we're unloading should all get the IndexHolder while it is either write-locked (currently unloading) or not locked but in the UN_LOADED state (in which event we retry, as the entry should be removed from the cache and the first thread that gets the lock on cache will create a new IndexHolder in NOT_LOADED state). However, it's exactly all these kinds of cases that I wanted to delegate to something else, like Caffeine, so I will be holding off merging this PR while I ponder it some more.

You've given me a lot to think about, and, yes, it is a surprisingly fun little problem to solve. You'd think on my third "full text for couchdb" codebase that I'd have a solid solution to it by now. alas...

@nickva (Contributor) left a comment:

+1

The general idea seems sound. I lost most of my Java knowledge a while back, so may not have caught all the subtle details.

@rnewson (Member Author) commented Jun 18, 2024

thanks Nick. I expect we'll find a corner case but at least we have all the logic in one place, and it's quite small. Time will tell, and we clearly have some other options to try from what Ben has contributed.

@rnewson merged commit 3dee4e7 into main on Jun 18, 2024
17 checks passed
@ben-manes:

@rnewson did you review whether there is a memory leak, as I suspected? The scenario is:

  1. Insert N items
  2. Insert key N+1 to trigger an eviction for key K0 (removeEldestEntry => false)
  3. Unload starts
  4. Insert K entries (removeEldestEntry => false, still selects key K0)
  5. Unload finished
  6. Insert key X => what is evicted?

I believe since removeEldestEntry returns false it will be stuck growing to higher thresholds based on the number of observed concurrent evictions. Since unloading is slow, over time the cache could greatly exceed your threshold. That excess won't be unloaded since only the eldest in-flight unload is observable.
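
Sketching the fix against the hypothetical IndexManagerSketch above: once an unload finishes and its entry is removed, check whether the map is still over capacity and schedule the next eldest entry too, so any excess built up during a slow unload eventually drains.

```java
// Hypothetical extension of unload() from the earlier sketch: after removing
// the finished entry, keep draining while the map is still over capacity.
private void unload(String name, Holder holder) {
    // ... take holder's write lock, commit and close the index, mark it UNLOADED ...
    Map.Entry<String, Holder> next = null;
    synchronized (cache) {
        cache.remove(name);
        if (cache.size() > maxSize) {
            next = cache.entrySet().iterator().next(); // eldest in access order
        }
    }
    if (next != null) {
        Map.Entry<String, Holder> eldest = next; // effectively final for the lambda
        // with a single-threaded unloader this cannot double-schedule an in-flight entry
        unloader.execute(() -> unload(eldest.getKey(), eldest.getValue()));
    }
}
```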

@rnewson (Member Author) commented Jun 18, 2024

ah, oops. will check, and fix in subsequent PR. thanks.

@rnewson deleted the decaffeinated branch June 20, 2024 13:54