fix(robot-server): memory cache analyses #12648
Conversation
Getting analysis results can really bog down the robot server - in fact, it can block the event loop such that new HTTP requests get stalled, and that means that apps will mark the server unhealthy.

Recently, the app started asking for analyses much more frequently, which would cause the robot server to bog down more frequently, which would cause the app to mark the robot unhealthy more frequently. This was an unintended change, and reverting it fixes the robot unhealthiness; but getting analysis results still threatens to bog down the server, so this memcache should fix the problem on the server.

We have similar problems (and solutions) for other heavy resources like runs and protocols, but there we do the caching with functools.lru_cache. That is not possible for analysis results because we get them with async functions, for a very good reason; so instead, we need a custom cache implementation.

This cache implementation uses a Python dict to hold the result objects. Since there are limited places to add results, we can add things to the cache when they're added to the database and when we fetch them from the database. We use a deque to maintain a FIFO for cache eviction.
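The PR itself contains the real implementation; purely as an illustration of the dict-plus-deque design described above, a minimal sketch might look like the following. The `max_size` parameter and internal attribute names are assumptions on my part, though `MemoryCache`, `get`, and `insert` do appear in the PR's tests.

```python
from collections import deque
from typing import Deque, Dict, Generic, TypeVar

_K = TypeVar("_K")
_V = TypeVar("_V")


class MemoryCache(Generic[_K, _V]):
    """A bounded in-memory cache with FIFO eviction.

    Deliberately synchronous: no operation awaits, so concurrent async
    callers can never interleave in the middle of a cache operation.
    """

    def __init__(self, max_size: int) -> None:
        self._max_size = max_size
        self._store: Dict[_K, _V] = {}
        self._order: Deque[_K] = deque()  # insertion order, for FIFO eviction

    def get(self, key: _K) -> _V:
        """Return the cached value, raising KeyError on a miss."""
        return self._store[key]

    def insert(self, key: _K, value: _V) -> None:
        """Add a value, evicting the oldest entry once the cache is full."""
        if key in self._store:
            return  # already cached; keep the original insertion order
        if len(self._order) >= self._max_size:
            eldest = self._order.popleft()
            del self._store[eldest]
        self._order.append(key)
        self._store[key] = value
```

Because eviction is FIFO rather than LRU, a `get` does not refresh an entry's position; that is a simpler policy than `functools.lru_cache`'s, which seems consistent with the deque-based description above.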
    uncached_analyses = analysis_set - cached_analyses

    if uncached_analyses:
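The excerpt above splits a batch request into cache hits and IDs that still need a SQL round trip. A hedged sketch of that split, assuming a cache whose lookups raise `KeyError` on a miss (as elsewhere in this PR); the `split_cached` helper name is hypothetical, not from the PR:

```python
from typing import Dict, Iterable, Set, Tuple


def split_cached(
    analysis_ids: Iterable[str], cache: Dict[str, object]
) -> Tuple[Dict[str, object], Set[str]]:
    """Partition requested IDs into cache hits and IDs needing a SQL fetch."""
    hits: Dict[str, object] = {}
    for analysis_id in analysis_ids:
        try:
            hits[analysis_id] = cache[analysis_id]  # dict stands in for the cache
        except KeyError:
            pass
    # Same set arithmetic as the diff above: whatever isn't cached is uncached.
    uncached_analyses = set(analysis_ids) - set(hits)
    return hits, uncached_analyses
```

Only the `uncached_analyses` set then needs to hit the database.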
Is there a significant performance hit from doing this SQL query and caching here now? I realize this will only get hit once per protocol per session (if I'm reading and understanding this right), but I'm curious whether it will be noticeable.
There shouldn't be, I'd think, because this is only fetching the IDs and not any kind of significant data. The SQL result should only be the IDs.
_log.error(
    f"cache order queue was empty with {len(self._completed_analysis_cache_by_analysis_id)} elements in cache"
)
eldest = "<incorrect because cache order was empty>"
What's the rationale for having this set here and then immediately trying to delete that key from the dictionary, if it's just going to raise another error again?
It makes the logic less complex: each error check tests precisely one thing. One error case does lead to the other, but in that case we truly are in that error case. If we hate it, I can factor it into a function that returns early.
By splitting the implementation to add
- a public module for CompletedAnalysisStore, which has a public interface that can be tested, and
- a public module for the memory cache, which has a public interface that can be tested and injected into CompletedAnalysisStore,

we can get back to testing only public interfaces, which is really nice. At the same time, we can fix the race condition where get_by_protocol_id wasn't as reentrant as it could be. Aside from that method, since this is async code, we should be access-safe because the memory cache is synchronous.
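To illustrate the injection described above: a sketch of a store that takes the memory cache as a constructor argument and consults it before the async SQL read. The `FakeMemoryCache` class, the `_fetch_from_sql` helper, and the string-valued resources are placeholders rather than the real types; `get_by_id`, `get`, and `insert` match names used in the PR's tests.

```python
import asyncio
from typing import Dict


class FakeMemoryCache:
    """Stand-in for the PR's MemoryCache: get() raises KeyError on a miss."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> str:
        return self._store[key]

    def insert(self, key: str, value: str) -> None:
        self._store[key] = value


class CompletedAnalysisStoreSketch:
    """Checks the injected cache before falling back to an async SQL read."""

    def __init__(self, memcache: FakeMemoryCache) -> None:
        self._memcache = memcache

    async def get_by_id(self, analysis_id: str) -> str:
        # The cache is synchronous, so there is no await point between the
        # lookup and the return - no other task can interleave here.
        try:
            return self._memcache.get(analysis_id)
        except KeyError:
            pass
        resource = await self._fetch_from_sql(analysis_id)
        self._memcache.insert(analysis_id, resource)  # cache for next time
        return resource

    async def _fetch_from_sql(self, analysis_id: str) -> str:
        await asyncio.sleep(0)  # placeholder for the real database read
        return f"resource-for-{analysis_id}"
```

Injecting the cache this way is also what lets the tests below substitute a Decoy mock for the real `MemoryCache`.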
I love this change!!
Thanks! This looks good to me architecturally. Here are a few low-level suggestions.
async def test_get_by_analysis_id_caches_results(
    subject: CompletedAnalysisStore,
    sql_engine: Engine,
    memcache: MemoryCache[str, CompletedAnalysisResource],
    protocol_store: ProtocolStore,
    decoy: Decoy,
) -> None:
    """It should cache successful fetches from sql."""
    resource = _completed_analysis_resource("analysis-id", "protocol-id")
    protocol_store.insert(make_dummy_protocol_resource("protocol-id"))
    await subject.add(resource)
    # the analysis is not cached
    decoy.when(memcache.get("analysis-id")).then_raise(KeyError())
    from_sql = await subject.get_by_id("analysis-id")
    assert from_sql == resource
    decoy.verify(memcache.insert("analysis-id", from_sql))
This is a departure from Robot Services' test and architectural conventions, but it's a departure that I've been thinking we should try, anyway. No action needed—I'll dump my thoughts in Slack.
Smoke-tested on a robot by navigating around.
Thank you!
Review requests
Caches like this aren't really my wheelhouse; think about the behavior here and whether it's correct.
Testing
Supersedes #12640