fix(robot-server): memory cache analyses #12648
Conversation
Getting analysis results can really bog down the robot server - in fact, it can block the event loop such that new HTTP requests get stalled, and that means that apps will mark the server unhealthy.

Recently, the app started asking for analyses much more frequently, which would cause the robot server to bog down more frequently, which would cause the app to mark the robot unhealthy more frequently. This was an unintended change, and reverting it fixes the robot unhealthiness; but getting analysis results still threatens to bog down the server, so this memcache should fix the problem on the server.

We have similar problems (and solutions) for other heavy resources like runs and protocols, but there we do the caching with functools.lru_cache. That is not possible for analysis results because we get them with async functions, for a very good reason; so instead, we need a custom cache implementation.

This cache implementation uses a Python dict to hold the result objects. Since there are limited places to add results, we can add things to the cache when they're added to the database and when we fetch them from the database. We use a deque to maintain a FIFO for cache eviction.
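The PR itself contains the real implementation; purely as an illustration of the dict-plus-deque design described above, a minimal sketch might look like the following. The `max_size` parameter and internal attribute names are assumptions on my part, though `MemoryCache`, `get`, and `insert` do appear in the PR's tests.

```python
from collections import deque
from typing import Deque, Dict, Generic, TypeVar

_K = TypeVar("_K")
_V = TypeVar("_V")


class MemoryCache(Generic[_K, _V]):
    """A bounded in-memory cache with FIFO eviction.

    Deliberately synchronous: no operation awaits, so concurrent async
    callers can never interleave in the middle of a cache operation.
    """

    def __init__(self, max_size: int) -> None:
        self._max_size = max_size
        self._store: Dict[_K, _V] = {}
        self._order: Deque[_K] = deque()  # insertion order, for FIFO eviction

    def get(self, key: _K) -> _V:
        """Return the cached value, raising KeyError on a miss."""
        return self._store[key]

    def insert(self, key: _K, value: _V) -> None:
        """Add a value, evicting the oldest entry once the cache is full."""
        if key in self._store:
            return  # already cached; keep the original insertion order
        if len(self._order) >= self._max_size:
            eldest = self._order.popleft()
            del self._store[eldest]
        self._order.append(key)
        self._store[key] = value
```

Because eviction is FIFO rather than LRU, a `get` does not refresh an entry's position; that is a simpler policy than `functools.lru_cache`'s, which seems consistent with the deque-based description above.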
    uncached_analyses = analysis_set - cached_analyses

    if uncached_analyses:
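The excerpt above splits a batch request into cache hits and IDs that still need a SQL round trip. A hedged sketch of that split, assuming a cache whose lookups raise `KeyError` on a miss (as elsewhere in this PR); the `split_cached` helper name is hypothetical, not from the PR:

```python
from typing import Dict, Iterable, Set, Tuple


def split_cached(
    analysis_ids: Iterable[str], cache: Dict[str, object]
) -> Tuple[Dict[str, object], Set[str]]:
    """Partition requested IDs into cache hits and IDs needing a SQL fetch."""
    hits: Dict[str, object] = {}
    for analysis_id in analysis_ids:
        try:
            hits[analysis_id] = cache[analysis_id]  # dict stands in for the cache
        except KeyError:
            pass
    # Same set arithmetic as the diff above: whatever isn't cached is uncached.
    uncached_analyses = set(analysis_ids) - set(hits)
    return hits, uncached_analyses
```

Only the `uncached_analyses` set then needs to hit the database.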
Is there a significant performance hit from doing this SQL query and caching here now? I realize this will only get hit once per protocol per session (if I'm reading and understanding this right), but I'm curious whether it will be noticeable.
There shouldn't be, I'd think, because this is only fetching the IDs and not any kind of significant data. The SQL result should only be the IDs.
_log.error(
    f"cache order queue was empty with {len(self._completed_analysis_cache_by_analysis_id)} elements in cache"
)
eldest = "<incorrect because cache order was empty>"
What's the rationale for having this set here and then immediately trying to delete that key from the dictionary, if it's just going to raise another error again?
It makes the logic less complex: each error check tests precisely one thing. One error case does lead to the other, but in that case we truly are in that error case. If we hate it, I can factor it into a function that returns early.
By splitting the implementation to add
- a public module for CompletedAnalysisStore, which has a public interface that can be tested, and
- a public module for the memory cache, which has a public interface that can be tested and injected into CompletedAnalysisStore,

we can get back to testing only public interfaces, which is really nice. At the same time, we can fix the race condition where get_by_protocol_id wasn't as reentrant as it could be. Aside from that method, since this is async code, we should be access-safe because the memory cache is synchronous.
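To illustrate the injection described above: a sketch of a store that takes the memory cache as a constructor argument and consults it before the async SQL read. The `FakeMemoryCache` class, the `_fetch_from_sql` helper, and the string-valued resources are placeholders rather than the real types; `get_by_id`, `get`, and `insert` match names used in the PR's tests.

```python
import asyncio
from typing import Dict


class FakeMemoryCache:
    """Stand-in for the PR's MemoryCache: get() raises KeyError on a miss."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> str:
        return self._store[key]

    def insert(self, key: str, value: str) -> None:
        self._store[key] = value


class CompletedAnalysisStoreSketch:
    """Checks the injected cache before falling back to an async SQL read."""

    def __init__(self, memcache: FakeMemoryCache) -> None:
        self._memcache = memcache

    async def get_by_id(self, analysis_id: str) -> str:
        # The cache is synchronous, so there is no await point between the
        # lookup and the return - no other task can interleave here.
        try:
            return self._memcache.get(analysis_id)
        except KeyError:
            pass
        resource = await self._fetch_from_sql(analysis_id)
        self._memcache.insert(analysis_id, resource)  # cache for next time
        return resource

    async def _fetch_from_sql(self, analysis_id: str) -> str:
        await asyncio.sleep(0)  # placeholder for the real database read
        return f"resource-for-{analysis_id}"
```

Injecting the cache this way is also what lets the tests below substitute a Decoy mock for the real `MemoryCache`.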
I love this change!!
Thanks! This looks good to me architecturally. Here are a few low-level suggestions.
async def test_get_by_analysis_id_caches_results(
    subject: CompletedAnalysisStore,
    sql_engine: Engine,
    memcache: MemoryCache[str, CompletedAnalysisResource],
    protocol_store: ProtocolStore,
    decoy: Decoy,
) -> None:
    """It should cache successful fetches from sql."""
    resource = _completed_analysis_resource("analysis-id", "protocol-id")
    protocol_store.insert(make_dummy_protocol_resource("protocol-id"))
    await subject.add(resource)
    # the analysis is not cached
    decoy.when(memcache.get("analysis-id")).then_raise(KeyError())
    from_sql = await subject.get_by_id("analysis-id")
    assert from_sql == resource
    decoy.verify(memcache.insert("analysis-id", from_sql))
This is a departure from Robot Services' test and architectural conventions, but it's a departure that I've been thinking we should try, anyway. No action needed—I'll dump my thoughts in Slack.
Smoke-tested on a robot by navigating around.
Thank you!
Review requests
Caches like this aren't really my wheelhouse; think about the behavior here and whether it's correct.
Testing
Supersedes #12640