
fix(robot-server): memory cache analyses #12648

Merged: sfoster1 merged 4 commits into edge from analysis-memcache on May 10, 2023
Conversation

@sfoster1 (Member) commented May 5, 2023

Getting analysis results can really bog down the robot server. In fact, it can block the event loop badly enough that new HTTP requests stall, which makes apps mark the server unhealthy.

Recently, the app started asking for analyses much more frequently, which made the robot server bog down more often, which in turn made the app mark the robot unhealthy more often. That was an unintended change, and reverting it fixes the unhealthiness; but getting analysis results still threatens to bog down the server, so this memcache should fix the problem on the server side.

We have similar problems (and solutions) for other heavy resources like runs and protocols, but there we do the caching with functools.lru_cache. That isn't possible for analysis results because we fetch them with async functions, for a very good reason: lru_cache on an async function would memoize the one-shot coroutine object rather than its result. So instead, we need a custom cache implementation.
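
For illustration, a minimal sketch (not from this PR) of the pitfall: functools.lru_cache memoizes whatever the wrapped callable returns, and calling an async function returns a coroutine object that can only be awaited once.

```python
import asyncio
import functools


@functools.lru_cache(maxsize=32)
async def fetch_analysis(analysis_id: str) -> str:
    await asyncio.sleep(0)  # stand-in for an async database read
    return f"analysis for {analysis_id}"


async def main() -> None:
    print(await fetch_analysis("abc"))  # works: a fresh coroutine
    # A cache hit returns the same, already-awaited coroutine object, so
    # this raises RuntimeError: cannot reuse already awaited coroutine.
    print(await fetch_analysis("abc"))


asyncio.run(main())
```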

This cache implementation uses a Python dict to hold the result objects. Since there are only a few places where results appear, we can insert into the cache both when results are added to the database and when they're fetched from it. A deque maintains FIFO order for cache eviction, roughly in the shape sketched below.
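
A minimal sketch of that shape (the PR's class is called MemoryCache, per the tests further down; the implementation details here are assumed, not copied):

```python
from collections import deque
from typing import Deque, Dict, Generic, TypeVar

_K = TypeVar("_K")
_V = TypeVar("_V")


class FIFOCache(Generic[_K, _V]):
    """Bounded in-memory cache with first-in-first-out eviction."""

    def __init__(self, size_limit: int) -> None:
        self._size_limit = size_limit
        self._values: Dict[_K, _V] = {}
        self._order: Deque[_K] = deque()

    def get(self, key: _K) -> _V:
        return self._values[key]  # raises KeyError on a miss

    def insert(self, key: _K, value: _V) -> None:
        if key in self._values:
            return  # already cached; keep its original eviction slot
        if len(self._values) >= self._size_limit:
            eldest = self._order.popleft()  # oldest insertion evicted first
            self._values.pop(eldest, None)
        self._values[key] = value
        self._order.append(key)
```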

Review requests

Caches like this aren't really my wheelhouse; think about the behavior here and whether it's correct.

Testing

  • Put this on a robot with a lot of protocols stored, retrieve analyses, and see if it falls over.

Supersedes #12640

@sfoster1 sfoster1 requested a review from a team as a code owner May 5, 2023 16:02

```python
}
uncached_analyses = analysis_set - cached_analyses

if uncached_analyses:
```

Contributor:

Is there a significant performance hit from doing this SQL query and caching here now? I realize this will only get hit once per protocol per session (if I'm reading and understanding this right), but I'm curious whether it will be noticeable.

Member (Author):

There shouldn't be, I'd think, because this only fetches the ids and not any significant data; the SQL result should contain only the ids.
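
A hypothetical sketch of what an ids-only query looks like (table, column, and variable names are assumed, not copied from the PR): only primary keys cross the wire, so the result stays small regardless of how large each analysis document is.

```python
import sqlalchemy

# analysis_table, protocol_id, and sql_engine are hypothetical stand-ins.
statement = sqlalchemy.select(analysis_table.c.id).where(
    analysis_table.c.protocol_id == protocol_id
)
with sql_engine.begin() as transaction:
    analysis_ids = set(transaction.execute(statement).scalars().all())
```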

```python
_log.error(
    f"cache order queue was empty with {len(self._completed_analysis_cache_by_analysis_id)} elements in cache"
)
eldest = "<incorrect because cache order was empty>"
```

Contributor:

What's the rationale for setting this here and then immediately trying to delete that key from the dictionary, if it's just going to raise another error again?

Member (Author):

It makes the logic less complex: each error check tests precisely one thing. One error case does lead to the other, but in that case we genuinely are in that second error case. If we hate it, I can factor it into a function that returns early (sketched below).
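
A hypothetical sketch of that early-return version (names are borrowed from the snippet above where possible; _cache_order is assumed, and this is not the merged code):

```python
def _evict_eldest(self) -> None:
    try:
        eldest = self._cache_order.popleft()
    except IndexError:
        _log.error(
            f"cache order queue was empty with "
            f"{len(self._completed_analysis_cache_by_analysis_id)} elements in cache"
        )
        return  # nothing sensible to evict; bail out early
    try:
        del self._completed_analysis_cache_by_analysis_id[eldest]
    except KeyError:
        _log.error(f"{eldest} was in the cache order queue but not in the cache")
```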

By splitting the implementation to add

- a public module for CompletedAnalysisStore, which has a public interface that can be tested, and
- a public module for the memory cache, which has a public interface that can be tested and injected into CompletedAnalysisStore,

we can get back to testing only public interfaces, which is really nice.

At the same time, we can fix the race condition where get_by_protocol_id wasn't as reentrant as it could be. Aside from that method, access should be safe even though this is async code, because the memory cache itself is synchronous, so no task switch can interleave with a cache operation (see the sketch below).
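
A sketch of the seam this commit describes (signatures are assumptions, not the PR's exact API): the synchronous MemoryCache is injected into the async CompletedAnalysisStore, which reads through it to SQL on a miss.

```python
class CompletedAnalysisStore:
    def __init__(
        self,
        sql_engine: "Engine",
        memcache: "MemoryCache[str, CompletedAnalysisResource]",
    ) -> None:
        self._sql_engine = sql_engine
        self._memcache = memcache

    async def get_by_id(self, analysis_id: str) -> "CompletedAnalysisResource":
        try:
            # The cache is synchronous: no await point here, so no other
            # task can interleave between the lookup and the return.
            return self._memcache.get(analysis_id)
        except KeyError:
            pass
        resource = await self._fetch_from_sql(analysis_id)  # hypothetical helper
        self._memcache.insert(analysis_id, resource)
        return resource
```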

@TamarZanzouri (Contributor) left a comment:

I love this change!!

@SyntaxColoring (Contributor) left a comment:

Thanks! This looks good to me architecturally. Here are a few low-level suggestions.

Comment on lines +128 to +143

```python
async def test_get_by_analysis_id_caches_results(
    subject: CompletedAnalysisStore,
    sql_engine: Engine,
    memcache: MemoryCache[str, CompletedAnalysisResource],
    protocol_store: ProtocolStore,
    decoy: Decoy,
) -> None:
    """It should cache successful fetches from sql."""
    resource = _completed_analysis_resource("analysis-id", "protocol-id")
    protocol_store.insert(make_dummy_protocol_resource("protocol-id"))
    await subject.add(resource)
    # the analysis is not cached
    decoy.when(memcache.get("analysis-id")).then_raise(KeyError())
    from_sql = await subject.get_by_id("analysis-id")
    assert from_sql == resource
    decoy.verify(memcache.insert("analysis-id", from_sql))
```

Contributor:

This is a departure from Robot Services' test and architectural conventions, but it's a departure that I've been thinking we should try, anyway. No action needed—I'll dump my thoughts in Slack.

@SyntaxColoring (Contributor):

Smoke-tested on a robot by navigating around. The journalctl logs seem to confirm that the cache is correctly eliding database access.

@sfoster1 sfoster1 changed the base branch from internal-release_0.6.0 to edge May 10, 2023 19:44
@sfoster1 (Member, Author):

> Smoke-tested on a robot by navigating around. The journalctl logs seem to confirm that the cache is correctly eliding database access.

Thank you!

@sfoster1 sfoster1 merged commit 056f44b into edge May 10, 2023
5 checks passed
@sfoster1 sfoster1 deleted the analysis-memcache branch May 10, 2023 19:57