Occasional crash of historical perf test #6291

Closed
maxtropets opened this issue Jun 25, 2024 · 1 comment · Fixed by #6307

maxtropets commented Jun 25, 2024

While developing #3234, an occasional, hard-to-reproduce crash occurred multiple times.

The WIP is saved in the branch https://github.com/maxtropets/CCF/tree/f/6291-repro.
Stack traces from the test output are committed there (repro_[X] files in the repo root).

Observations

  • first reproduced occasionally while adding more code, which potentially slowed execution compared to previous runs
  • reproduces occasionally but fairly consistently (roughly a 30% chance) on the historical perf test: ./tests.sh -VV -R historical_query_perf_test --repeat-until-fail 10 > out.txt
  • we kept chopping out the new code until we were back on the main branch, and it still reproduced
  • it disappeared completely after a full rebuild (rm -rf build && mkdir build && cd build && cmake (virtual) && ninja); 100 subsequent runs all passed
  • it randomly appeared again after some code changes, but as soon as we did another full rebuild, we could not reproduce it anymore
  • all crashes happen around dereferencing or incrementing the iterator over cached states, e.g. https://github.com/maxtropets/CCF/blob/f/6291-repro/src/node/historical_queries.h#L1410
  • tried running with DSAN; it did not reproduce on a fresh build after 100 iterations

Thoughts

  • could have been a partial-rebuild problem, but we are not aware of any ninja bugs like that
  • could also have been a real issue: we searched for async code but did not find anything; we suspected a potential race, which would explain both the corrupted iterators and the crashes on operator++, dereference, etc. (see the sketch after this list)
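
For context, here is a minimal hypothetical sketch (not CCF code; all names are made up) of the kind of race we were looking for: if another thread erased entries from the map of cached states while the request path iterated it, the iterator would be invalidated and the crash would surface in operator++ or the dereference, matching the stack traces. The version below shows the locking that would normally rule this out.

```cpp
// Hypothetical sketch, not CCF code. Without the lock shown here, an erase()
// on one thread invalidates the iterator another thread is advancing, and the
// crash then shows up in operator++ or dereference, much like the observed traces.
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

struct StoreDetails {}; // stand-in for a cached historical state

class StateCache
{
  std::mutex lock;
  std::map<std::size_t, std::shared_ptr<StoreDetails>> cached_states;

public:
  void add(std::size_t seqno)
  {
    std::lock_guard<std::mutex> guard(lock);
    cached_states.emplace(seqno, std::make_shared<StoreDetails>());
  }

  void evict(std::size_t seqno)
  {
    std::lock_guard<std::mutex> guard(lock);
    cached_states.erase(seqno); // safe only because iteration holds the same lock
  }

  std::size_t count_populated()
  {
    std::lock_guard<std::mutex> guard(lock);
    std::size_t n = 0;
    for (auto it = cached_states.begin(); it != cached_states.end(); ++it)
    {
      n += (it->second != nullptr); // dereference is safe under the lock
    }
    return n;
  }
};

int main()
{
  StateCache cache;
  cache.add(1);
  cache.add(2);
  cache.evict(1);
  return cache.count_populated() == 1 ? 0 : 1;
}
```
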
maxtropets added the bug label Jun 25, 2024

maxtropets commented Jun 27, 2024

We eventually found out that the cause was the enclave occasionally falling too far behind the host in message dispatch.

Because the cost of cache::tick depends on the number of stores fetched into memory, a single tick can take longer than the interval between two tick messages sent by the host, so the ticks eventually stack up.
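
As a back-of-the-envelope illustration (all numbers below are assumptions, not measurements from CCF): when the per-tick cost scales with the number of cached stores and exceeds the tick period, the enclave consumes tick messages more slowly than the host produces them, so the backlog grows without bound.

```cpp
// Minimal sketch with made-up numbers: if each tick() costs time proportional
// to the number of cached stores and that cost exceeds the tick period,
// unhandled tick messages accumulate in the enclave's queue.
#include <cstddef>
#include <cstdio>

int main()
{
  const double tick_period_ms = 10.0;   // host sends a tick every 10 ms (assumed)
  const double cost_per_store_us = 5.0; // per-store work inside cache::tick (assumed)
  const std::size_t cached_stores = 4000;

  const double tick_cost_ms = cached_stores * cost_per_store_us / 1000.0; // 20 ms

  // Per second the host produces 100 ticks but the enclave only consumes
  // 1000 / 20 = 50, so the backlog grows by ~50 ticks every second.
  const double produced_per_s = 1000.0 / tick_period_ms;
  const double consumed_per_s = 1000.0 / tick_cost_ms;
  std::printf("backlog growth: %.0f ticks/s\n", produced_per_s - consumed_per_s);
  return 0;
}
```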

When the host sends a process termination request while we still have a huge queue of ticks, handling of the stop message may be delayed by more than 10 seconds, which is the current timeout in the Python process runner. That is why the runner sends SIGSTOP (to take a process memory dump), which catches the process somewhere inside std::map iteration, so the behaviour is essentially correct and there is no functional bug here.

Then the runner terminates the host via SIGKILL and we end up with a test failure. The fix will probably be test-specific, so we will increase the timeout for the historical perf test only. We will not patch the falling-too-far-behind behaviour itself, since we are going to get rid of SGX anyway.
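
To make the escalation sequence concrete, here is a hedged C++/POSIX sketch of the behaviour described above (this is not the actual Python process runner; the 10 s timeout comes from the comment above, while the initial stop signal and helper names are assumptions for illustration): ask the node to stop, wait up to the timeout, freeze it with SIGSTOP so a dump can be taken, then SIGKILL it.

```cpp
// Hedged illustration only, not the real test infrastructure.
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Poll until the child exits or the timeout elapses.
static bool wait_with_timeout(pid_t pid, int timeout_s)
{
  for (int i = 0; i < timeout_s * 10; ++i)
  {
    int status = 0;
    if (waitpid(pid, &status, WNOHANG) == pid)
      return true;
    usleep(100 * 1000); // poll every 100 ms
  }
  return false;
}

static void terminate_node(pid_t pid)
{
  kill(pid, SIGTERM); // polite stop request (assumed signal)
  if (!wait_with_timeout(pid, 10)) // 10 s: the runner's current timeout
  {
    kill(pid, SIGSTOP); // freeze the process so a memory dump can be taken
    // ... take the dump here with an external tool, then:
    kill(pid, SIGKILL); // hard kill; the test run is reported as a failure
    waitpid(pid, nullptr, 0);
  }
}

int main()
{
  const pid_t pid = fork();
  if (pid == 0)
  {
    signal(SIGTERM, SIG_IGN); // child stands in for a node too busy to stop promptly
    sleep(60);
    return 0;
  }
  terminate_node(pid);
  return 0;
}
```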
