Occasional crash of historical perf test #6291

Closed
maxtropets opened this issue Jun 25, 2024 · 1 comment · Fixed by #6307

maxtropets commented Jun 25, 2024

While developing #3234, an occasional, hard-to-reproduce crash occurred multiple times.

The WIP is saved in the branch https://github.com/maxtropets/CCF/tree/f/6291-repro.
Stack traces from the test output are committed there (repro_[X] files in the repo root).

Observations

  • first reproduced occasionally while adding more code, which potentially slowed execution compared to previous runs
  • reproduces occasionally but fairly consistently (roughly a 30% chance) on the historical perf test: ./tests.sh -VV -R historical_query_perf_test --repeat-until-fail 10 > out.txt
  • we kept chopping out the new code until we were back on the main branch, and it still reproduced
  • it disappeared completely after a full rebuild (rm -rf build && mkdir build && cd build && cmake (virtual) && ninja); 100 subsequent runs all passed
  • it randomly appeared again after some code changes, but as soon as we did another full rebuild, we could not reproduce it anymore
  • all crashes happen around dereferencing or incrementing the iterator over cached states, e.g. https://github.com/maxtropets/CCF/blob/f/6291-repro/src/node/historical_queries.h#L1410
  • tried running with DSAN; it did not reproduce on a fresh build after 100 iterations

Thoughts

  • could have been a partial-rebuild problem, but we are not aware of any ninja bugs like that
  • could also have been a real issue: we searched for async code but did not find anything; we suspected a potential race, which would explain both the corrupted iterators and the crashes on operator++, dereference, etc. (see the sketch after this list)
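
For context, here is a minimal hypothetical sketch (not CCF code; all names are made up) of the kind of race we were looking for: if another thread erased entries from the map of cached states while the request path iterated it, the iterator would be invalidated and the crash would surface in operator++ or the dereference, matching the stack traces. The version below shows the locking that would normally rule this out.

```cpp
// Hypothetical sketch, not CCF code. Without the lock shown here, an erase()
// on one thread invalidates the iterator another thread is advancing, and the
// crash then shows up in operator++ or dereference, much like the observed traces.
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

struct StoreDetails {}; // stand-in for a cached historical state

class StateCache
{
  std::mutex lock;
  std::map<std::size_t, std::shared_ptr<StoreDetails>> cached_states;

public:
  void add(std::size_t seqno)
  {
    std::lock_guard<std::mutex> guard(lock);
    cached_states.emplace(seqno, std::make_shared<StoreDetails>());
  }

  void evict(std::size_t seqno)
  {
    std::lock_guard<std::mutex> guard(lock);
    cached_states.erase(seqno); // safe only because iteration holds the same lock
  }

  std::size_t count_populated()
  {
    std::lock_guard<std::mutex> guard(lock);
    std::size_t n = 0;
    for (auto it = cached_states.begin(); it != cached_states.end(); ++it)
    {
      n += (it->second != nullptr); // dereference is safe under the lock
    }
    return n;
  }
};

int main()
{
  StateCache cache;
  cache.add(1);
  cache.add(2);
  cache.evict(1);
  return cache.count_populated() == 1 ? 0 : 1;
}
```
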
maxtropets added the bug label Jun 25, 2024

maxtropets commented Jun 27, 2024

We eventually found out that the cause was the enclave occasionally falling too far behind the host in message dispatch.

Because the cost of cache::tick depends on the number of stores fetched into memory, a single tick can take longer than the interval between two tick messages sent by the host, so the ticks eventually stack up.
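
As a back-of-the-envelope illustration (all numbers below are assumptions, not measurements from CCF): when the per-tick cost scales with the number of cached stores and exceeds the tick period, the enclave consumes tick messages more slowly than the host produces them, so the backlog grows without bound.

```cpp
// Minimal sketch with made-up numbers: if each tick() costs time proportional
// to the number of cached stores and that cost exceeds the tick period,
// unhandled tick messages accumulate in the enclave's queue.
#include <cstddef>
#include <cstdio>

int main()
{
  const double tick_period_ms = 10.0;   // host sends a tick every 10 ms (assumed)
  const double cost_per_store_us = 5.0; // per-store work inside cache::tick (assumed)
  const std::size_t cached_stores = 4000;

  const double tick_cost_ms = cached_stores * cost_per_store_us / 1000.0; // 20 ms

  // Per second the host produces 100 ticks but the enclave only consumes
  // 1000 / 20 = 50, so the backlog grows by ~50 ticks every second.
  const double produced_per_s = 1000.0 / tick_period_ms;
  const double consumed_per_s = 1000.0 / tick_cost_ms;
  std::printf("backlog growth: %.0f ticks/s\n", produced_per_s - consumed_per_s);
  return 0;
}
```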

When the host sends a process termination request while we still have a huge queue of ticks, handling of the stop message may be delayed by more than 10 seconds, which is the current timeout in the Python process runner. That is why the runner sends SIGSTOP (to take a process memory dump), which catches the process somewhere inside std::map iteration, so the behaviour is essentially correct and there is no functional bug here.

Then the runner terminates the host via SIGKILL and we end up with a test failure. The fix will probably be test-specific, so we will increase the timeout for the historical perf test only. We will not patch the falling-too-far-behind behaviour itself, since we are going to get rid of SGX anyway.
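
To make the escalation sequence concrete, here is a hedged C++/POSIX sketch of the behaviour described above (this is not the actual Python process runner; the 10 s timeout comes from the comment above, while the initial stop signal and helper names are assumptions for illustration): ask the node to stop, wait up to the timeout, freeze it with SIGSTOP so a dump can be taken, then SIGKILL it.

```cpp
// Hedged illustration only, not the real test infrastructure.
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Poll until the child exits or the timeout elapses.
static bool wait_with_timeout(pid_t pid, int timeout_s)
{
  for (int i = 0; i < timeout_s * 10; ++i)
  {
    int status = 0;
    if (waitpid(pid, &status, WNOHANG) == pid)
      return true;
    usleep(100 * 1000); // poll every 100 ms
  }
  return false;
}

static void terminate_node(pid_t pid)
{
  kill(pid, SIGTERM); // polite stop request (assumed signal)
  if (!wait_with_timeout(pid, 10)) // 10 s: the runner's current timeout
  {
    kill(pid, SIGSTOP); // freeze the process so a memory dump can be taken
    // ... take the dump here with an external tool, then:
    kill(pid, SIGKILL); // hard kill; the test run is reported as a failure
    waitpid(pid, nullptr, 0);
  }
}

int main()
{
  const pid_t pid = fork();
  if (pid == 0)
  {
    signal(SIGTERM, SIG_IGN); // child stands in for a node too busy to stop promptly
    sleep(60);
    return 0;
  }
  terminate_node(pid);
  return 0;
}
```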
