Large Domain on Derecho Runs Out of Memory #741

Open
scrasmussen opened this issue Mar 13, 2024 · 0 comments

There have been reports that running a long job on Derecho with a large domain causes the first rank to use up all available memory and crash.
The image below was made using ARM Linaro Forge and gives an example of the memory pattern across a node.
image

Expected Behavior

Large-domain jobs run successfully for extended periods of time.

Current Behavior

The job crashes after memory usage on the first rank rises continually.

Possible Solution

I've used Valgrind to track down some memory warnings/errors and made some possible fixes (see the debug/valgrind-errors branch). More testing needs to be done to see whether they fix the issue.

Steps to Reproduce (for bugs)

  1. Use a large domain
  2. Run for a long time
  3. Track memory usage on the first rank: does it stop growing, or does it continue to grow until the crash?
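
For step 3, one rough way to track memory without a profiler is to sample the resident set size of the rank 0 process from `/proc` on the compute node. The sketch below is only an illustration and is not part of the model code: the function names, the sampling window, and the growth tolerance are all hypothetical choices, and it assumes a Linux node where you already know the PID of the first rank.

```python
import re
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kib(f.read())


def still_growing(samples, window=5, tolerance_kib=1024):
    """True if the latest sample exceeds the one `window` samples
    earlier by more than `tolerance_kib` (both values are assumptions)."""
    if len(samples) <= window:
        return False
    return samples[-1] - samples[-1 - window] > tolerance_kib


def monitor(pid, interval_s=60, max_samples=1000):
    """Sample RSS periodically and report whether it is still rising."""
    samples = []
    for _ in range(max_samples):
        rss = read_rss_kib(pid)
        if rss is None:
            break  # process has no VmRSS entry (e.g. it is exiting)
        samples.append(rss)
        print(f"rss={rss} KiB growing={still_growing(samples)}")
        time.sleep(interval_s)
    return samples
```

Calling `monitor(pid)` with the PID of the first rank prints one line per sample; a `growing=True` flag that never clears over a long run would be consistent with the behavior described above, though it does not by itself identify the cause.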

Your Environment

  • Version of the code used: main branch
  • Operating System and version: Derecho, running with one node
  • Compiler and version: Cray 16.0.1