Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Minimal changes to support migration to use AzNHC for Health Checks #745

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

yosoyjay
Copy link
Member

Proposed updates to cc_slurm_nhc to replace use of NHC with AzNHC included in AzHPC VM images.

Changes include removal because they are included in AzNHC:

  • nhc, bandwidthtest and perftest_gdr
  • topo and graph NCCL files
  • Most custom NHC scripts
  • Remove install_nhc.sh as it's not required
  • Remove script to create NHC tarball

Changes to conf files:

  • Remove tests included by default in AzNHC
  • Removal of most file system checks because they are not mounted in the container by default

Changes to scripts:

  • configure_nhc.sh
    • Change /usr/sbin/nhc to /sched/scripts/run_nch.sh as NHC_EXE
    • Remove custom tests included in AzNHC
  • run_nhc.sh
    • Change NHC_CMD to call AzNHC wrapper (run-health-checks.sh) instead of nhc
  • wait_for_nhc.sh
    • Change check for running from grepping for ps to querying Docker
    • Remove check for nhc.status because detached mode is not explicitly supported with AzNHC
  • kill_nhc.sh
    • Change killing NHC process to stopping AzNHC container

These changes will only be supported in Ubuntu AzHPC versions post microsoft-dsvm:ubuntu-hpc:2004:20.04.2024050901 to incorporate a PR in AzNHC so that it returns the exit code.

Open questions / requests for comment (@garvct @vanzod):

  • Do we need to emulate detached mode?
  • Is there a need to include ib_all_reduce for NCv4. Right now, AzNHC does support providing the NCCL graph file, but we can create a PR if required.
  • Are the removed directory / file system tests okay?
  • How should we treat testing of mounts? They end up as volumes in the container, so the RADI tests (check_raid) and mount checks (check_fs_mount_rw) won't work. We could test for existance?
  • Why do some of the fs test check for /data /app?

@yosoyjay yosoyjay requested review from garvct and vanzod May 14, 2024 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant