Refactored PTH DDP env vars creation in SLURM #2206

Merged: 7 commits merged into master on Sep 20, 2021

Conversation

@vfdev-5 (Collaborator) commented Sep 19, 2021

Fixes #2202

Related to #2048

Description:

  • Refactored PTH DDP env vars creation in SLURM

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

@github-actions bot added the "module: distributed" (Distributed module) label on Sep 19, 2021

# To cover case 1), let's ensure that defined RANK == SLURM_PROCID, LOCAL_RANK == SLURM_LOCALID,
# WORLD_SIZE == SLURM_NTASKS. We will use defined MASTER_ADDR and MASTER_PORT instead of defining
# them by our means
# To cover case 2), let's check that defined RANK >= SLURM_PROCID, LOCAL_RANK >= SLURM_LOCALID,
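
To make the two cases above concrete, here is a rough sketch of the kind of consistency check this comment describes (illustration only, not the code from this PR; the structure and the error message are assumed):

import os

# Values defined by the user or a launcher such as torch.distributed.launch
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# Values defined by SLURM for this task
slurm_rank = int(os.environ["SLURM_PROCID"])
slurm_local_rank = int(os.environ["SLURM_LOCALID"])
slurm_ntasks = int(os.environ["SLURM_NTASKS"])

# Case 1: env vars were defined per task and must match SLURM's view of that task
case_1 = rank == slurm_rank and local_rank == slurm_local_rank and world_size == slurm_ntasks

# Case 2: a launcher spawned several workers inside a single srun task, so the
# launcher-defined values can only be greater than or equal to the SLURM ones
case_2 = rank >= slurm_rank and local_rank >= slurm_local_rank and world_size >= slurm_ntasks

if not (case_1 or case_2):
    raise RuntimeError(
        "Defined RANK/LOCAL_RANK/WORLD_SIZE conflict with SLURM_PROCID/SLURM_LOCALID/SLURM_NTASKS"
    )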

@sdesrozis (Contributor) commented Sep 20, 2021

If I understand correctly what is done, the idea is to ensure that in such a case the user didn't use srun by mistake:

srun python -m torch.distributed.launch ...

Therefore, every process should have a rank, local rank and world size greater than or equal to what is defined by SLURM.

Is that correct?

@vfdev-5 (Collaborator, Author) replied:

Case 2 covers use cases like:

srun -N1 -n1 -G8 python -m torch.distributed.launch \
        --nproc_per_node=8 --nnodes=1 --node_rank=0 \
        --master_addr="localhost" --master_port=1234 \
        main.py
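
For illustration only (the concrete values are inferred from the flags above, not taken from this PR): SLURM here describes a single task, while torch.distributed.launch defines the distributed env vars for each of its eight workers. A worker could dump both sets like this:

import os

# Expected roughly in this scenario (assumed from the flags above):
#   SLURM_NTASKS=1, SLURM_PROCID=0, SLURM_LOCALID=0   (single srun task)
#   WORLD_SIZE=8, RANK in 0..7, LOCAL_RANK in 0..7    (per launched worker)
#   MASTER_ADDR=localhost, MASTER_PORT=1234
for name in ("SLURM_NTASKS", "SLURM_PROCID", "SLURM_LOCALID",
             "WORLD_SIZE", "RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(name, "=", os.environ.get(name))

Only the >= comparisons hold in this situation, which is what case 2 accepts.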

@sdesrozis (Contributor) commented Sep 20, 2021

Regarding case 2): RANK >= SLURM_PROCID and LOCAL_RANK >= SLURM_LOCALID mean that one process is spawned on the node by srun, but WORLD_SIZE >= SLURM_NTASKS sounds weird. SLURM_NTASKS is the maximum number of tasks checked by SLURM. If WORLD_SIZE is greater than that value, the scheduler should kill the process, because more resources than allocated for the job are used. I think it could be an issue when using gloo.

What do you think?

@sdesrozis (Contributor) added:

Otherwise, it works on my side.

@vfdev-5 (Collaborator, Author) commented Sep 20, 2021

The example above with srun -N1 -n1 allocates SLURM_NTASKS=1, but the launcher creates a world size of 8. That's why WORLD_SIZE >= SLURM_NTASKS. Am I missing something?

See also: https://www.hpcworkshops.com/08-ml-on-parallelcluster/03-distributed-data-parallel.html

@sdesrozis (Contributor) replied:

Yes, but the SLURM scheduler can kill the job if more resources than allocated are used. I suppose it depends on how the scheduler is configured. Imagine you schedule a job defining 4 tasks and in fact use 8; that could be a big issue in production. Anyway, I think that does not really matter. This is a user constraint we can't handle.

@sdesrozis (Contributor) left a review comment:

LGTM!

@sdesrozis merged commit 09599e0 into master on Sep 20, 2021
Labels: module: distributed (Distributed module)

Successfully merging this pull request may close these issues:
  • SLURM and conflicting environment variables

2 participants