Jobstart-related software and information
A. Get Slurm deploy scripts:
- Go to the root directory of the experiment:
$ cd <rootdir>
- Clone deploy scripts
$ git clone https://github.com/artpol84/jobstart.git
- Go to the deploy directory:
$ cd jobstart/slurm_deploy/
- Set up the configuration in
deploy_ctl.conf
NOTE: You need to set INSTALL_DIR to a directory that is unique for each node (like /tmp/slurm_deploy). Otherwise the Slurm daemon instances will conflict over the common files.
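One way to sanity-check the INSTALL_DIR choice is to look at the filesystem type that backs it. The sketch below is only an illustration: the helper names fs_is_shared and check_install_dir are hypothetical (they are not part of the deploy scripts), the list of shared filesystem types is not exhaustive, and it assumes GNU coreutils stat.

```shell
# Return 0 if the given filesystem type name looks like a shared/network
# filesystem. The list is illustrative, not exhaustive.
fs_is_shared() {
    case "$1" in
        nfs*|lustre|gpfs|panfs|ceph|cifs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: report whether a directory sits on a shared filesystem.
# Walks up to the nearest existing parent, because INSTALL_DIR may not exist yet.
check_install_dir() {
    local dir="$1"
    local fstype
    while [ ! -d "$dir" ] && [ "$dir" != "/" ]; do
        dir="$(dirname "$dir")"
    done
    # GNU stat: -f queries the filesystem, %T prints its type in readable form.
    fstype="$(stat -f -c %T "$dir")"
    if fs_is_shared "$fstype"; then
        echo "WARNING: $1 is on $fstype, which is probably shared between nodes"
        return 1
    fi
    echo "OK: $1 is on $fstype"
}
```

For example, after sourcing the functions, run check_install_dir /tmp/slurm_deploy on one of the nodes before starting the build.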
B. Build and start the installation
- Allocate resources:
$ salloc -N <x> -t <y>
- Download all of the packages:
$ ./deploy_cmd.sh source_prepare
- Build and install all of the packages:
$ ./deploy_cmd.sh build_all
- Distribute everything
$ ./deploy_cmd.sh distribute_all
- Configure Slurm. See
jobstart/slurm_deploy/files/slurm.conf.in
for the general configuration, and provide a customization file <local.conf> that describes the control machine and partitions (see jobstart/slurm_deploy/files/local.conf as an example):
$ ./deploy_cmd.sh slurm_config ./files/local.conf
- Start the Slurm instance:
$ ./deploy_cmd.sh slurm_start
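The steps of section B after the allocation can be chained so that a failing step stops the sequence. This is a minimal sketch, not part of the deploy scripts: the function name deploy_all and the DEPLOY_CMD override (useful for a dry run with echo) are assumptions made for illustration. It is meant to be run from jobstart/slurm_deploy/ inside the salloc allocation.

```shell
# Run the build-and-start steps in order, stopping at the first failure.
# DEPLOY_CMD is a hypothetical override; it defaults to ./deploy_cmd.sh.
deploy_all() {
    local conf="${1:-./files/local.conf}"
    local cmd="${DEPLOY_CMD:-./deploy_cmd.sh}"
    "$cmd" source_prepare &&
    "$cmd" build_all &&
    "$cmd" distribute_all &&
    "$cmd" slurm_config "$conf" &&
    "$cmd" slurm_start
}
```

Calling DEPLOY_CMD=echo deploy_all ./files/local.conf prints the sub-commands without running them; calling deploy_all ./files/local.conf runs the real sequence.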
C. Check the installation
NOTE: From another terminal!
- Check that the deployment is functional:
$ export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
$ cd $SLURMDEP_INST/slurm/bin
$ ./sinfo
<check that the output is correct>
- Allocate nodes inside the deployed Slurm installation:
$ ./salloc -N <X> <other options>
- Run hostname to test:
$ ./srun hostname
- Run hostname with the pmix plugin:
$ ./srun --mpi=pmix hostname
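The hostname test can be made a little stricter by counting the distinct hosts that answered and comparing the count with the number of allocated nodes. The helpers below are a hypothetical sketch (count_unique_hosts and expect_hosts are not part of the deployment) and assume that srun places at least one task on every allocated node.

```shell
# Count distinct non-empty lines on stdin (here: hostnames printed by srun).
count_unique_hosts() {
    awk 'NF' | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: succeed only if stdin lists exactly $1 distinct hosts.
expect_hosts() {
    local want="$1"
    local got
    got="$(count_unique_hosts)"
    if [ "$got" -ne "$want" ]; then
        echo "expected $want distinct hosts, got $got" >&2
        return 1
    fi
    echo "ok: $got distinct hosts"
}
```

Inside the allocation this would be used as ./srun hostname | expect_hosts <X>, with <X> being the node count passed to salloc.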
D. Check with the distributed application
NOTE: Run this from within the allocation of the deployed Slurm (same terminal as in C).
- Go to the test app directory
$ cd <rootdir>/jobstart/shmem/
- Compile the program
$ $SLURMDEP_INST/ompi/bin/oshcc -o hello_oshmem_c -g hello_oshmem_c.c # INSTALL_DIR from deploy_ctl.conf
- Launch the application
$ cd <rootdir>/jobstart/launch/
$ ./run.sh {dtcp|ducx|sapi} [early|noearly] [openib] [timing] -N <nnodes> -n <nprocs> <other-slurm-opts> ./hello_oshmem_c
The following set of commands can be used to re-deploy Slurm after the initial allocation was lost:
export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>
./deploy_cmd.sh slurm_stop
./deploy_cmd.sh cleanup_remote
rm --preserve-root ${SLURMDEP_INST}/slurm/tmp/*
rm --preserve-root ${SLURMDEP_INST}/slurm/var/*
./deploy_cmd.sh distribute_all
./deploy_cmd.sh slurm_config
./deploy_cmd.sh slurm_start
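The two rm commands above expand to /slurm/tmp/* and /slurm/var/* if SLURMDEP_INST happens to be empty, so a wrapper that checks the variable first may be worth having. The following is a sketch under assumptions: the function name redeploy and the DEPLOY_CMD override are hypothetical, the -f flag is an addition to the original commands, and any arguments are simply forwarded to slurm_config. It assumes GNU rm.

```shell
# Re-deploy Slurm after the initial allocation was lost.
# DEPLOY_CMD is a hypothetical override; it defaults to ./deploy_cmd.sh.
redeploy() {
    local cmd="${DEPLOY_CMD:-./deploy_cmd.sh}"
    # Refuse to continue with an empty SLURMDEP_INST, so the rm below
    # can never expand to /slurm/tmp/* or /slurm/var/*.
    if [ -z "${SLURMDEP_INST:-}" ]; then
        echo "SLURMDEP_INST is not set" >&2
        return 1
    fi
    # Teardown steps run unconditionally, as in the command list above.
    "$cmd" slurm_stop
    "$cmd" cleanup_remote
    # -f (not in the original commands) tolerates already-empty directories.
    rm --preserve-root -f "${SLURMDEP_INST}"/slurm/tmp/* "${SLURMDEP_INST}"/slurm/var/*
    # Bring-up steps stop at the first failure.
    "$cmd" distribute_all &&
    "$cmd" slurm_config "$@" &&
    "$cmd" slurm_start
}
```

After export SLURMDEP_INST=<INSTALL_DIR from deploy_ctl.conf>, calling redeploy from jobstart/slurm_deploy/ runs the same sequence as the list above.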