Do not run a PAM session for batch job on compute node #20

Open
kkm000 opened this issue Dec 10, 2020 · 1 comment
kkm000 commented Dec 10, 2020

Slurm uses su -l to acquire the identity of the job submitter when executing a batch job on the compute node. By default, su -l opens a PAM session, which in turn causes systemd to create a user session. This is bad for two reasons:
  • It adds overhead. The session starts the user's services, such as gpg-agent etc. There is no reason to do that.
  • It potentially interferes with Slurm's cgroup management (I have no evidence that this in fact happens), as both Slurm and systemd isolate the user's processes in cgroups, essentially nesting them.
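Both effects are easy to observe on a node while a batch job is running; a couple of read-only commands to check, assuming a stock Debian node with systemd-logind (slice and scope names will vary with the systemd version and Slurm's cgroup configuration):

```sh
# Run on the compute node while a batch job is active. Per the above, expect an
# extra logind session for the submitting user, and the job's processes nested
# under user.slice in addition to Slurm's own cgroup hierarchy.
loginctl list-sessions
systemd-cgls /user.slice
```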

The analysis above is not quite correct, though. The whole PAM-session-and-systemd-session business is done to "obtain the user's clean environment" by running a script on the target node, essentially boiling down to either su - <username> env or su -c env <username> (configurable with --get-user-env[=...], q. v.). And even that is not the whole truth: --get-user-env is implicit in some cases.
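For concreteness, this is roughly what the two probe variants named above look like when run as root on the target node; a sketch only, with alice as a placeholder username, and the exact command line Slurm constructs may differ:

```sh
su - alice -c env    # "login" variant: source the user's login scripts, then dump the environment
su alice -c env      # "short" variant: dump the environment of a plain non-login shell
```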

Another option, --export, controls which environment variables are inherited from sbatch by the batch script. The documentation on it is also ambiguous, but the behavior has not changed in the most recent Slurm release (20.11.1-1), so it's apparently by design. The documentation is:

  • correct for the case when the --export switch is omitted, or given as --export=ALL or plain --export (all three are semantically equivalent): --get-user-env is respected. Omitting --get-user-env then causes the whole batch environment to be copied from that of the sbatch command. This is the only case in which omitting --get-user-env does not result in running the su probe.
  • incorrect about the case --export=NONE, quoth "--get-user-env will be ignored." This is not true: if --get-user-env[=...] is not specified, the behavior is the same as if --get-user-env had been specified, falling back to the default timeout and method of obtaining the "clean" environment.
  • silent about the remaining cases, --export=[ALL,]<variable>[=<value>][,...], where the actual behavior is identical to the --export=NONE case (all three cases are illustrated by the sbatch invocations below).
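To make the three cases concrete, here is how they play out as sbatch invocations, as observed with 20.11.1-1 (script.sh is a placeholder):

```sh
# Environment copied verbatim from the sbatch command; no su probe runs unless
# --get-user-env is given explicitly:
sbatch script.sh
sbatch --export=ALL script.sh          # same as omitting --export

# The su probe runs even though the manual says --get-user-env is ignored here:
sbatch --export=NONE script.sh

# Undocumented; behaves like --export=NONE with respect to the probe:
sbatch --export=ALL,FOO=bar script.sh
sbatch --export=FOO=bar script.sh
```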

All this is evident from the code fragment in sbatch. Also notable is the behavior of the --export-file=... option, which has a lower precedence w.r.t. the environment setting and is of little practical use (trumped either by the complete current environment in the default case of --export=ALL, or by the "clean" environment obtained on the node with the su trick).

It almost reads as if the condition opt.get_user_env_time >= 0 were a bug that should have been opt.get_user_env_time > 0, which would at least make the false documentation statements true; but the condition has now been duplicated in the scrontab command and in the REST API, both new in version 20, which implies the behavior is intended and the manual is incorrect and incomplete.

The environment variable SLURM_GET_USER_ENV is set to 1, cannot be overridden, and is checked as the sole condition that triggers the whole "obtain a clean environment" conundrum in slurmctld (the actual su ... env invocation happens in the env_array_user_default function in the file env.c).

(Parenthetically, why this single bit of information is sent over the wire not as a flag bit in the RPC, as is normally done in many other cases, but rather by stuffing a magic string into the job's environment, is beyond my understanding. Although not on a performance-critical path, this method increases the RPC message size for no reason: if need be, the environment variable could have been added on the slurmctld side with an identical end result.)

Options:

  • Do nothing, do not specify --export, and pass the whole environment to the job over the wire. This is suboptimal, as the whole environment can be on the order of tens of kilobytes. PAM can also be configured on compute nodes to avoid creating the systemd session: su -l uses the PAM service id su-l, configured by default in Debian in /etc/pam.d/su-l (see the sketch after this list).
  • Patch the setting of the variable out of sbatch's code. Since we are already patching sbatch to increase the polling frequency, and this code is very unlikely to change, this is probably the preferred solution.
  • Remove all environment variables, except those the user wants to export explicitly, from the environment in the Kaldi slurm.pl driver before invoking sbatch (also sketched after this list). This is a more general solution, suitable for users who want to use the slurm.pl driver in generic Slurm environments and who do not compile their own sbatch. However, this optimization likely loses its value and is hardly worth the effort, given the new hardcoded "steady-state delay of 32s between queries": some Kaldi jobs take mere seconds to complete, and are only a few minutes long on average.
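A sketch of the PAM tweak mentioned in the first option, assuming a Debian node where libpam-systemd hooks pam_systemd.so into the common-session stack: point the su-l service at the non-interactive session stack (or drop pam_systemd.so from it explicitly), so that su -l no longer registers a logind session. This is not the verbatim stock Debian file; keep the other stock lines and change only the session include.

```
# /etc/pam.d/su-l (sketch)
auth      sufficient  pam_rootok.so
@include common-auth
@include common-account
# common-session pulls in pam_systemd.so, which asks logind to create the user
# session; the non-interactive stack does not.
@include common-session-noninteractive
```

And a sketch of the environment-stripping approach from the last option: invoke sbatch through env -i with an explicit whitelist (the variables below are placeholders; slurm.pl would do the equivalent from Perl by pruning %ENV before the exec):

```sh
env -i HOME="$HOME" USER="$USER" PATH="/usr/bin:/bin" \
    sbatch --export=ALL script.sh
```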
kkm000 added the P3 Non-blocking, easy workaround exists and design Discussion of a new feature. labels Dec 10, 2020
kkm000 added this to the 0.6beta milestone Dec 10, 2020
kkm000 self-assigned this Dec 10, 2020
kkm000 changed the title from "Do not run a PAM session for batch job on compte node" to "Do not run a PAM session for batch job on compute node" Dec 10, 2020
kkm000 commented Dec 13, 2020

/cc @burrmill/core
