Slurm does not detect node loss for 120s. Regression? #9

Open
kkm000 opened this issue Apr 8, 2020 · 0 comments
kkm000 commented Apr 8, 2020

BTW: there is not a single timeout of 240s either in slurm.baseconf or in the documented defaults in man slurm.conf.
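For the record, the timeout values the running controller actually uses can be dumped straight from the daemon; this is only a sanity check, not part of the original report:

scontrol show config | grep -i timeout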


Apr 07 17:44:14 xc-control slurmctld[4676]: agent/is_node_resp: node:xc-node-n1-48 RPC:REQUEST_PING : Communication connection failure
Apr 07 17:44:17 xc-control slurmctld[4676]: agent/is_node_resp: node:xc-node-n1-48 RPC:REQUEST_PING : Communication connection failure
Apr 07 17:44:20 xc-control slurmctld[4676]: agent/is_node_resp: node:xc-node-n1-48 RPC:REQUEST_PING : Communication connection failure
Apr 07 17:44:33 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
 . . . The first message of this kind repeats many more times . . .

Then SUDDENLY

Apr 07 17:46:32 xc-control slurmctld[4676]: agent/is_node_resp: node:xc-node-n1-48 RPC:REQUEST_PING : Can't find an address, check slurm.conf
Apr 07 17:46:33 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
Apr 07 17:46:33 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf
Apr 07 17:46:34 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
Apr 07 17:46:34 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf
Apr 07 17:46:35 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
Apr 07 17:46:35 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf
Apr 07 17:46:36 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
Apr 07 17:46:36 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf
Apr 07 17:46:37 xc-control slurmctld[4676]: error: Nodes xc-node-n1-48 not responding, setting DOWN

Glad you've noticed, thanks.
But WHY 2 minutes???
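For context, and as an assumption rather than something established in this issue: the slurm.conf knob that normally bounds this window is SlurmdTimeout, i.e. how long slurmctld waits for an unresponsive slurmd before setting the node DOWN (documented default 300s). A minimal check, plus an illustrative tightening:

# What slurmctld is currently running with:
scontrol show config | grep -i SlurmdTimeout
# Hypothetical slurm.conf change for faster DOWN detection (value purely illustrative):
#   SlurmdTimeout=60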

The "can't find address for" continues for a few seconds

Apr 07 17:46:40 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf
Apr 07 17:47:00 xc-control slurmctld[4676]: error: Unable to resolve "xc-node-n1-48": Host name lookup failure
Apr 07 17:47:00 xc-control slurmctld[4676]: error: fwd_tree_thread: can't find address for host xc-node-n1-48, check slurm.conf

Then the trigger kicks in, at 17:47:11. It is registered with a 20s offset, so, working backwards, the triggering event can have occurred no later than 17:46:51 and, given Slurm's 15s trigger batching, no earlier than 17:46:36. That window barely crosses the DOWN event at 17:46:37, but it does.
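For reference, a down-node trigger with a 20-second offset is registered roughly like this; the program path is hypothetical, and the actual recovery script is not shown in this issue:

strigger --set --node --down --offset=20 --program=/usr/local/sbin/node-recovery.sh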

Apr 07 17:47:11 xc-control slurmctld[4676]: update_node: node xc-node-n1-48 reason set to: recovery
Apr 07 17:47:11 xc-control slurmctld[4676]: update_node: node xc-node-n1-48 state set to DRAINED*
Apr 07 17:47:11 xc-control slurmctld[4676]: update_node: node xc-node-n1-48 reason set to: recovery
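Those update_node lines are consistent with the recovery program issuing something along these lines (again hypothetical; the script itself is not part of this issue):

scontrol update NodeName=xc-node-n1-48 State=DRAIN Reason=recovery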
@kkm000 kkm000 created this issue from a note in 0.6beta (To do) Apr 8, 2020
@kkm000 kkm000 added the area-bug Something isn't working label Apr 8, 2020
@kkm000 kkm000 self-assigned this Apr 8, 2020
@kkm000 kkm000 added this to the 0.6beta milestone Apr 9, 2020
@kkm000 kkm000 added the P2 Serious, workaround known label Apr 9, 2020
@kkm000 kkm000 added the slurm? Possible slurm misconfiguration or bug label Oct 31, 2020