
FYI - Simple remedy system designed for use with NPD #199

Closed
negz opened this issue Aug 29, 2018 · 6 comments
Comments

@negz
Contributor

negz commented Aug 29, 2018

Hello,

I wanted to bring Draino to your attention, in case it's useful to others. Draino is a very simple 'remedy' system for permanent problems detected by the Node Problem Detector - it simply cordons and drains nodes exhibiting configurable Node Conditions.

At Planet we run a small handful of Kubernetes clusters on GCE (not GKE). We have a particular analytics workload that is really good at killing GCE persistent volumes. Without going into too much detail, we see persistent volume related processes (mkfs.ext4, mount, etc) hanging forever in uninterruptible sleep, preventing the pods wanting to consume said volumes from running. We're working with GCP to resolve this issue, but in the meantime we got tired of manually cordoning and draining affected nodes, so we wrote Draino.

Our remedy system looks like:

  1. Detect permanent node problems and set Node Conditions using the Node Problem Detector.
  2. Configure Draino to cordon and drain nodes when they exhibit the NPD's KernelDeadlock condition, or a variant of KernelDeadlock we call VolumeTaskHung.
  3. Let the Cluster Autoscaler scale down underutilised nodes, including the nodes Draino has drained.
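The decision Draino makes in step 2 can be sketched as a small predicate over a node's `status.conditions`. This is an illustrative sketch of the described behaviour, not Draino's actual code; the names `BAD_CONDITIONS` and `should_drain` are hypothetical.

```python
# Sketch of the Draino-style decision described above: given a node's
# conditions, decide whether the node should be cordoned and drained.
# BAD_CONDITIONS mirrors the conditions Draino is configured to act on.

BAD_CONDITIONS = {"KernelDeadlock", "VolumeTaskHung"}

def should_drain(node_conditions):
    """Return True if any configured condition is present with status "True".

    node_conditions: a list of dicts shaped like the entries of a
    Kubernetes Node object's `status.conditions` field.
    """
    return any(
        c["type"] in BAD_CONDITIONS and c["status"] == "True"
        for c in node_conditions
    )
```

For example, `should_drain([{"type": "KernelDeadlock", "status": "True"}])` returns `True`, while a node whose only condition is `Ready=True` is left alone.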

It's worth noting that once the Descheduler supports descheduling pods based on taints, Draino could be replaced by the Descheduler running in combination with the scheduler's TaintNodesByCondition functionality.

@andyxning
Member

@negz This is quite a good use case for NPD. I will read through what you described in detail later. Would you mind adding your use case of NPD to the use case section in the README?

@andyxning
Member

This is exactly what NPD was originally proposed to do. Because remedy systems are end-user dependent, a common remedy system is not so easily developed.

@negz
Contributor Author

negz commented Sep 4, 2018

@andyxning Thanks! I'd be happy to mention this use case in the README. Would it be too self-promotional to link to our Draino tool there? :)

@andyxning
Member

@negz No. Draino is actually a POC of a remedy system based on NPD. :)

Could you please make a PR to add the use case?

@andyxning
Member

andyxning commented Sep 5, 2018

@negz I have briefly read the Draino code. It seems quite good and is absolutely worth a use case entry for NPD. Please do not hesitate to add the Draino use case. I am willing to review it. :)

@ghost

ghost commented Aug 31, 2020

Hello, I am using Draino as a remedy for permanent problems detected by the Node Problem Detector: Draino is configured to cordon and drain nodes exhibiting the NPD's KernelDeadlock condition, or the variant of KernelDeadlock known as VolumeTaskHung.

Here is my test. I injected a fake "blocked for more than 300 seconds" kernel message with:

# echo "task docker:7 blocked for more than 300 seconds." | systemd-cat -t kernel

This triggers the rule and sets KernelDeadlock to True, but Draino does not act on it: my node is never marked unschedulable. Am I doing something wrong?
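For reference, the DockerHung reason and message in the node conditions below correspond to a permanent-condition rule along these lines in NPD's stock kernel-monitor configuration (check the config actually deployed in your cluster; the exact pattern may differ between NPD versions):

```json
{
  "type": "permanent",
  "condition": "KernelDeadlock",
  "reason": "DockerHung",
  "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
}
```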

This is my runtime environment

# kubectl get po -A |egrep  'node-problem-detector|draino'
kube-system   draino-58fc699f84-br2m2                     1/1     Running   0          17m
kube-system   node-problem-detector-smjw7                 1/1     Running   0          18m

My KernelDeadlock=True condition has been set by the triggered rule, but Draino does not seem to drain the node:

# for node in `kubectl get node |sed '1d' |awk '{print $1}'`;do kubectl describe node $node |sed -n '/Conditions/,/Ready/p' ;done
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       True    Sun, 30 Aug 2020 13:49:54 +0800   Sun, 30 Aug 2020 13:39:52 +0800   DockerHung                   task docker:7 blocked for more than 300 seconds.
  NetworkUnavailable   False   Tue, 25 Aug 2020 13:39:47 +0800   Tue, 25 Aug 2020 13:39:47 +0800   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Sun, 30 Aug 2020 13:49:54 +0800   Tue, 25 Aug 2020 13:39:10 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sun, 30 Aug 2020 13:49:54 +0800   Tue, 25 Aug 2020 13:39:10 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sun, 30 Aug 2020 13:49:54 +0800   Tue, 25 Aug 2020 13:39:10 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sun, 30 Aug 2020 13:49:54 +0800   Thu, 27 Aug 2020 07:17:14 +0800   KubeletReady                 kubelet is posting ready status

Draino isn't working: it never sets my node to unschedulable and never evicts my pods.

# kubectl get events -n kube-system | grep -E '(^LAST|draino)'
LAST SEEN   TYPE     REASON              OBJECT                            MESSAGE
<unknown>   Normal   Scheduled           pod/draino-58fc699f84-br2m2       Successfully assigned kube-system/draino-58fc699f84-br2m2 to master
18m         Normal   Pulling             pod/draino-58fc699f84-br2m2       Pulling image "planetlabs/draino:5e07e93"
18m         Normal   Pulled              pod/draino-58fc699f84-br2m2       Successfully pulled image "planetlabs/draino:5e07e93"
18m         Normal   Created             pod/draino-58fc699f84-br2m2       Created container draino
18m         Normal   Started             pod/draino-58fc699f84-br2m2       Started container draino
18m         Normal   SuccessfulCreate    replicaset/draino-58fc699f84      Created pod: draino-58fc699f84-br2m2
18m         Normal   ScalingReplicaSet   deployment/draino                 Scaled up replica set draino-58fc699f84 to 1
# kubectl get no
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   5d    v1.18.0
