
Cluster kube-scheduler scheduling is unbalanced, causing pods to hang and fail to run even though there are currently idle nodes #125503

Closed
Giant-goose opened this issue Jun 14, 2024 · 4 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/support: Categorizes issue or PR as a support question.
needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Giant-goose

What happened?

I have 10 nodes in my cluster and rely on the default scheduling mechanism; I have not configured any manual scheduling for my workloads (no node taints, no label or affinity rules, and no resource requests or limits on the pods). One of my pods uses about 50% of its node's memory after running for a while, and together with the other pods on that node the total memory usage exceeds 90%. Two problems follow from this. First, when I restart that pod, it is scheduled back onto the same node even though other nodes are at only about 30% memory usage. Second, the scheduler seems to strongly prefer this node: newly started pods are also assigned to this 90%-loaded machine, so those pods can only hang and never run, which brings the business to a standstill. Why is this?

Off topic: I understand that I could work around this with manual scheduling (affinity, taints, and so on), but according to the official documentation the default scheduler simply scores the nodes, and I could not find any additional mechanism that would explain this behavior. I am very curious about this, and I have seen that many people in the Kubernetes community have run into the same problem.
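
For reference, this is roughly the configuration that is currently missing from my pods: a minimal sketch of a pod spec with explicit requests and limits (the name, image, and sizes below are placeholders, not values from my cluster). As far as I understand the documentation, the scheduler's resource scoring looks at these requested values, not at the node's live memory usage.

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      resources:
        requests:
          memory: "512Mi"      # the value the scheduler accounts for
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"

Without such requests, the scheduler tracks essentially no memory consumption for these pods, so from its point of view the 90%-loaded node and the 30%-loaded node can look almost equally free; the actual usage reported by the kubelet does not enter the score.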

What did you expect to happen?

I would like to understand two things about the behavior above:

  1. Regarding node scoring: even though this node's memory load is above 90%, why is its score still higher than that of the other nodes, or is the pod placed on this machine for some other reason?
  2. Why do subsequent pods keep landing on this node when other nodes with exactly the same hardware and configuration are available? Is there some kind of data affinity, where a pod prefers the node that previously held its data in order to reduce response time and performance overhead? (A sketch of the default scoring configuration follows this list.)
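
For reference, this is my understanding of what "scoring" means here: a minimal sketch of a kube-scheduler configuration that spells out the default request-based scoring strategy (I have not customized any of this; the weights shown are just the defaults).

# Sketch of the default request-based scoring, as I understand it.
# NodeResourcesFit with the LeastAllocated strategy scores each node by how much
# of its allocatable CPU and memory would remain unrequested, roughly
#   score = (allocatable - sum of requests) / allocatable
# averaged over the listed resources. Actual memory usage on the node is not
# part of this calculation.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated   # the default strategy
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1

If no pod declares requests, the "sum of requests" term stays close to zero on every node, which would explain why the heavily loaded node can still score as well as, or better than, the idle ones.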

How can we reproduce it (as minimally and precisely as possible)?

This is a long-standing problem that we keep hitting in practice.

Anything else we need to know?


Kubernetes version

$ kubectl version
v1.26.1

Cloud provider

AWS and Tencent Cloud Kubernetes clusters

OS version

CentOS 7

Install tools

Container runtime (CRI) and version (if applicable)

kubeadm

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Giant-goose added the kind/bug label on Jun 14, 2024
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-sig label on Jun 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Jun 14, 2024
@neolit123
Member

1.26 is no longer in support. Also, this is better asked on the support channels.

please see:
https://git.k8s.io/kubernetes/SUPPORT.md

/kind support
/close

@k8s-ci-robot added the kind/support label on Jun 14, 2024
@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

1.26 is no longer in support. Also, this is better asked on the support channels.

please see:
https://git.k8s.io/kubernetes/SUPPORT.md

/kind support
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
