
Cluster kube-scheduler scheduling is unbalanced, causing pods to hang and fail to run even though there are currently idle nodes #125503

Closed
Giant-goose opened this issue Jun 14, 2024 · 4 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
kind/support: Categorizes issue or PR as a support question.
needs-sig: Indicates an issue or PR lacks a `sig/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@Giant-goose

What happened?

I have 10 nodes in my cluster and rely on the default scheduling mechanism; I have not configured any manual scheduling for my workloads (no node taints, no label or affinity rules, and no resource requests or limits on the pods). One of my pods uses about 50% of its node's memory after running for a while, and together with the other pods on that node the total memory usage exceeds 90%. Two problems follow from this. First, when I restart that pod, it is scheduled back onto the same node even though other nodes are at only about 30% memory usage. Second, the scheduler seems to strongly prefer this node: newly started pods are also assigned to this 90%-loaded machine, so those pods can only hang and never run, which brings the business to a standstill. Why is this?

Off topic: I understand that I could work around this with manual scheduling (affinity, taints, and so on), but according to the official documentation the default scheduler simply scores the nodes, and I could not find any additional mechanism that would explain this behavior. I am very curious about this, and I have seen that many people in the Kubernetes community have run into the same problem.
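
For reference, this is roughly the configuration that is currently missing from my pods: a minimal sketch of a pod spec with explicit requests and limits (the name, image, and sizes below are placeholders, not values from my cluster). As far as I understand the documentation, the scheduler's resource scoring looks at these requested values, not at the node's live memory usage.

apiVersion: v1
kind: Pod
metadata:
  name: example-app            # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      resources:
        requests:
          memory: "512Mi"      # the value the scheduler accounts for
          cpu: "250m"
        limits:
          memory: "1Gi"
          cpu: "500m"

Without such requests, the scheduler tracks essentially no memory consumption for these pods, so from its point of view the 90%-loaded node and the 30%-loaded node can look almost equally free; the actual usage reported by the kubelet does not enter the score.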

What did you expect to happen?

I would like to understand two things about the behavior above:

  1. Regarding node scoring: even though this node's memory load is above 90%, why is its score still higher than that of the other nodes, or is the pod placed on this machine for some other reason?
  2. Why do subsequent pods keep landing on this node when other nodes with exactly the same hardware and configuration are available? Is there some kind of data affinity, where a pod prefers the node that previously held its data in order to reduce response time and performance overhead? (A sketch of the default scoring configuration follows this list.)
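
For reference, this is my understanding of what "scoring" means here: a minimal sketch of a kube-scheduler configuration that spells out the default request-based scoring strategy (I have not customized any of this; the weights shown are just the defaults).

# Sketch of the default request-based scoring, as I understand it.
# NodeResourcesFit with the LeastAllocated strategy scores each node by how much
# of its allocatable CPU and memory would remain unrequested, roughly
#   score = (allocatable - sum of requests) / allocatable
# averaged over the listed resources. Actual memory usage on the node is not
# part of this calculation.
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: LeastAllocated   # the default strategy
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1

If no pod declares requests, the "sum of requests" term stays close to zero on every node, which would explain why the heavily loaded node can still score as well as, or better than, the idle ones.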

How can we reproduce it (as minimally and precisely as possible)?

This is a long-standing problem that we keep hitting in practice.

Anything else we need to know?


Kubernetes version

$ kubectl version
v1.26.1

Cloud provider

AWS and Tencent Cloud Kubernetes clusters

OS version

CentOS 7

Install tools

Container runtime (CRI) and version (if applicable)

kubeadm

Related plugins (CNI, CSI, ...) and versions (if applicable)

@Giant-goose added the kind/bug label on Jun 14, 2024
@k8s-ci-robot
Contributor

There are no sig labels on this issue. Please add an appropriate label by using one of the following commands:

  • /sig <group-name>
  • /wg <group-name>
  • /committee <group-name>

Please see the group list for a listing of the SIGs, working groups, and committees available.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-sig label on Jun 14, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Jun 14, 2024
@neolit123
Member

1.26 is no longer in support. Also, this is better asked on the support channels.

please see:
https://git.k8s.io/kubernetes/SUPPORT.md

/kind support
/close

@k8s-ci-robot added the kind/support label on Jun 14, 2024
@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

1.26 is no longer in support. Also, this is better asked on the support channels.

please see:
https://git.k8s.io/kubernetes/SUPPORT.md

/kind support
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
