controller manager fails to update endpoints of host network services when nodeName has changed #66720

Closed
r7vme opened this issue Jul 27, 2018 · 8 comments · Fixed by #68575
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@r7vme

r7vme commented Jul 27, 2018

/kind bug

What happened:
We noticed that one of the node-exporters was down. It turned out that the service still had an "old" endpoint. After digging deeper, I found the following error in the controller manager log:

I0727 12:41:59.942556       1 endpoints_controller.go:375] Error syncing endpoints for service "kube-system/node-exporter": Endpoints "node-exporter" is invalid: [subsets[0].addresses[1].nodeName: Forbidden: Cannot change NodeName for 172.23.0.122 to worker-g5t48-5497c6d6-8rtc8, subsets[0].addresses[3].nodeName: Forbidden: Cannot change NodeName for 172.23.0.178 to worker-63nev-69566fdb54-5xc7l]

Because of this error, the controller manager cannot update the endpoints for the node-exporter service. In the endpoint list I see the right IPs for the worker nodes, but the wrong IP for the master node (that's how we detected the problem).
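For anyone triaging a similar log, here is a minimal sketch that pulls the rejected address changes out of the error message. The regular expression is written against the exact message quoted above; the function name `rejected_node_name_changes` is made up for illustration, and the message format may differ in other Kubernetes versions.

```python
import re

# Matches one per-address entry of the validation error quoted above.
# The pattern is an assumption derived from that single log line.
PATTERN = re.compile(
    r"subsets\[(\d+)\]\.addresses\[(\d+)\]\.nodeName: Forbidden: "
    r"Cannot change NodeName for (\S+) to ([^,\]\s]+)"
)


def rejected_node_name_changes(message):
    """Return (subset_index, address_index, ip, new_node_name) tuples."""
    return [
        (int(subset), int(address), ip, node)
        for subset, address, ip, node in PATTERN.findall(message)
    ]


log_line = (
    'Endpoints "node-exporter" is invalid: '
    "[subsets[0].addresses[1].nodeName: Forbidden: Cannot change NodeName "
    "for 172.23.0.122 to worker-g5t48-5497c6d6-8rtc8, "
    "subsets[0].addresses[3].nodeName: Forbidden: Cannot change NodeName "
    "for 172.23.0.178 to worker-63nev-69566fdb54-5xc7l]"
)

changes = rejected_node_name_changes(log_line)
for subset, address, ip, node in changes:
    print(f"subset {subset}, address {address}: {ip} -> {node}")
```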

Context:

  • two worker nodes were recreated with new hostnames (generated randomly), but they kept the same IP addresses
  • one master node was recreated with a new hostname and a new IP

I tried to update the hostname manually and got the same error, so it seems nodeName is not an updatable field.

subsets:
- addresses:
  - ip: 172.23.0.122 <=== IP stayed the same
    nodeName: worker-63nev-69566fdb54-k6jrc <=== still old host name, that basically breaks reconciliation.
    targetRef:
      kind: Pod
      name: node-exporter-bsl62
      namespace: kube-system
      resourceVersion: "11812304"
      uid: d60d9a17-52a6-11e8-b36e-deadbe40d32e
  - ip: 172.23.0.166 <=== still old IP
    nodeName: master-0ufqh-85fcf4ffd-rcd5t <=== still old host name
    targetRef:
      kind: Pod
      name: node-exporter-t9w84
      namespace: kube-system
      resourceVersion: "11811791"
      uid: f50bd59c-5111-11e8-8479-deadbe40d32e
  - ip: 172.23.0.178 <=== IP stayed the same
    nodeName: worker-g5t48-5497c6d6-sj22v <=== still old host name
    targetRef:
      kind: Pod
      name: node-exporter-p6njg
      namespace: kube-system
      resourceVersion: "11812640"
      uid: e7169993-52c7-11e8-b36e-deadbe40d32e
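The behavior seen in the listing above can be modeled roughly as follows. This is a simplified sketch inferred from the error message, not the actual Kubernetes validation code: the function name `node_name_update_errors` and the dictionary shape are assumptions, and the real check may treat missing node names or multiple subsets differently.

```python
def node_name_update_errors(old_addresses, new_addresses):
    """Model of the rule: an IP that already exists in the old object
    must keep its nodeName in the updated object.

    Simplified illustration only; not the real API server code.
    """
    old_node_by_ip = {a["ip"]: a.get("nodeName") for a in old_addresses}
    errors = []
    for index, addr in enumerate(new_addresses):
        old_node = old_node_by_ip.get(addr["ip"])
        new_node = addr.get("nodeName")
        # Assumption: only flag the case where both names are set and differ.
        if old_node is not None and new_node is not None and new_node != old_node:
            errors.append(
                f"addresses[{index}].nodeName: Forbidden: "
                f"Cannot change NodeName for {addr['ip']} to {new_node}"
            )
    return errors


# Stored object: the worker entry from the listing above.
old = [{"ip": "172.23.0.122", "nodeName": "worker-63nev-69566fdb54-k6jrc"}]

# Same IP, new hostname: the update is rejected.
same_ip = [{"ip": "172.23.0.122", "nodeName": "worker-g5t48-5497c6d6-8rtc8"}]
same_ip_errors = node_name_update_errors(old, same_ip)

# A different IP is not found in the old object, so nothing is flagged.
# 10.0.0.5 is a made-up address for the example.
new_ip = [{"ip": "10.0.0.5", "nodeName": "worker-g5t48-5497c6d6-8rtc8"}]
new_ip_errors = node_name_update_errors(old, new_ip)

print(same_ip_errors)
print(new_ip_errors)
```

Under this model, the update only goes through once the IP no longer appears in the stored object with the old name, which is consistent with the workaround of recreating pods so that their IPs change.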

It also affects all other endpoints. We were forced to recreate all pods (so that they get new IPs) to fix endpoints reconciliation.
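To find which addresses are stuck before recreating anything, a small check along these lines may help. It is a hedged sketch: `stale_addresses` and `node_by_ip` are hypothetical names, the Endpoints object is assumed to be loaded as a plain dictionary (for example from JSON output), and the current IP-to-node mapping has to be supplied by the caller. The sample data reuses the values from this report, with the new worker names taken from the log line; the master's new IP is not given in the report, so the master's old IP simply has no current node.

```python
def stale_addresses(endpoints, node_by_ip):
    """Return (ip, recorded_node, current_node) for every address whose
    recorded nodeName differs from the node currently holding that IP.

    current_node is None when no current node uses the IP.
    """
    stale = []
    for subset in endpoints.get("subsets", []):
        for addr in subset.get("addresses", []):
            recorded = addr.get("nodeName")
            current = node_by_ip.get(addr["ip"])
            if current != recorded:
                stale.append((addr["ip"], recorded, current))
    return stale


endpoints = {
    "subsets": [
        {
            "addresses": [
                {"ip": "172.23.0.122", "nodeName": "worker-63nev-69566fdb54-k6jrc"},
                {"ip": "172.23.0.166", "nodeName": "master-0ufqh-85fcf4ffd-rcd5t"},
                {"ip": "172.23.0.178", "nodeName": "worker-g5t48-5497c6d6-sj22v"},
            ]
        }
    ]
}

# Current hosts per IP, as implied by the controller manager error.
node_by_ip = {
    "172.23.0.122": "worker-g5t48-5497c6d6-8rtc8",
    "172.23.0.178": "worker-63nev-69566fdb54-5xc7l",
}

stale = stale_addresses(endpoints, node_by_ip)
for ip, recorded, current in stale:
    print(f"{ip}: recorded {recorded}, current {current}")
```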

What you expected to happen:
The controller manager successfully updates endpoints after a node is recreated with the same IP but a different hostname.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a cluster with 1 master, 1 worker, and a service that points to pods using host network.
  2. Shut down the k8s API service (and keep it shut down).
  3. Recreate the worker node with the same IP and a different hostname.
  4. Try to recreate the pod on the worker; the error above should appear in the controller manager log.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.10.1
  • Cloud provider or hardware configuration: baremetal
  • OS (e.g. from /etc/os-release): CoreOS 1688.5.3
  • Kernel (e.g. uname -a): 4.14.32-coreos
  • Install tools: own tool (kvm-operator)
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jul 27, 2018
@r7vme
Author

r7vme commented Jul 27, 2018

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 27, 2018
@yue9944882
Member

@r7vme #31311 what is the use case of changing node name?

@yue9944882
Member

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Jul 27, 2018
@r7vme
Author

r7vme commented Jul 30, 2018

@yue9944882

We have ephemeral, immutable nodes in our clusters, so an upgrade or any other operation causes the node to be recreated. The hostname is generated automatically, but the IP address depends on the underlying hypervisor node where the node gets scheduled. So it's possible that a node stays on the same hypervisor and uses the same IP address.

@fedebongio
Contributor

/cc @cheftako

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 31, 2018
@yue9944882
Member

/assign

@yue9944882
Member

/cc @kubernetes/sig-api-machinery-feature-requests

Do you think it's reasonable to loosen endpoints validation?

@liggitt
Member

liggitt commented Aug 10, 2018

/sig network
this API is owned by sig network

5 participants