zookeeper stops working after injecting istio-proxy #19280

Closed
linsun opened this issue Nov 28, 2019 · 30 comments

@linsun
Member

linsun commented Nov 28, 2019

Bug description

  1. Installed ZooKeeper via the Helm chart, following the instructions here: https://github.com/helm/charts/tree/master/incubator/zookeeper

All worked fine; I validated that each of the 3 pods within the StatefulSet is healthy and the quorum is established.

  2. Annotated the namespace for Istio auto-injection and killed each of the 3 zookeeper pods (a sketch of these commands follows the output below). The pods restart with istio-proxy injected; however, none of them stay running for long and they keep restarting:
$ k get services
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
kubernetes           ClusterIP   172.21.0.1     <none>        443/TCP                      167d
zookeeper            ClusterIP   172.21.229.8   <none>        2181/TCP                     4h42m
zookeeper-headless   ClusterIP   None           <none>        2181/TCP,3888/TCP,2888/TCP   4h42m
$ k get pods
NAME          READY   STATUS             RESTARTS   AGE
zookeeper-0   1/2     Running            62         3h32m
zookeeper-1   1/2     CrashLoopBackOff   61         3h31m
zookeeper-2   2/2     Running            62         3h30m
$ k get statefulset
NAME        READY   AGE
zookeeper   1/3     4h49m
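
For reference, enabling auto-injection and recreating the pods amounts to roughly the following (the namespace name default is taken from the output above; this is a sketch of the standard mechanism, not necessarily the exact commands used):

kubectl label namespace default istio-injection=enabled
kubectl delete pod zookeeper-0 zookeeper-1 zookeeper-2   # pods come back with the istio-proxy sidecar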

Chatted with @hzxuzhonghu briefly via the #networking channel on Slack; opening this issue to track it.

Expected behavior
zookeeper continues to work, at least in permissive mode.

Steps to reproduce the bug
see above

Version (include the output of istioctl version --remote and kubectl version and helm version if you used Helm)
$ istioctl version
client version: 1.4.0
control plane version: 1.4.0
data plane version: 1.3.2 (3 proxies), 1.4.0 (4 proxies)

How was Istio installed?
istioctl manifest apply

Environment where bug was observed (cloud vendor, OS, etc)
IBM Cloud K8s 1.14 cluster

@linsun
Member Author

linsun commented Nov 28, 2019

zookeeper pod log

2019-11-28 01:19:17,154 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer$QuorumServer@196] - Failed to resolve address: zookeeper-2.zookeeper-headless.default.svc.cluster.local
java.net.UnknownHostException: zookeeper-2.zookeeper-headless.default.svc.cluster.local: Name or service not known

Notice the pods have restarted a few dozen times.

@hzxuzhonghu
Member

/assign

@hzxuzhonghu hzxuzhonghu self-assigned this Nov 28, 2019
@linsun
Member Author

linsun commented Nov 28, 2019

Found a workaround for this... zookeeper has 3 ports: 2181/TCP, 3888/TCP, 2888/TCP

2181 is for client connections, while 3888 and 2888 are both used internally for leader election and follower communication. I went ahead and excluded 3888/2888 from inbound interception, e.g.

spec:
  serviceName: zookeeper-headless
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
      release: zookeeper
      component: server
  updateStrategy:
    type: RollingUpdate
    
  template:
    metadata:
      annotations:
        traffic.sidecar.istio.io/excludeInboundPorts: "2888,3888"

and redeployed the statefulset. After that, all my zookeeper pods are coming up fine and the quorum is established.

@dcberg
Contributor

dcberg commented Nov 28, 2019

If you exclude these ports, then when you have mTLS enabled the pods will not communicate with each other using mTLS... correct?

@hzxuzhonghu
Member

I cannot start it up with https://github.com/helm/charts/tree/master/incubator/zookeeper; the error is not related to Istio.

kubectl logs zookeeper-0  -c zookeeper
+ /config-scripts/run
mkdir: cannot create directory ‘/data/log’: Permission denied
/config-scripts/run: line 46: /data/myid: Permission denied
+ exec java -cp '/apache-zookeeper-3.5.5-bin/lib/*:/apache-zookeeper-3.5.5-bin/*jar:/conf:' -Xmx2G -Xms2G org.apache.zookeeper.server.quorum.QuorumPeerMain /conf/zoo.cfg
2019-11-28 08:46:41,471 [myid:] - INFO  [main:QuorumPeerConfig@133] - Reading configuration from: /conf/zoo.cfg
2019-11-28 08:46:41,477 [myid:] - INFO  [main:QuorumPeerConfig@385] - clientPortAddress is 0.0.0.0/0.0.0.0:2181
2019-11-28 08:46:41,478 [myid:] - INFO  [main:QuorumPeerConfig@389] - secureClientPort is not set
2019-11-28 08:46:41,491 [myid:] - ERROR [main:QuorumPeerMain@89] - Invalid config, exiting abnormally
org.apache.zookeeper.server.quorum.QuorumPeerConfig$ConfigException: Error processing /conf/zoo.cfg
	at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:154)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:113)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:82)
Caused by: java.lang.IllegalArgumentException: myid file is missing
	at org.apache.zookeeper.server.quorum.QuorumPeerConfig.checkValidity(QuorumPeerConfig.java:734)
	at org.apache.zookeeper.server.quorum.QuorumPeerConfig.setupQuorumPeerConfig(QuorumPeerConfig.java:605)
	at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:420)
	at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:150)
	... 2 more
Invalid config, exiting abnormally

But I tried with another deployment, https://blog.csdn.net/wslyk606/article/details/90720424, and it works well in my env.

My env is installed from a recent istio-1.5-alpha.3b7b43f911ab6c69bf0da9623966521563cff3f8, by applying istio-demo.yaml without mTLS.

@linsun So maybe the TLS setting is not right; can you share the config dump of your zookeeper?

@hzxuzhonghu
Member

Installed following this guide: https://istio.io/docs/tasks/security/authentication/auto-mtls/#before-you-begin

It also works for me.

@linsun
Member Author

linsun commented Nov 28, 2019

@dcberg I don't think mTLS will be an issue on the ports that are excluded; basically I'm asking Istio to do nothing with the inter-member communication within the zookeeper cluster by excluding those 2 ports.

@linsun
Member Author

linsun commented Nov 28, 2019

@hzxuzhonghu great, you were able to recreate it, maybe? I think the issue you had was caused by the configuration of the storage provider... you have to configure it when installing the Helm chart, which I configured for IKS. I used a fresh Istio 1.4.0 installation with the default profile (NO mTLS enabled).

Let me know if you need anything.

@dcberg
Contributor

dcberg commented Nov 28, 2019

@linsun I wasn't able to get my environment working with just the excludeInboundPorts annotation that you used. I had to add excludeOutboundPorts as well:

traffic.sidecar.istio.io/excludeOutboundPorts: "2888,3888"

Once I did this, establishing quorum worked for me.
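
Put together, the pod template annotations that made quorum work look roughly like this (a sketch with the same ports discussed above, not the full chart output):

  template:
    metadata:
      annotations:
        # skip sidecar interception for inbound quorum/election traffic
        traffic.sidecar.istio.io/excludeInboundPorts: "2888,3888"
        # ...and for outbound connections to the other zookeeper members
        traffic.sidecar.istio.io/excludeOutboundPorts: "2888,3888"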

@dcberg
Contributor

dcberg commented Nov 28, 2019

The question I have is: should we have to use annotations to exclude both the inbound and outbound ports used for inter-pod communication within the StatefulSet when Istio is enabled?

@hzxuzhonghu
Member

The root cause is that zookeeper listens on the pod IP only:

# kubectl exec -ti  zk-1 -c istio-proxy sh
$ netstat -pan |grep 3888
tcp6       0      0 10.244.0.95:3888        :::*                    LISTEN      - 

Ref: https://istio.io/faq/applications/#cassandra

This is really bad UX. @rshriram @howardjohn @lambdai any idea how we can solve this?
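
One way to see both sides of the mismatch from the sidecar (a rough sketch, using the pod names from this thread):

# What the application actually binds to (the pod IP only, not 0.0.0.0 or 127.0.0.1):
kubectl exec zookeeper-0 -c istio-proxy -- netstat -pan | grep 3888

# The sidecar's inbound cluster for the quorum port; with -o json you can see that
# on Istio 1.4 its endpoint is 127.0.0.1:3888, which zookeeper is not listening on:
istioctl proxy-config clusters zookeeper-0 --port 3888 --direction inbound -o json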

@linsun
Member Author

linsun commented Dec 2, 2019

@hzxuzhonghu thank you so much for looking at this! Could you elaborate on why "listens on pod ip only" causes the problem?

@mbanikazemi have you ever tried zookeeper + istio? I noticed your https://istio.io/faq/applications/#cassandra entry is only for Cassandra.

@dcberg I think users should not need to exclude these ports unless they specifically don't want Istio to intercept traffic on them. In my case, I did it simply because Istio can't handle it right now.

@hzxuzhonghu
Member

@hzxuzhonghu thank you so much for looking at this! Could you elaborate on why "listens on pod ip only" causes the problem?

The Envoy inbound cluster forwards intercepted traffic to 127.0.0.1:&lt;port&gt; for IPv4, so an application that binds only to the pod IP never sees the connection.

@linsun
Member Author

linsun commented Dec 2, 2019

Thank you @hzxuzhonghu, I see the difference now.

$ k exec -it zookeeper-0 bash -c istio-proxy
istio-proxy@zookeeper-0:/$ netstat -pan | grep 3888
tcp6       0      0 172.30.244.100:3888     :::*                    LISTEN      -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2888
tcp        0      0 172.30.244.100:39664    172.30.115.67:2888      ESTABLISHED 19/envoy            
tcp6       0      0 172.30.244.100:39662    172.30.115.67:2888      ESTABLISHED -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2181
tcp6       0      0 :::2181                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:54756         127.0.0.1:2181          TIME_WAIT   -                   
tcp6       0      0 127.0.0.1:54548         127.0.0.1:2181          TIME_WAIT   -  

@mbanikazemi
Contributor

mbanikazemi commented Dec 2, 2019

This looks like the issue we have with apps that do not listen on localhost. This can be addressed by updating one or more configuration parameters for a given app. Looking at the zookeeper docs (https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_configuration) I see:

clientPortAddress
New in 3.3.0: the address (ipv4, ipv6 or hostname) to listen for client connections; that is, the address that clients attempt to connect to. This is optional, by default we bind in such a way that any connection to the clientPort for any address/interface/nic on the server will be accepted.

looking into it.

@mbanikazemi
Contributor

mbanikazemi commented Dec 2, 2019

As noted earlier, the issue is that the servers listen on their pod IP address for the ports used for communication between servers. There is a config option, quorumListenOnAllIPs, which when set to true makes the servers listen on all addresses, and that seems to solve the problem. Note that this config option is listed among the "unsafe" options. Will follow up to see why.
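
For illustration, a minimal zoo.cfg fragment with the option enabled might look like this (server entries are placeholders matching the headless service used in this thread):

clientPort=2181
# listen on all addresses (not just the pod IP) for the election/quorum ports
quorumListenOnAllIPs=true
server.1=zookeeper-0.zookeeper-headless.default.svc.cluster.local:2888:3888
server.2=zookeeper-1.zookeeper-headless.default.svc.cluster.local:2888:3888
server.3=zookeeper-2.zookeeper-headless.default.svc.cluster.local:2888:3888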

@esnible
Contributor

esnible commented Dec 2, 2019

There is a PR to the Zookeeper helm chart that fixes this. See helm/charts#17183 and helm/charts#17258 . The PR is stuck because of a screwup with a contributor license bot.

@linsun
Member Author

linsun commented Dec 2, 2019

Works pretty well after I removed the annotation! @banix @Snible thank you so much for the suggestion of using quorumListenOnAllIPs.

Here is what I did.

  1. I had a generated zookeeper yaml based on the helm template cmd. Added the following to the zookeeper ConfigMap:

echo "quorumListenOnAllIPs=true" >> $ZK_CONFIG_FILE

  2. Removed all Istio-related annotations for excluding ports in the zookeeper yaml.

  3. Redeployed the zookeeper yaml file. Made sure all pods are deployed anew, and checked the init container to ensure the ports 2888/3888 aren't there.

  4. All zookeeper pods should be up and running. Exec into any of the zookeeper pods:

$ k exec -it zookeeper-0 bash -c istio-proxy
istio-proxy@zookeeper-0:/$ nc -v zookeeper.default.svc.cluster.local 2181
zookeeper.default.svc.cluster.local [172.21.225.83] 2181 (?) open
status
Zookeeper version: 3.5.5-390fe37ea45dee01bf87dc1c042b5e3dcce88653, built on 05/03/2019 12:07 GMT
Clients:
 /127.0.0.1:37924[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/0/0
Received: 5
Sent: 4
Connections: 1
Outstanding: 0
Zxid: 0x700000000
Mode: follower
Node count: 5

istio-proxy@zookeeper-0:/$ netstat -pan | grep 3888
tcp        0      0 127.0.0.1:45104         127.0.0.1:3888          ESTABLISHED 19/envoy            
tcp        0      0 127.0.0.1:45102         127.0.0.1:3888          ESTABLISHED 19/envoy            
tcp6       0      0 :::3888                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:3888          127.0.0.1:45104         ESTABLISHED -                   
tcp6       0      0 127.0.0.1:3888          127.0.0.1:45102         ESTABLISHED -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2888
tcp        0      0 172.30.244.96:34532     172.30.196.190:2888     ESTABLISHED 19/envoy            
tcp6       0      0 172.30.244.96:34530     172.30.196.190:2888     ESTABLISHED -                   
istio-proxy@zookeeper-0:/$ netstat -pan | grep 2181
tcp6       0      0 :::2181                 :::*                    LISTEN      -                   
tcp6       0      0 127.0.0.1:39262         127.0.0.1:2181          TIME_WAIT   -                   
tcp6       0      0 127.0.0.1:39044         127.0.0.1:2181          TIME_WAIT   - 

@mbanikazemi
Contributor

Added an entry to the FAQ: istio/istio.io#5951

@linsun
Member Author

linsun commented Dec 5, 2019

Thank you @mbanikazemi! With the FAQ in place, I will close the issue.

@linsun linsun closed this as completed Dec 5, 2019
sylvainOL added a commit to sylvainOL/zookeeper-operator that referenced this issue Dec 17, 2019
As said here (istio/istio#19280 (comment)), zookeeper doesn't listen on 0.0.0.0 by default and is therefore not "service mesh friendly".
Adding this option makes it listen on all addresses rather than only one IP address.
sylvainOL added a commit to sylvainOL/zookeeper-operator that referenced this issue Dec 17, 2019
As said in istio/istio#19280 (comment),
zookeeper doesn't listen on 0.0.0.0 by default and is therefore not "service mesh friendly".

Adding this option makes it listen on all addresses rather than only one IP address.

Signed-off-by: Sylvain Desbureaux <[email protected]>
@satyakatta

satyakatta commented Feb 4, 2020

@mbanikazemi @linsun I tried deploying zookeeper with the suggested fixes to listen on 0.0.0.0 on the host server and have also enabled quorumListenOnAllIPs=true on my zookeeper servers. I am still getting the UnknownHostException. Any idea how to debug this?

istio-proxy@tst-zk-istio-zookeeper-0:/$ nc -v tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local 2181
DNS fwd/rev mismatch: tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local != tst-zk-istio-zookeeper-1.tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local
DNS fwd/rev mismatch: tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local != tst-zk-istio-zookeeper-2.tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local
DNS fwd/rev mismatch: tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local != tst-zk-istio-zookeeper-0.tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local
tst-zk-istio-zookeeper-headless-mirror-main.tst-zk-istio.svc.cluster.local [172.25.3.233] 2181 (?) open
istio-proxy@tst-zk-istio-zookeeper-0:/$ netstat -pan | grep 3888
tcp 0 0 127.0.0.1:44388 127.0.0.1:3888 TIME_WAIT -
tcp 0 0 127.0.0.1:42206 127.0.0.1:3888 TIME_WAIT -
tcp6 0 0 :::3888 :::* LISTEN -

@kaushiksrinivas

We see that with quorumListenOnAllIPs=true things work fine. But if we scale the zookeeper cluster, say from 3 pods to 5 pods, we see issues: the zk quorum gets disrupted and never comes up until the pods are restarted.
No pod in the cluster will become a follower or a leader.
We are seeing this only on scale-up or scale-down events.
If we exclude the ports 2888 and 3888, things work fine even in scale up/down cases.

Has anybody seen such issues and a probable root cause? This is very consistent for us.

@chris530

Thanks @linsun

Adding echo "quorumListenOnAllIPs=true" >> conf/zookeeper.conf; worked for me

@diranged
Contributor

diranged commented Jan 8, 2021

Chiming in here. I am running the Strimzi operator to bring up Kafka and Zookeeper. I also found that I had the same "zookeeper is listening on the Pod IP, not 0.0.0.0, for 2888/3888" issue. While adding spec.zookeeper.config.quorumListenOnAllIPs: true did indeed make Zookeeper start listening on 0.0.0.0 for all of its listeners, we ran into a second issue. It seems that the Zookeeper-to-Zookeeper communication was having TLS issues and seeing the wrong certificates.

While I am aware that double-TLS is probably not a great idea... I am unclear on why it broke. If there is mTLS happening automatically on the outbound traffic for Zookeeper1 -> Zookeeper2, and automatic mTLS happening on the inbound traffic to Zookeeper2... then I don't understand why the applications running in each of those pods would see any difference with regards to the traffic.

2021-01-08 01:32:19,337 ERROR Failed to verify hostname: localhost (org.apache.zookeeper.common.ZKTrustManager) [0.0.0.0/0.0.0.0:3888]
javax.net.ssl.SSLPeerUnverifiedException: Certificate for <localhost> doesn't match any of the subject alternative names: [*.kafka-zookeeper-client.kafka-istio.svc, *.kafka-zookeeper-client.kafka-istio.svc.cluster.local, kafka-zookeeper-0.kafka-zookeeper-nodes.kafka-istio.svc.cluster.local, kafka-zookeeper-0.kafka-zookeeper-nodes.kafka-istio.svc, *.kafka-zookeeper-nodes.kafka-istio.svc, kafka-zookeeper-client.kafka-istio.svc, kafka-zookeeper-client.kafka-istio.svc.cluster.local, kafka-zookeeper-client, kafka-zookeeper-client.kafka-istio, *.kafka-zookeeper-nodes.kafka-istio.svc.cluster.local]
	at org.apache.zookeeper.common.ZKHostnameVerifier.matchDNSName(ZKHostnameVerifier.java:224)

To make it work, I had to go ahead and add the traffic.sidecar.istio.io/excludeInboundPorts: "2888,3888" annotation. Any idea why this was necessary?
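
A Strimzi Kafka resource combining both settings might look roughly like this (a sketch, not the exact configuration from this cluster; field names follow the Strimzi Kafka CRD, and other required sections are omitted):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: kafka
  namespace: kafka-istio
spec:
  # kafka, entityOperator, and storage sections omitted for brevity
  zookeeper:
    replicas: 3
    config:
      # make ZooKeeper bind 0.0.0.0 for the quorum/election listeners
      quorumListenOnAllIPs: true
    template:
      pod:
        metadata:
          annotations:
            # keep the quorum/election ports out of the sidecar entirely
            traffic.sidecar.istio.io/excludeInboundPorts: "2888,3888"
            traffic.sidecar.istio.io/excludeOutboundPorts: "2888,3888"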

@howardjohn
Member

@diranged this sounds like your zookeeper is calling localhost instead of one of the listed *.kafka-zookeeper-client.kafka-istio.svc.cluster.local-style addresses, so it rejects the cert (which isn't signed for localhost)? Not a zookeeper expert though, just guessing based on the hostname.

The double TLS should work.

@samene

samene commented Jul 13, 2021

I have a 3-node zookeeper (bitnami zookeeper helm chart, zk version 3.6.1) with the istio proxy. quorumListenOnAllIPs was required to be true up until Istio 1.10.0. When I upgraded my Istio control plane from 1.9.1 to 1.10.2 today and did a rollout restart of the zookeeper statefulset to upgrade the Istio data plane, it failed to start.

All pods are full of:

2021-07-13 07:36:47,004 [myid:3] - WARN  [NIOWorkerThread-1:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

But if I delete all pods at the same time instead of doing a rollout restart, then it works properly. Any ideas, anyone? Should I change quorumListenOnAllIPs back to false in Istio 1.10.2?

@howardjohn
Member

quorumListenOnAllIPs

Not a zk expert, but my understanding: this is required on 1.9 and optional on 1.10, because in 1.10 the sidecar forwards inbound traffic to the pod IP instead of loopback. See https://istio.io/latest/blog/2021/upcoming-networking-changes/ for details.

@linsun
Member Author

linsun commented Jul 19, 2021

yes, you should not set quorumListenOnAllIPs in istio 1.10. quorumListenOnAllIPs is an experimental flag from zookeeper and not recommended for production anyway.

@seema-sandeep-jaisinghani-db

seema-sandeep-jaisinghani-db commented Jan 7, 2022

@diranged Is it possible to share your Strimzi Kafka configuration? I am hitting the same mTLS issues for zookeeper when Istio is injected.

@diranged
Contributor

diranged commented Jan 7, 2022

@diranged Is it possible to share your Strimzi Kafka configuration? I am hitting the same mTLS issues for zookeeper when Istio is injected.

We gave up trying to run Strimzi-on-Istio, and instead explicitly run it off the mesh.
