
Loss of topic data at Minikube restart #89

Closed
solsson opened this issue Nov 5, 2017 · 12 comments

solsson commented Nov 5, 2017

After minikube stop && minikube start the topics are gone. Don't know if it's always been like this. Running v0.23.0 with --kubernetes-version=v1.8.0.


pavel-agarkov commented Jan 20, 2018

I have to recreate all the topics from scratch every morning (T_T)
Do you have any updates on this issue?
minikube version v0.24.1 still has the problem.
Will a k8s update help, or do we have to wait for the next minikube release?


solsson commented Jan 21, 2018

Do you have any updates on this issue?

Actually I've sort of abandoned minikube. They're way too slow with new releases, still on a .0 release for 1.8.x. When it lags behind our production environment (GKE) it isn't much use. I run testing in https://github.com/Yolean/kubeadm-vagrant these days.

minikube version v0.24.1 still has the problem.
Will a k8s update help, or do we have to wait for the next minikube release?

Hence I don't know. With kubeadm being easy to use and K8s coming with Docker for Mac (#126), any stabilization work on minikube would have to come from contributions.


pavel-agarkov commented Dec 1, 2018

Hi! Looks like this issue is reproducible in GKE.
When I use preemptible nodes for the kafka cluster (it's just a test cluster), the nodes are killed every day and kafka loses all the topics on each restart.
But the data is not gone: if I recreate a topic (kafka-topics.sh --create) it magically has all the old data.
Also, if I restart the broker pod, some topics are resurrected, but not all of them, so I have to recreate the rest manually.
It's good that the data is not lost, but still...
The only change I made to the repository was setting the zookeeper StorageClass to the pd-standard type, since I don't have SSDs in my test cluster.


solsson commented Dec 1, 2018

@pavel-agarkov I assume this is with v5.0.0? We've had an instance of this before, when the zookeeper config somehow didn't reflect the mount path. I will try to reproduce the issue myself.

@pavel-agarkov

Yes, it is v5, and k8s is v1.10.7-gke.11.
Let me know if you need any data/logs to help with reproduction.

@pavel-agarkov

Also, I forgot to mention that I applied your single-node configuration on top of the master branch.
Here is my fork: https://github.com/Midnight-Lizard/kubernetes-kafka


solsson commented Dec 1, 2018

I've reproduced it on GKE like so:

kubectl get nodes -o name | xargs -I '{}' kubectl cordon '{}'
kubectl -n kafka delete pod -l app=zookeeper
# interestingly at this point tests still produce and consume data
kubectl -n kafka delete pod -l app=kafka
# the kafkacat-based test is now crashlooping and produce-consume is 2/3 ready
kubectl get nodes -o name | xargs -I '{}' kubectl uncordon '{}'

The kafkacat test has the clearest output, "Unknown topic":

%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-2.broker.kafka.svc.cluster.local:9092/2]: kafka-2.broker.kafka.svc.cluster.local:9092/2: Broker changed state APIVERSION_QUERY -> UP
%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state INIT -> CONNECT
%7|1543686240.703|CONNECT|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Connected to ipv4#10.36.0.11:9092
%7|1543686240.703|CONNECTED|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Connected (#1)
%7|1543686240.703|FEATURE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1543686240.704|FEATURE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Updated enabled protocol features to MsgVer1,ApiVersion,BrokerBalancedConsumer,ThrottleTime,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4,OffsetTime,MsgVer2
%7|1543686240.704|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state APIVERSION_QUERY -> UP
% Fatal error at produce:174:
% ERROR: Failed to produce message (55 bytes): Local: Unknown topic
9k44m:Test kafkacat-9k44m@2018-12-01T17:42:40,171384922+00:00
9k44m:Test kafkacat-9k44m@2018-12-01T17:42:50,167799970+00:00

The java based consumer keeps saying:

[2018-12-01 17:45:49,955] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.3.4:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-01 17:45:50,995] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.6:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-01 17:45:54,101] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 2 (kafka-2.broker.kafka.svc.cluster.local/10.36.0.7:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

It's a different error this time, I think. According to the logs, we store data under the mounted path.

[2018-12-01 17:08:57,001] INFO Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/lib/zookeeper/log/version-2 snapdir /var/lib/zookeeper/data/version-2 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-12-01 17:08:57,001] INFO FOLLOWING - LEADER ELECTION TOOK - 9892 (org.apache.zookeeper.server.quorum.Learner)
[2018-12-01 17:08:57,008] INFO Resolved hostname: pzoo-2.pzoo to address: pzoo-2.pzoo/10.36.0.14 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2018-12-01 17:08:57,022] INFO Getting a snapshot from leader 0x200000000 (org.apache.zookeeper.server.quorum.Learner)
[2018-12-01 17:08:57,026] INFO Snapshotting: 0x200000000 to /var/lib/zookeeper/data/version-2/snapshot.200000000 (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
$ kubectl -n kafka exec zoo-0 df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdc          999320    2588    980348   1% /var/lib/zookeeper/data

@pavel-agarkov

So how can it be fixed? Maybe by adding some readinessProbe to check the mounted volumes?
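For reference, a probe along those lines might look roughly like this. This is only a sketch and not taken from this repo (the mount path matches the logs above, but the probe command and timings are assumptions); it checks that the mounted data directory is writable:

```yaml
# Hypothetical readinessProbe for the zookeeper container:
# fail readiness if the mounted data dir cannot be written to.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - touch /var/lib/zookeeper/data/.probe && rm /var/lib/zookeeper/data/.probe
  initialDelaySeconds: 10
  periodSeconds: 30
```

Note that a probe like this would not have caught this particular bug, since here the mount itself was healthy.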


solsson commented Dec 1, 2018

I've tried the two things that changed apart from the new version + Java 11: changed back to OrderedReady and doubled the memory limit. No difference.


solsson commented Dec 1, 2018

So how can it be fixed? Maybe by adding some readinessProbe to check the mounted volumes?

I don't think the mount is the problem. I'm trying to figure out how zk saves state. kubectl -n kafka exec pzoo-2 -- ls -lR /var/lib/zookeeper/data has no content that changes when I create a topic.

Edit: pzoo-0 had some snapshot files. The existence of lost+found indicates that this is a mounted volume.

$ kubectl -n kafka exec zoo-0 -- ls -lR /var/lib/zookeeper/data
/var/lib/zookeeper/data:
total 24
drwx------ 2 root root 16384 Dec  1 18:05 lost+found
-rw-r--r-- 1 root root     2 Dec  1 18:12 myid
drwxr-xr-x 2 root root  4096 Dec  1 18:14 version-2

/var/lib/zookeeper/data/lost+found:
total 0

/var/lib/zookeeper/data/version-2:
total 16
-rw-r--r-- 1 root root   1 Dec  1 18:14 acceptedEpoch
-rw-r--r-- 1 root root   1 Dec  1 18:14 currentEpoch
-rw-r--r-- 1 root root 296 Dec  1 18:07 snapshot.100000000
-rw-r--r-- 1 root root 308 Dec  1 18:14 snapshot.200000001


solsson commented Dec 1, 2018

Based on https://zookeeper.apache.org/doc/r3.4.13/zookeeperAdmin.html#sc_dataFileManagement I think there should be log files too.


solsson commented Dec 1, 2018

Found the issue, probably introduced in ccb9e5d. Fix coming up.

solsson added a commit that referenced this issue Dec 2, 2018
Fixes #89: "logs", which are actually data, would end up outside the mount.

Zookeeper's startup logs are clearer than the property file entries:
INFO Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/lib/zookeeper/log/version-2 snapdir /var/lib/zookeeper/data/version-2
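The two ZooKeeper properties involved are dataDir (snapshots) and dataLogDir (the transaction log). As a sketch only (the actual fix is the referenced commit, not this fragment), a configuration consistent with the datadir/snapdir paths in the log line above would be:

```
# zoo.cfg (sketch): both directories must be backed by persistent volumes,
# otherwise snapshots survive a restart but the transaction log does not.
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
```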
@solsson solsson closed this as completed in b88b319 Dec 2, 2018