
Loss of topic data at Minikube restart #89

Closed
solsson opened this issue Nov 5, 2017 · 12 comments

solsson commented Nov 5, 2017

After minikube stop && minikube start the topics are gone. Don't know if it's always been like this. Running v0.23.0 with --kubernetes-version=v1.8.0.


pavel-agarkov commented Jan 20, 2018

I have to recreate all the topics from scratch every morning (T_T)
Do you have any updates on this issue?
minikube version v0.24.1 still has the problem.
Will a k8s update help, or do we have to wait for the next minikube release?


solsson commented Jan 21, 2018

Do you have any updates on this issue?

Actually I've sort of abandoned minikube. They're way too slow with new releases, still on a .0 release for 1.8.x. When it lags behind our production environment (GKE) it isn't much use. I run testing in https://github.com/Yolean/kubeadm-vagrant these days.

minikube version v0.24.1 still has the problem.
Will a k8s update help, or do we have to wait for the next minikube release?

Hence I don't know. With kubeadm being easy to use and K8s coming with Docker for Mac (#126), any stabilization work on minikube would have to come from contributions.


pavel-agarkov commented Dec 1, 2018

Hi! Looks like this issue is reproducible in GKE.
When I use preemptible nodes for the kafka cluster (it's just a test cluster), the nodes are killed every day and kafka loses all the topics on each restart.
But the data is not gone: if I recreate a topic (kafka-topics.sh --create) it magically has all the old data.
Also, if I restart the broker pod, some topics are resurrected, but not all of them, so I have to recreate the rest manually.
It's good that the data is not lost, but still...
The only change I made to the repository was setting the zookeeper StorageClass to the pd-standard type, since I don't have SSDs in my test cluster.


solsson commented Dec 1, 2018

@pavel-agarkov I assume this is with v5.0.0? We've had an instance of this before, when the zookeeper config somehow didn't reflect the mount path. I will try to reproduce the issue myself.

@pavel-agarkov

Yes, it is v5, and k8s is v1.10.7-gke.11.
Let me know if you need any data/logs to help with reproduction.

@pavel-agarkov

Also, I forgot to mention that I applied your single-node configuration on top of the master branch.
Here is my fork: https://github.com/Midnight-Lizard/kubernetes-kafka


solsson commented Dec 1, 2018

I've reproduced it on GKE like so:

kubectl get nodes -o name | xargs -I '{}' kubectl cordon '{}'
kubectl -n kafka delete pod -l app=zookeeper
# interestingly at this point tests still produce and consume data
kubectl -n kafka delete pod -l app=kafka
# the kafkacat-based test is now crashlooping and produce-consume is 2/3 ready
kubectl get nodes -o name | xargs -I '{}' kubectl uncordon '{}'

The kafkacat test has the clearest output, "Unknown topic":

%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-2.broker.kafka.svc.cluster.local:9092/2]: kafka-2.broker.kafka.svc.cluster.local:9092/2: Broker changed state APIVERSION_QUERY -> UP
%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state INIT -> CONNECT
%7|1543686240.703|CONNECT|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Connected to ipv4#10.36.0.11:9092
%7|1543686240.703|CONNECTED|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Connected (#1)
%7|1543686240.703|FEATURE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Updated enabled protocol features +ApiVersion to ApiVersion
%7|1543686240.703|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state CONNECT -> APIVERSION_QUERY
%7|1543686240.704|FEATURE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Updated enabled protocol features to MsgVer1,ApiVersion,BrokerBalancedConsumer,ThrottleTime,Sasl,SaslHandshake,BrokerGroupCoordinator,LZ4,OffsetTime,MsgVer2
%7|1543686240.704|STATE|rdkafka#producer-1| [thrd:kafka-0.broker.kafka.svc.cluster.local:9092/0]: kafka-0.broker.kafka.svc.cluster.local:9092/0: Broker changed state APIVERSION_QUERY -> UP
% Fatal error at produce:174:
% ERROR: Failed to produce message (55 bytes): Local: Unknown topic
9k44m:Test kafkacat-9k44m@2018-12-01T17:42:40,171384922+00:00
9k44m:Test kafkacat-9k44m@2018-12-01T17:42:50,167799970+00:00

The java based consumer keeps saying:

[2018-12-01 17:45:49,955] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 1 (kafka-1.broker.kafka.svc.cluster.local/10.36.3.4:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-01 17:45:50,995] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 0 (kafka-0.broker.kafka.svc.cluster.local/10.36.0.6:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2018-12-01 17:45:54,101] WARN [Consumer clientId=consumer-1, groupId=console-consumer-24973] Connection to node 2 (kafka-2.broker.kafka.svc.cluster.local/10.36.0.7:9092) could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)

It's a different error this time, I think. According to the logs, we store data under the mounted path.

[2018-12-01 17:08:57,001] INFO Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/lib/zookeeper/log/version-2 snapdir /var/lib/zookeeper/data/version-2 (org.apache.zookeeper.server.ZooKeeperServer)
[2018-12-01 17:08:57,001] INFO FOLLOWING - LEADER ELECTION TOOK - 9892 (org.apache.zookeeper.server.quorum.Learner)
[2018-12-01 17:08:57,008] INFO Resolved hostname: pzoo-2.pzoo to address: pzoo-2.pzoo/10.36.0.14 (org.apache.zookeeper.server.quorum.QuorumPeer)
[2018-12-01 17:08:57,022] INFO Getting a snapshot from leader 0x200000000 (org.apache.zookeeper.server.quorum.Learner)
[2018-12-01 17:08:57,026] INFO Snapshotting: 0x200000000 to /var/lib/zookeeper/data/version-2/snapshot.200000000 (org.apache.zookeeper.server.persistence.FileTxnSnapLog)
$ kubectl -n kafka exec zoo-0 df
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/sdc          999320    2588    980348   1% /var/lib/zookeeper/data

@pavel-agarkov

So how can it be fixed? Maybe by adding some readinessProbe to check the mounted volumes?
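For reference, a probe along those lines might look roughly like this. This is only a sketch and not taken from this repo (the mount path matches the logs above, but the probe command and timings are assumptions); it checks that the mounted data directory is writable:

```yaml
# Hypothetical readinessProbe for the zookeeper container:
# fail readiness if the mounted data dir cannot be written to.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - touch /var/lib/zookeeper/data/.probe && rm /var/lib/zookeeper/data/.probe
  initialDelaySeconds: 10
  periodSeconds: 30
```

Note that a probe like this would not have caught this particular bug, since here the mount itself was healthy.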


solsson commented Dec 1, 2018

I've tried the two things that changed apart from the new version + Java 11: changed back to OrderedReady and doubled the memory limit. No difference.


solsson commented Dec 1, 2018

So how can it be fixed? Maybe by adding some readinessProbe to check the mounted volumes?

I don't think the mount is the problem. I'm trying to figure out how zk saves state. kubectl -n kafka exec pzoo-2 -- ls -lR /var/lib/zookeeper/data has no content that changes when I create a topic.

Edit: pzoo-0 had some snapshot files. The existence of lost+found indicates that this is a mounted volume.

$ kubectl -n kafka exec zoo-0 -- ls -lR /var/lib/zookeeper/data
/var/lib/zookeeper/data:
total 24
drwx------ 2 root root 16384 Dec  1 18:05 lost+found
-rw-r--r-- 1 root root     2 Dec  1 18:12 myid
drwxr-xr-x 2 root root  4096 Dec  1 18:14 version-2

/var/lib/zookeeper/data/lost+found:
total 0

/var/lib/zookeeper/data/version-2:
total 16
-rw-r--r-- 1 root root   1 Dec  1 18:14 acceptedEpoch
-rw-r--r-- 1 root root   1 Dec  1 18:14 currentEpoch
-rw-r--r-- 1 root root 296 Dec  1 18:07 snapshot.100000000
-rw-r--r-- 1 root root 308 Dec  1 18:14 snapshot.200000001


solsson commented Dec 1, 2018

Based on https://zookeeper.apache.org/doc/r3.4.13/zookeeperAdmin.html#sc_dataFileManagement I think there should be log files too.


solsson commented Dec 1, 2018

Found the issue, probably introduced in ccb9e5d. Fix coming up.

solsson added a commit that referenced this issue Dec 2, 2018
Fixes #89: "logs", which are actually data, would end up outside the mount.

Zookeeper's startup logs are clearer than the property file entries:
INFO Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /var/lib/zookeeper/log/version-2 snapdir /var/lib/zookeeper/data/version-2
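The two ZooKeeper properties involved are dataDir (snapshots) and dataLogDir (the transaction log). As a sketch only (the actual fix is the referenced commit, not this fragment), a configuration consistent with the datadir/snapdir paths in the log line above would be:

```
# zoo.cfg (sketch): both directories must be backed by persistent volumes,
# otherwise snapshots survive a restart but the transaction log does not.
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
```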
@solsson solsson closed this as completed in b88b319 Dec 2, 2018