The main reason I replaced Flannel with Weave Net as the Kubernetes CNI plugin is that Flannel does not support multicast.
I did not realize this until my test Confluence Data Center (DC) cluster became dysfunctional. There are two Confluence nodes in the Confluence DC cluster, each running as a pod. They work fine when they run on the same k8s node. But when they run on different k8s nodes, it is like two kings who cannot wear the same crown: they take turns being killed by the liveness probe, then start over, again and again.
From the logs, I can see Hazelcast (the underlying cluster management tool) throwing errors. The node complains that an unknown instance is modifying the database:
Clustered Confluence: Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.
====DEBUG DUMP END====
2019-11-01 00:00:50,832 WARN [Caesium-1-4] [analytics.client.listener.ProductEventListener] processEventWithTiming Processing a critical event: com.atlassian.confluence.cluster.safety.ClusterPanicAnalyticsEvent@7f215a2f
2019-11-01 00:00:50,836 ERROR [Caesium-1-4] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Received a panic event, stopping processing on the node: [Origin node: 538c39e6 listening on /10.244.1.4:5801] Clustered Confluence: Database is being updated by an instance which is not part of the current cluster. You should check network connections between cluster nodes, especially multicast traffic.
2019-11-01 00:00:50,836 WARN [Caesium-1-4] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent com.atlassian.confluence.cluster.hazelcast.HazelcastClusterInformation@7167aab3
2019-11-01 00:00:50,837 WARN [Caesium-1-4] [confluence.cluster.safety.ClusterPanicListener] onClusterPanicEvent Shutting down scheduler
2019-11-01 00:00:50,860 INFO [Caesium-1-4] [confluence.cluster.hazelcast.HazelcastClusterManager] stopCluster Shutting down the cluster
2019-11-01 00:00:50,916 WARN [Caesium-1-4] [confluence.cluster.hazelcast.HazelcastClusterSafetyManager] rateLimitHazelcastLogging Rate limiting logging caused by HazelcastInstanceNotActiveException
2019-11-01 00:00:50,921 WARN [Caesium-1-4] [confluence.cluster.hazelcast.HazelcastClusterSafetyManager] rateLimitHazelcastLogging Rate limiting logging caused by HazelcastInstanceNotActiveException
2019-11-01 00:00:50,922 WARN [Caesium-1-4] [confluence.cluster.hazelcast.HazelcastClusterSafetyManager] rateLimitHazelcastLogging Rate limiting logging caused by HazelcastInstanceNotActiveException
2019-11-01 00:00:50,922 ERROR [Caesium-1-4] [impl.schedule.caesium.JobRunnerWrapper] runJob Scheduled job ClusterSafetyJob#ClusterSafetyJob failed to run
com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
    at com.hazelcast.spi.AbstractDistributedObject.throwNotActiveException(AbstractDistributedObject.java:105)
    at com.hazelcast.spi.AbstractDistributedObject.lifecycleCheck(AbstractDistributedObject.java:100)
    at com.hazelcast.spi.AbstractDistributedObject.getNodeEngine(AbstractDistributedObject.java:94)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.unlock(MapProxyImpl.java:332)
    at com.atlassian.confluence.cluster.hazelcast.HazelcastDualLock.unlock(HazelcastDualLock.java:53)
    at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.doRunJob(JobRunnerWrapper.java:150)
    at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.lambda$runJob$0(JobRunnerWrapper.java:87)
    at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContextInternal(VCacheRequestContextManager.java:84)
    at com.atlassian.confluence.impl.vcache.VCacheRequestContextManager.doInRequestContext(VCacheRequestContextManager.java:68)
    at com.atlassian.confluence.impl.schedule.caesium.JobRunnerWrapper.runJob(JobRunnerWrapper.java:87)
    at com.atlassian.scheduler.core.JobLauncher.runJob(JobLauncher.java:134)
    at com.atlassian.scheduler.core.JobLauncher.launchAndBuildResponse(JobLauncher.java:106)
    at com.atlassian.scheduler.core.JobLauncher.launch(JobLauncher.java:90)
    at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.launchJob(CaesiumSchedulerService.java:435)
    at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeLocalJob(CaesiumSchedulerService.java:402)
    at com.atlassian.scheduler.caesium.impl.CaesiumSchedulerService.executeQueuedJob(CaesiumSchedulerService.java:380)
    at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeJob(SchedulerQueueWorker.java:66)
    at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.executeNextJob(SchedulerQueueWorker.java:60)
    at com.atlassian.scheduler.caesium.impl.SchedulerQueueWorker.run(SchedulerQueueWorker.java:35)
    at java.lang.Thread.run(Unknown Source)
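To understand the panic, it helps to know roughly how Confluence's cluster safety check behaves. The sketch below is my simplified model, not Atlassian's actual implementation, and all class and field names are invented for illustration: each node periodically compares a "safety number" stored in the shared database against the copy held in its distributed cache, and a mismatch means some instance outside its own Hazelcast cluster has been writing to the database. When multicast is broken, the two pods each form a one-node Hazelcast cluster over the same database, so they keep tripping this check for each other:

```python
import itertools


class ClusterPanicError(RuntimeError):
    pass


class Node:
    """One Confluence node. `cache` stands in for its Hazelcast cluster cache."""

    def __init__(self, name, database, cache, counter):
        self.name = name
        self.database = database  # shared dict = the shared SQL database
        self.cache = cache        # per-cluster dict = the distributed cache
        self.counter = counter    # deterministic stand-in for a random safety number

    def run_safety_job(self):
        db_value = self.database.get("safety_number")
        cache_value = self.cache.get("safety_number")
        if cache_value is not None and db_value != cache_value:
            # Something outside *our* Hazelcast cluster touched the database.
            raise ClusterPanicError(
                f"{self.name}: database is being updated by an instance "
                "which is not part of the current cluster")
        # Otherwise rotate the safety number in both places.
        new_value = next(self.counter)
        self.database["safety_number"] = new_value
        self.cache["safety_number"] = new_value


counter = itertools.count(1)
database = {}               # one shared database
cache_a, cache_b = {}, {}   # broken multicast -> two separate Hazelcast clusters
node_a = Node("node-a", database, cache_a, counter)
node_b = Node("node-b", database, cache_b, counter)

node_a.run_safety_job()     # writes 1 to the database and cache_a
node_b.run_safety_job()     # cache_b is empty, so it writes 2 over both
try:
    node_a.run_safety_job() # database says 2, cache_a says 1 -> panic
except ClusterPanicError as e:
    print("panic:", e)
```

With working multicast the two pods would share one cache, so the comparison would always pass; with the split, whichever node runs the job second panics, which matches the alternating crash-and-restart behaviour I saw.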
Since the Confluence cluster runs on Kubernetes, pod IPs are dynamic and AWS-based node discovery does not apply here, so multicast was the only option when I first set up the Confluence cluster. So the first thing that came to my mind was that multicast traffic does not pass between the k8s nodes. And I confirmed it by finding this open issue in the Flannel GitHub repository.
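You can also verify this independently of Confluence with a tiny UDP multicast probe. The sketch below runs sender and receiver in one process over loopback as a self-contained demo; to test cross-node multicast, run the receiver half in a pod on one k8s node and the sender half in a pod on another. The group and port are arbitrary examples, not Confluence's actual settings:

```python
import socket
import struct

GROUP = "224.2.2.4"   # arbitrary example multicast group
PORT = 54327          # arbitrary example port
IFACE = "127.0.0.1"   # loopback, for the single-process demo

# Receiver: bind the port and join the multicast group on the interface.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
rx.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
rx.bind(("", PORT))
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton(IFACE))
rx.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
rx.settimeout(5)

# Sender: emit one datagram to the group via the same interface.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(IFACE))
tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
tx.sendto(b"multicast-ok", (GROUP, PORT))

# If this times out, multicast datagrams are not getting through.
received, sender = rx.recvfrom(1024)
print(received.decode())
tx.close()
rx.close()
```

On a Flannel cluster, the cross-node version of this probe times out on the receiving side, which is exactly the failure mode Hazelcast is hitting.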
The fix is to replace Flannel with another CNI plugin that supports multicast; in my case, Weave Net. Here is how I replaced it. Just be aware that it requires an outage.
First, install the Weave Net CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
Second, uninstall Flannel CNI plugin
# Assume kube-flannel.yml is the config file that you used to install Flannel
kubectl delete -f kube-flannel.yml
Third, delete the network interfaces and the config file that Flannel created, on each k8s node:
ip link delete cni0
ip link delete flannel.1
rm /etc/cni/net.d/10-flannel.conflist
Fourth, restart kubelet service on each k8s node:
$ systemctl restart kubelet
Lastly, you need to kill all pods, as they still use the old pod CIDR from Flannel. In my case, I just scaled the StatefulSet down to 0, then scaled it back up.
kubectl scale --replicas=0 sts/confluence
kubectl scale --replicas=3 sts/confluence
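A quick way to tell whether a pod was recreated on the new network is its IP: Flannel's default pod CIDR is 10.244.0.0/16 (note the 10.244.1.4 address in the log above), while Weave Net allocates from 10.32.0.0/12 by default. A small sketch of that check (adjust the CIDRs if you configured custom ranges):

```python
import ipaddress

FLANNEL_CIDR = ipaddress.ip_network("10.244.0.0/16")  # Flannel's default pod network
WEAVE_CIDR = ipaddress.ip_network("10.32.0.0/12")     # Weave Net's default range


def still_on_flannel(pod_ip: str) -> bool:
    """A pod still holding an address from the old range must be recreated."""
    return ipaddress.ip_address(pod_ip) in FLANNEL_CIDR


print(still_on_flannel("10.244.1.4"))  # old Flannel address -> True
print(still_on_flannel("10.32.0.3"))   # new Weave address -> False
```

Feed it the IP column from `kubectl get pods -o wide`; any pod still reporting True has not been recreated yet.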
Now look at the three happy nodes 🙂
$ kubectl get pods -l app=confluence -o wide
NAME           READY   STATUS    RESTARTS   AGE    IP           NODE     NOMINATED NODE   READINESS GATES
confluence-0   1/1     Running   1          133m   10.32.0.3    node-2   <none>           <none>
confluence-1   1/1     Running   0          85m    10.40.0.19   node-1   <none>           <none>
confluence-2   1/1     Running   0          78m    10.32.0.5    node-2   <none>           <none>
Some interesting reading: Multi CNI and Containers with Multi Network Interfaces on Kubernetes with CNI-Genie