How to configure multi-cluster replicated control planes to fail over?

Using the multi-cluster replicated control planes model, if I have the exact same service in cluster A and cluster B, is it possible to configure it to always go to the local instance (running in cluster A) when possible and, if that instance fails or is unhealthy, to use the version in the second cluster (running in cluster B) using outlier detection?

I have read about and tried to configure the locality load balancing, but I have yet to be successful in setting this up, and I have not found any documentation on how this can be achieved in a mesh federation configuration.

If someone can provide any pointers, that would be amazing!

Hello. What version of Istio are you using? I know there were some issues opened with the LLB not working in some instances. I believe 1.3.x has issues. I also had some issues with 1.2.6 trying to replicate some things I had working in earlier 1.2.x versions. It may also be that design changes require something that I’m not doing (and may not be adequately documented).

I may build and test with the master branch to see if things are fixed and may need to be cherry-picked back.

I tried with 1.2.6 and 1.3.1. Can you expand on what you had working on earlier 1.2.x versions?

If I create a destination rule with outlier detection and a service entry, such as the ones below, is there additional configuration needed to tell it to go to the .global service that is running on the other cluster when the service is unavailable locally? I have modified service entries and destination rules in many ways in other attempts, but I am still not sure what configuration is needed to make it go to the external cluster only when the service is unavailable in the local cluster. I am trying to access them by running curl commands in a sleep pod in the following manner. Accessing them individually and explicitly works perfectly.

curl -I

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bar
  namespace: bar
spec:
  host: ""
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 1
      baseEjectionTime: 2m
      interval: 1m

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: httpbin-bar
spec:
  endpoints:
  - address: 135.X.X.X
    ports:
      http1: 32618
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 8000
    protocol: http
  resolution: DNS

For 1.2, I had explicitly enabled locality load balancing (--set global.localityLbSetting.enabled="true" added to the example text). (In 1.3 it is enabled by default, since a lack of outlier detection effectively disables it.)

I’ll try again on the new 1.2.7 to verify.

So here’s what I did to show this works on 1.2.7. Apologies if I have some typos. I have a todo to write this up as a blog entry. I am using Istio 1.2.7 and I have two IKS clusters each in a different datacenter. If you do a k get nodes -o yaml | grep failure in each cluster you should see that region or zone is different, else this won’t work:

> k get nodes --context=istio-2 -o yaml | grep failure | grep zone
wdc06 wdc06 wdc06
> k get nodes --context=istio-1 -o yaml | grep failure | grep zone
wdc04 wdc04 wdc04
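If the grep output is hard to read, the same check can be sketched by printing the locality labels as columns. This assumes the nodes carry the failure-domain.beta.kubernetes.io/region and failure-domain.beta.kubernetes.io/zone labels (which is what the grep above matches) and reuses the istio-1 and istio-2 context names; show_localities is a made-up helper name, not part of any tool.

```shell
# show_localities is a hypothetical helper: pass it one or more kubeconfig
# context names and it lists each cluster's nodes with their locality labels.
show_localities() {
  for ctx in "$@"; do
    echo "== ${ctx} =="
    # -L (--label-columns) adds one output column per listed node label.
    kubectl get nodes --context="${ctx}" \
      -L failure-domain.beta.kubernetes.io/region \
      -L failure-domain.beta.kubernetes.io/zone
  done
}

# Example: show_localities istio-1 istio-2
```

The region or zone column should differ between the two clusters; if both match, locality-based failover has nothing to distinguish.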

Next I ran through the shared control plane example with one BIG exception. Within the two blocks that create the istio-*auth.yaml files, you need to add a line to enable LLB (as noted in my earlier remark). An example for the first one (the added line is second to last):

helm template --name=istio --namespace=istio-system \
  --set global.mtls.enabled=true \
  --set security.selfSigned=false \
  --set global.controlPlaneSecurityEnabled=true \
  --set global.proxy.accessLogFile="/dev/stdout" \
  --set global.meshExpansion.enabled=true \
  --set 'global.meshNetworks.network1.endpoints[0].fromRegistry'=Kubernetes \
  --set 'global.meshNetworks.network1.gateways[0].address'= \
  --set 'global.meshNetworks.network1.gateways[0].port'=443 \
  --set gateways.istio-ingressgateway.env.ISTIO_META_NETWORK="network1" \
  --set"network1" \
  --set 'global.meshNetworks.network2.endpoints[0].fromRegistry'=n2-k8s-config \
  --set 'global.meshNetworks.network2.gateways[0].address'= \
  --set 'global.meshNetworks.network2.gateways[0].port'=443 \
  --set global.localityLbSetting.enabled="true" \
  install/kubernetes/helm/istio > istio-auth.yaml

Now run through the example and if you hit sleep in cluster1, you should see it go to helloworld in cluster1 and cluster2, just as the example is written.

Now for the LLB part. To make traffic stay in the cluster, you need outlier detection, so I used this:

cat helloworld-destination-rule-outlier.yaml                                                                                                                                                 

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: helloworld-outlier-detection
spec:
  host: helloworld.sample.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    outlierDetection:
      consecutiveErrors: 7
      interval: 30s
      baseEjectionTime: 30s

Apply this in cluster 1: kubectl apply -f helloworld-destination-rule-outlier.yaml -n sample --context=$CTX_CLUSTER1

Now, do the same execs against sleep in cluster1 and you should see that only helloworld in cluster1 is called. Traffic will stay in the region. If you scale down helloworld in cluster1 to zero, traffic will fail over to cluster2. When cluster1's helloworld is scaled back up, traffic will revert to cluster1.
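To make that check repeatable, here is a rough sketch of the exec loop plus a tally of which version answered. It rests on several assumptions that are not stated in this thread: that the helloworld sample answers on helloworld.sample:5000/hello with lines like "Hello version: v1, instance: ...", that the sleep pod is labelled app=sleep with a container named sleep, and that the cluster1 deployment is called helloworld-v1. The function names are made up.

```shell
# call_helloworld N: curl the service N times (default 10) from the sleep pod
# in cluster 1. Pod label, container name, and URL are assumptions (see above).
call_helloworld() {
  n="${1:-10}"
  pod="$(kubectl get pod -n sample -l app=sleep --context="${CTX_CLUSTER1}" \
    -o jsonpath='{.items[0].metadata.name}')"
  i=0
  while [ "$i" -lt "$n" ]; do
    kubectl exec -n sample -c sleep --context="${CTX_CLUSTER1}" "${pod}" -- \
      curl -s helloworld.sample:5000/hello
    i=$((i + 1))
  done
}

# count_versions: read "Hello version: vN, instance: ..." lines on stdin and
# print one "version=count" line per version seen.
count_versions() {
  sed -n 's/^Hello version: \([^,]*\),.*/\1/p' | sort | uniq -c | awk '{print $2 "=" $1}'
}

# Example session (deployment name is an assumption):
#   call_helloworld 20 | count_versions
#   kubectl scale deployment helloworld-v1 --replicas=0 -n sample --context="${CTX_CLUSTER1}"
#   call_helloworld 20 | count_versions
```

With the outlier detection rule applied, the first tally should show only the cluster1 version; after scaling to zero, only the cluster2 version.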

Hopefully that will get you on your way.

Thank you for sharing this. I was really hoping to see if it is possible using the replicated control plane model rather than the shared control plane model you shared. With 1-2 clusters I believe it's okay to use a shared control plane, but isn't there a concern, once you start connecting many clusters, about having a single point of failure in the cluster with the control plane? I'm really curious how others are deciding to configure and architect multi-cluster service meshes with Istio.

I can work on documenting an example using the gateway example as well (but it will be a little while). I think the main issue I ran into when experimenting earlier was using the global vs local service names. Maybe the example with moving reviews-v3 to cluster #2 could be adapted.

That would be amazing, and I, as well as others, would really appreciate it if you do! I have yet to see a successful example, due to the global vs local service names, and it seems like it would be a fairly common use case.