How to configure multi-cluster replicated control planes to fail over?

Using the multi-cluster replicated control planes model, if I have the exact same service in cluster A and cluster B, is it possible to configure traffic to always go to the local instance (httpbin.foo.svc.cluster.local running in cluster A) when possible, and, if that fails or is unhealthy, to fall back to the version in the second cluster (httpbin.foo.global running in cluster B) using outlier detection?

I have read about and tried to configure locality load balancing, but I have yet to be successful in setting it up, and I have not found any documentation on how this can be achieved in a mesh federation configuration.
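For reference, this is roughly the kind of locality setting I experimented with (a sketch only; the region names below are placeholders for my clusters' failure-domain labels, and the values structure follows the locality load balancing docs):

# Helm values sketch (placeholder regions): prefer local endpoints and fail over
# from the local cluster's region to the remote cluster's region
global:
  localityLbSetting:
    enabled: true
    failover:
    - from: us-south   # region of cluster A (placeholder)
      to: us-east      # region of cluster B (placeholder)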

If someone can provide any pointers, that would be amazing!

Hello. What version of Istio are you using? I know there were some issues opened about locality load balancing (LLB) not working in some cases. I believe 1.3.x has issues. I also had some issues with 1.2.6 trying to replicate some things I had working in earlier 1.2.x versions. It may also be that design changes require something that I'm not doing (and may not be adequately documented).

I may build and test with the master branch to see whether things are fixed there and need to be cherry-picked back.

I tried with 1.2.6 and 1.3.1. Can you expand on what you had working in earlier 1.2.x versions?

If I create a destination rule with outlier detection and a service entry, such as the ones below, is there additional configuration needed to tell it to go to the .global service running on the other cluster when the service is unavailable locally? I have modified the service entries and destination rules in many ways in other attempts, but I am still not sure what configuration is needed to make traffic go to the external cluster only when the service is unavailable in the local cluster. I am trying to access the services by running curl commands from a sleep pod in the following manner. Accessing them individually and explicitly works perfectly.

curl -I httpbin.bar:8000/status/200
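and, for the remote copy, the .global name defined in the ServiceEntry below:

curl -I httpbin.bar.global:8000/status/200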

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bar
  namespace: bar
spec:
  host: "httbin.bar.svc.cluster.local"
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 1
      baseEjectionTime": 2m
      interval: 1m


apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: httpbin-bar
spec:
  addresses:
  - 240.0.0.2
  endpoints:
  - address: 135.X.X.X
    ports:
      http1: 32618
  hosts:
  - httpbin.bar.global
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 8000
    protocol: http
  resolution: DNS
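For completeness, one of the other variants I tried was attaching the same outlier-detection policy to the .global host through a second DestinationRule, roughly like this (the resource name is just illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin-bar-global
spec:
  host: httpbin.bar.global
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 1
      baseEjectionTime: 2m
      interval: 1m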

For 1.2, I had explicitly enabled locality load balancing (--set global.localityLbSetting.enabled="true" added to the example text). (In 1.3 it is enabled by default, since a lack of outlier detection effectively disables it.)

I’ll try again on the new 1.2.7 to verify.

So here's what I did to show this works on 1.2.7. Apologies if I have some typos; I have a todo to write this up as a blog entry. I am using Istio 1.2.7 and I have two IKS clusters, each in a different datacenter. If you do a k get nodes -o yaml | grep failure in each cluster, you should see that the region or zone is different; otherwise this won't work:

> k get nodes  --context=istio-2 -o yaml | grep failure | grep zone                                                                                               
      failure-domain.beta.kubernetes.io/zone: wdc06
      failure-domain.beta.kubernetes.io/zone: wdc06
      failure-domain.beta.kubernetes.io/zone: wdc06
 > k get nodes  --context=istio-1 -o yaml | grep failure | grep zone                                                                                                                            
      failure-domain.beta.kubernetes.io/zone: wdc04
      failure-domain.beta.kubernetes.io/zone: wdc04
      failure-domain.beta.kubernetes.io/zone: wdc04

Now I ran through the shared control plane example at https://istio.io/docs/setup/install/multicluster/shared-gateways/ with one BIG exception: within the two blocks that create the istio-*auth.yaml files, you need to add a line to enable LLB (as noted in the earlier remark). An example for the first one (the second-to-last line below):

helm template --name=istio --namespace=istio-system \
  --set global.mtls.enabled=true \
  --set security.selfSigned=false \
  --set global.controlPlaneSecurityEnabled=true \
  --set global.proxy.accessLogFile="/dev/stdout" \
  --set global.meshExpansion.enabled=true \
  --set 'global.meshNetworks.network1.endpoints[0].fromRegistry'=Kubernetes \
  --set 'global.meshNetworks.network1.gateways[0].address'=0.0.0.0 \
  --set 'global.meshNetworks.network1.gateways[0].port'=443 \
  --set gateways.istio-ingressgateway.env.ISTIO_META_NETWORK="network1" \
  --set global.network="network1" \
  --set 'global.meshNetworks.network2.endpoints[0].fromRegistry'=n2-k8s-config \
  --set 'global.meshNetworks.network2.gateways[0].address'=0.0.0.0 \
  --set 'global.meshNetworks.network2.gateways[0].port'=443 \
  --set global.localityLbSetting.enabled="true" \
  install/kubernetes/helm/istio > istio-auth.yaml

Now run through the example, and if you hit sleep in cluster1 you should see traffic go to helloworld in both cluster1 and cluster2, just as the example is written.
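(The exec I mean is the one from that walkthrough; from memory it is roughly the following, with the helloworld sample serving /hello on port 5000:)

kubectl exec --context=$CTX_CLUSTER1 -it -n sample -c sleep \
  $(kubectl get pod --context=$CTX_CLUSTER1 -n sample -l app=sleep \
    -o jsonpath='{.items[0].metadata.name}') -- curl helloworld.sample:5000/hello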

Now for the LLB part. To make traffic stay in the local cluster, you need outlier detection, so I used this:

cat helloworld-destination-rule-outlier.yaml                                                                                                                                                 

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: helloworld-outlier-detection
spec:
  host: helloworld.sample.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    outlierDetection:
      consecutiveErrors: 7
      interval: 30s
      baseEjectionTime: 30s

Apply this in cluster 1: kubectl apply -f helloworld-destination-rule-outlier.yaml -n sample --context=$CTX_CLUSTER1

Now, do the same execs against sleep in cluster1 and you should see that only helloworld in cluster1 is called; traffic stays in the region. If you scale the helloworld in cluster1 down to zero, traffic will fail over to cluster2. When cluster1's helloworld is scaled back up, traffic will revert to cluster1.
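(For the scale down/up I just used kubectl scale; in my setup cluster1 was running the helloworld-v1 deployment, so adjust the name to whatever you have deployed there:)

# remove the local endpoints so traffic fails over to cluster2
kubectl scale deployment helloworld-v1 --replicas=0 -n sample --context=$CTX_CLUSTER1
# re-run the execs, then restore the local instance and traffic reverts
kubectl scale deployment helloworld-v1 --replicas=1 -n sample --context=$CTX_CLUSTER1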

Hopefully that will get you on your way.

Thank you for sharing this. I was really hoping to see if it is possible using the replicated control plane model (https://istio.io/docs/setup/install/multicluster/gateways/) rather than the shared control plane model you shared (https://istio.io/docs/setup/install/multicluster/shared-gateways/). With 1-2 clusters I believe it's okay to use a shared control plane, but isn't there a concern, once you start connecting many clusters, about having a single point of failure in the cluster that hosts the control plane? I'm really curious how others are deciding to configure and architect multi-cluster service meshes with Istio.

I can work on documenting an example using the gateway example as well (but it will be a little while). I think the main issue I ran into when experimenting earlier was using the global vs. local service names. Maybe the example that moves reviews-v3 to cluster #2 could be adapted.

That would be amazing, and I, as well as others, would really appreciate it if you did! I have yet to see a successful example because of the global vs. local service name issue, and it seems like it would be a fairly common use case.