How to configure a multi-cluster replicated control planes to failover?

tapaskapadia · October 7, 2019, 9:09pm

Using the multi-cluster replicated control planes model, if I have the exact same service in cluster A and cluster B, is it possible to configure it to always go to the local instance (httpbin.foo.svc.cluster.local running in cluster A) when possible and, if that fails or is not healthy, to use the version in the second cluster (httpbin.foo.global running in cluster B) using outlier detection?

I have read about and tried to configure the locality load balancing, but I have yet to be successful in setting this up, and I have not found any documentation on how this can be achieved in a mesh federation configuration.

If someone can provide any pointers, that would be amazing!

ericvn · October 8, 2019, 1:25pm

Hello. What version of Istio are you using? I know there were some issues opened with the LLB not working in some instances. I believe 1.3.x has issues. I also had some issues with 1.2.6 trying to replicate some things I had working in earlier 1.2.x versions. It may also be that design changes require something that I’m not doing (and may not be adequately documented).

I may build and test with the master branch to see if things are fixed and may need to be cherry-picked back.

tapaskapadia · October 8, 2019, 2:42pm

I tried with 1.2.6 and 1.3.1. Can you expand on what you had working on earlier 1.2.x versions?

If I create a destination rule with outlier detection and a service entry, such as the ones below, is there additional configuration needed to tell it to go to the .global service that is running on the other cluster when the service is unavailable locally? I have modified service entries and destination rules in many ways in other attempts, but I am still not sure what configuration is needed to make it go to the external cluster only when the service is unavailable in the local cluster. I am trying to access them by running curl commands in a sleep pod in the following manner. Accessing them individually and explicitly works perfectly.

curl -I httpbin.bar:8000/status/200

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: bar
  namespace: bar
spec:
  host: "httpbin.bar.svc.cluster.local"
  trafficPolicy:
    outlierDetection:
      consecutiveErrors: 1
      baseEjectionTime: 2m
      interval: 1m


apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: httpbin-bar
spec:
  addresses:
  - 240.0.0.2
  endpoints:
  - address: 135.X.X.X
    ports:
      http1: 32618
  hosts:
  - httpbin.bar.global
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 8000
    protocol: http
  resolution: DNS

ericvn · October 8, 2019, 4:19pm

For 1.2, I had explicitly enabled locality load balancing (--set global.localityLbSetting.enabled="true" added to the example text). (In 1.3 it is enabled by default, since a lack of outlier detection effectively disables it.)
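
For anyone who keeps Helm settings in a values file rather than passing --set flags, the same switch might look like the fragment below. This is a minimal sketch that assumes the 1.2.x Helm chart reads the key exactly as the --set path above spells it; the file name values-llb.yaml is hypothetical.

```yaml
# values-llb.yaml (hypothetical file name)
# Equivalent of: --set global.localityLbSetting.enabled="true"
global:
  localityLbSetting:
    enabled: true
```

It could then be passed to helm template with -f values-llb.yaml in place of the --set flag.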

I’ll try again on the new 1.2.7 to verify.

ericvn · October 8, 2019, 7:47pm

So here’s what I did to show this works on 1.2.7. Apologies if I have some typos. I have a todo to write this up as a blog entry. I am using Istio 1.2.7 and I have two IKS clusters each in a different datacenter. If you do a k get nodes -o yaml | grep failure in each cluster you should see that region or zone is different, else this won’t work:

> k get nodes --context=istio-2 -o yaml | grep failure | grep zone
      failure-domain.beta.kubernetes.io/zone: wdc06
      failure-domain.beta.kubernetes.io/zone: wdc06
      failure-domain.beta.kubernetes.io/zone: wdc06
> k get nodes --context=istio-1 -o yaml | grep failure | grep zone
      failure-domain.beta.kubernetes.io/zone: wdc04
      failure-domain.beta.kubernetes.io/zone: wdc04
      failure-domain.beta.kubernetes.io/zone: wdc04

Now I ran through the shared control plane example at https://istio.io/docs/setup/install/multicluster/shared-gateways/ with one BIG exception. Within the two blocks to create the istio-*auth.yaml files you need to add a line to enable LLB (as noted in earlier remark). An example for the first one (second to the last line):

helm template --name=istio --namespace=istio-system \
  --set global.mtls.enabled=true \
  --set security.selfSigned=false \
  --set global.controlPlaneSecurityEnabled=true \
  --set global.proxy.accessLogFile="/dev/stdout" \
  --set global.meshExpansion.enabled=true \
  --set 'global.meshNetworks.network1.endpoints[0].fromRegistry'=Kubernetes \
  --set 'global.meshNetworks.network1.gateways[0].address'=0.0.0.0 \
  --set 'global.meshNetworks.network1.gateways[0].port'=443 \
  --set gateways.istio-ingressgateway.env.ISTIO_META_NETWORK="network1" \
  --set global.network="network1" \
  --set 'global.meshNetworks.network2.endpoints[0].fromRegistry'=n2-k8s-config \
  --set 'global.meshNetworks.network2.gateways[0].address'=0.0.0.0 \
  --set 'global.meshNetworks.network2.gateways[0].port'=443 \
  --set global.localityLbSetting.enabled="true" \
  install/kubernetes/helm/istio > istio-auth.yaml

Now run through the example, and if you hit sleep in cluster1, you should see it go to helloworld in cluster1 and cluster2, just as the example is written.

Now for the LLB part. To make traffic stay in cluster, you need an outlier so I used this:

cat helloworld-destination-rule-outlier.yaml                                                                                                                                                 

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: helloworld-outlier-detection
spec:
  host: helloworld.sample.svc.cluster.local
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    outlierDetection:
      consecutiveErrors: 7
      interval: 30s
      baseEjectionTime: 30s

Apply this in cluster 1: kubectl apply -f helloworld-destination-rule-outlier.yaml -n sample --context=$CTX_CLUSTER1

Now, do the same execs against sleep in cluster1 and you should see that only helloworld in cluster1 is called. Traffic will stay in the region. If you scale down helloworld in cluster1 to zero, traffic will fail over to cluster2. When cluster1’s helloworld is scaled back up, traffic will revert to cluster1.

Hopefully that will get you on your way.

tapaskapadia · October 9, 2019, 3:40pm

Thank you for sharing this. I was really hoping to see if it is possible using the replicated control plane model https://istio.io/docs/setup/install/multicluster/gateways/ rather than the shared control plane model you shared https://istio.io/docs/setup/install/multicluster/shared-gateways/. With 1-2 clusters I believe it’s okay to use a shared control plane, but isn’t there a concern, once you start connecting many clusters, about having a single point of failure in the cluster with the control plane? I’m really curious how others are deciding to configure and architect multi-cluster service meshes with Istio.

ericvn · October 10, 2019, 2:25pm

I can work on documenting an example using the gateway example as well (but it will be a little while). I think the main issue I ran into when experimenting earlier was using the global vs local service names. Maybe the example with moving reviews-v3 to cluster #2 could be adapted.

tapaskapadia · October 10, 2019, 3:35pm

That would be amazing, and I, as well as others, would really appreciate it if you do! I have yet to see a successful example due to the global vs local service names, and it seems like it would be a fairly common use case.

Camille_Rodriguez · November 1, 2019, 3:24pm

I am trying to do the same thing! Since your last comment, did you successfully set this up @tapaskapadia?

tapaskapadia · November 1, 2019, 7:59pm

I was not able to successfully set this up. I think one of the issues I had in my setup was I could not reasonably create a “single mesh”. The docs state:

I could not reasonably replicate all services / namespaces everywhere; however, I may be understanding that incorrectly.

Also, as a heads up, I ran into this related issue as well:

github.com/istio/istio

Deploy the same version service in multiple clusters

opened 10:03AM - 15 Apr 19 UTC
closed 05:49AM - 21 Jul 20 UTC
yuchen9459
area/networking lifecycle/stale lifecycle/automatically-closed

**Describe the feature request**
Deploy the same version service in multiple clusters using gateway connected feature and make sure they can be accessed

**Describe alternatives you've considered**
In the [Version Routing in a Multicluster Service Mesh tutorial ](https://istio.io/blog/2019/multicluster-version-routing/), it deploys two version of review services in the second cluster. I created three Kubernetes clusters and deploy review-v1 in cluster1, review-v2 and review-v3 in cluster2 and cluster 3.

![autodraw 4_15_2019](https://user-images.githubusercontent.com/8820341/56124135-4468a880-5f76-11e9-89bd-04f9ac2b6c4b.png)

I wonder how can I deploy the same version service like review-v2 in two clusters and make sure both of them can be accessed. Currently, I have tried to add some lines to the deployment files:

```
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: reviews-default
spec:
  hosts:
  - reviews.default.global
  location: MESH_INTERNAL
  ports:
  - name: http1
    number: 9080
    protocol: http
  resolution: DNS
  addresses:
  - 127.255.0.3
  endpoints:
  - address: 172.17.0.3
    labels:
      cluster: cluster2
    ports:
      http1: 30417 # Do not change this port value
  - address: 172.17.0.4
    labels:
      cluster: cluster3
    ports:
      http1: 31940
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: reviews-global
spec:
  host: reviews.default.global
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
  subsets:
  - name: v2
    labels:
      cluster: cluster3
      cluster: cluster2
  - name: v3
    labels:
      cluster: cluster2
      cluster: cluster3
```

As you can see from the yaml file, I added two endpoints for the ServiceEntry review-default. Since one subsets cannot have two labels, the second label in the configuration file will overwrite the first one. After deployment, v2 subset's corresponding label is cluster2, and v3 subset's corresponding label is cluster3.
Therefore, after the deployment, all requests for a certain service will be directed to a cluster instead of two-clusters in a certain order (such as round-robin).

If you end up figuring out more, I’m still interested, but right now I am not actively trying to make Istio work.

Camille_Rodriguez · November 4, 2019, 3:58pm

Ok thank you for the heads up! I am currently trying to understand how to configure the service entry for it to include both hosts (.local and .global instances). I really cannot find an example anywhere. I will let you know if I make any progress.

anilcs0405 · November 4, 2019, 4:25pm

@Camille_Rodriguez
Did you try this link? https://preliminary.istio.io/blog/2019/multicluster-version-routing/

tapaskapadia · November 4, 2019, 6:01pm

@anilcs0405

That was a really helpful wiki that I looked at to load balance between the different versions across local and global services with destination rules, but it opened the door to the issue I linked above, which I could not figure out. The failover, as well as being able to access multiple instances of the same version in different clusters, tripped me up.

pramodrj07 · November 7, 2019, 2:56am

Were you successful, tapas? I am in exactly the same situation. I need to load-balance and evict unhealthy hosts from the pool using replicated control planes.

rshriram · November 7, 2019, 3:19am

How about this:

# Sketch of a ServiceEntry spec (not a complete manifest)
serviceEntry:
  hosts:
  - foo.global
  resolution: DNS
  endpoints:
  - address: foo.cluster1.gateway.host
    locality: cluster1
  - address: foo.cluster2.gateway.host
    locality: cluster2

You then need to set up the global failover config in the mesh config https://github.com/istio/istio/blob/90315dc0bb4b695df05e0a9c67c425fffae7f575/install/kubernetes/helm/istio/values.yaml#L576 stating failover from cluster1 to cluster2 (for the control plane on cluster1, and vice versa on cluster2).
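
As a rough illustration of that failover setting, the Helm values for the control plane in cluster1 could look like the following. This is only a sketch: it assumes cluster1 and cluster2 are the region names used as localities in the service entry above, and that the chart accepts from/to region pairs under global.localityLbSetting.failover as the linked values.yaml describes.

```yaml
# Helm values for the control plane running in cluster1 (sketch).
# "cluster1" and "cluster2" are assumed region names that match the
# locality fields on the ServiceEntry endpoints.
global:
  localityLbSetting:
    enabled: true
    failover:
    - from: cluster1
      to: cluster2
# On the cluster2 control plane the pair would be reversed:
#   - from: cluster2
#     to: cluster1
```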

When you deploy this in cluster1 (assuming all the pods in cluster1 also have the k8s locality labels [failuredomain.k8s.io:…] set to cluster1), traffic from pods in cluster1 will go to foo.cluster1.gateway.host, which goes to the ingress and then comes back inside. If this is unavailable, then it goes to foo.cluster2.gateway.host.
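
Putting the pieces together, a fuller version of that service entry might look like the sketch below. Everything beyond the hosts, endpoint addresses, and localities is an assumption: the resource names, the 240.0.0.2 virtual IP, service port 8000, and gateway port 15443 are placeholders, and the outlier detection values are simply borrowed from the helloworld rule earlier in this thread. The destination rule is included because, as noted earlier in the thread, locality load balancing is effectively disabled without outlier detection.

```yaml
# Sketch only: names, IPs, and ports are assumptions, not tested config.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: foo-global                 # hypothetical name
spec:
  hosts:
  - foo.global
  location: MESH_INTERNAL
  addresses:
  - 240.0.0.2                      # assumed virtual IP for the .global host
  ports:
  - name: http1
    number: 8000                   # assumed service port
    protocol: http
  resolution: DNS
  endpoints:
  - address: foo.cluster1.gateway.host   # ingress gateway of the local cluster
    locality: cluster1
    ports:
      http1: 15443                 # assumed gateway port
  - address: foo.cluster2.gateway.host   # ingress gateway of the remote cluster
    locality: cluster2
    ports:
      http1: 15443                 # assumed gateway port
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: foo-global-outlier         # hypothetical name
spec:
  host: foo.global
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    outlierDetection:              # values copied from the helloworld example
      consecutiveErrors: 7
      interval: 30s
      baseEjectionTime: 30s
```

Clients would then call foo.global rather than the cluster-local name, so that both the local and the remote gateway endpoints sit behind the same host.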

We understand that entering the local cluster via ingress gateway is sub optimal. We are trying to optimize this and enable a new form of service entry that might let you cleanly achieve this goal.

hzxuzhonghu · November 7, 2019, 7:46am

The above configuration can be a workaround for accessing the same service across clusters. But it is awkward that traffic flows through the ingress gateway when accessing a service instance in the local cluster.

It would be better to support expanding SE endpoints: if an endpoint address is a domain name that belongs to a mesh-internal service, we would need to populate all of its endpoints. But currently, I think @rshriram provided a good workaround.

nrjpoddar · November 7, 2019, 3:32pm

I think, as it currently stands, replicated control planes don’t really support failover. @rshriram’s suggestion is interesting, with the caveats that traffic is redirected to the ingress gateway in the local cluster and that clients must use the same host name (.global instead of .<local namespace>.svc.cluster.local) to reach both the local and remote cluster, which is less than ideal. At that point you’re manually creating a shared control plane with consistent naming IMO.

johscheuer · November 28, 2019, 6:18am

Depending on how your clusters are set up (or actually how your nodes are labeled), I got this working with this setup: https://github.com/istio/istio/issues/19257. There is actually never a need to use *.global and provide an extra IP address.

If you want, I can write some documentation on how to set up transparent failover and its limitation: it will only work for clusters in different regions, not for cluster failover within the same region (unless you misuse the region label).

anuraagrijal · December 20, 2019, 7:16pm

@johscheuer Could you please explain what changes are required to set up replicated clusters to use .svc.cluster.local instead of .global? I tried changing the destination rule and service entry for my foo.svc.cluster.local (note that the foo service is present in both clusters and I want to achieve exactly what you wanted in https://github.com/istio/istio/issues/19257). Unlike in your problem, “istioctl pc ep podname” only shows one endpoint; it does not show the service entry endpoint “:15443”. Or you could update the documentation. Thank you!

johscheuer · January 24, 2020, 6:25am

Hi,

I’m currently writing a small blog post about this (which I hopefully can also share on the Istio blog) that explains the differences and how to set it up. I hope the post will be available in the next few weeks. In the meantime, here is a working code example: https://github.com/johscheuer/istio-playground/blob/master/Transparent_Multicluster.md#adjustments-for-transparent-failover
