We run a multi-primary Istio mesh on GCP. Hardware outages on the GCP side are beyond our control, so we want a way to exclude workloads in a degraded cluster from service discovery during those outages.
Istio circuit breaking won't work here: outages like this can show many different symptoms, and it is hard to come up with a single circuit-breaking rule that covers every scenario.
Does anyone have any recommended approach for this?
Options we have found:
1. Use Istio discovery selectors to mark the namespaces in the degraded cluster.
   Drawback: this is an operation on the degraded cluster itself, and during a network partition we cannot be sure its API server is reachable.
2. Create a destination rule (DR) subset containing all regions, then exclude the troubled cluster from this subset during an outage.
   Drawback: a DR does not support exclusion, only inclusion.
   Drawback: a DR subset's label selector supports AND only. We can't create a subset covering all regions, such as "us-east1 || us-east2 || us-central1", and then remove a degraded region during an outage.
3. 100% fault injection plus circuit breaker.
   Drawback: fault injection is not available for TCP connections. It is also hard to inject faults only for workloads in one specific cluster.
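To make the AND-only limitation of DR subset labels concrete, here is a minimal Python model of the matching rule, not Istio's actual code. The label key "region", the label "app", and the function name subset_selects are hypothetical and used only for illustration.

```python
def subset_selects(subset_labels, endpoint_labels):
    """Model of DR subset matching: every subset label must match (logical AND)."""
    return all(endpoint_labels.get(key) == value
               for key, value in subset_labels.items())


# Hypothetical endpoint labels for a workload running in us-east1.
endpoint = {"region": "us-east1", "app": "web"}

# A single-region subset selects the endpoint.
print(subset_selects({"region": "us-east1"}, endpoint))

# Adding a second label narrows the match; it never widens it.
print(subset_selects({"region": "us-east1", "app": "api"}, endpoint))
```

Because a label map holds one value per key and all entries must match, a single subset cannot say "us-east1 OR us-east2", which is why one subset per region ends up being necessary.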
At this point the only thing I can think of is to create one subset per region (five subsets for five regions) and implement round robin ourselves with weighted routes in the virtual service.
During an outage we would remove the troubled subset from the virtual service and redistribute its routing percentage across the remaining subsets.
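The reweighting step could be scripted roughly as follows. This is only a sketch under assumptions: the subset names, the host name, and the helpers rebalance_weights and build_routes are hypothetical, and the weights are kept as integers summing to 100 to stay on the safe side across Istio versions. The output mirrors the destination/weight entries of a virtual service route and would still need to be applied to the cluster by whatever tooling you use.

```python
def rebalance_weights(subsets, degraded=()):
    """Spread 100 weight points evenly over the subsets that are not degraded."""
    excluded = set(degraded)
    healthy = [name for name in subsets if name not in excluded]
    if not healthy:
        raise ValueError("no healthy subsets left to route to")
    base, extra = divmod(100, len(healthy))
    # The first `extra` subsets get one more point so the total is exactly 100.
    return {name: base + (1 if i < extra else 0)
            for i, name in enumerate(healthy)}


def build_routes(host, weights):
    """Build the entries for a virtual service `route` list."""
    return [
        {"destination": {"host": host, "subset": name}, "weight": weight}
        for name, weight in weights.items()
    ]


# Hypothetical per-region subsets and service host.
SUBSETS = ["us-east1", "us-east2", "us-central1"]
HOST = "my-service.default.svc.cluster.local"

# Normal operation: all three subsets share the traffic.
normal = rebalance_weights(SUBSETS)

# Outage in us-east2: drop it and split its share between the other two.
outage = rebalance_weights(SUBSETS, degraded=["us-east2"])
routes = build_routes(HOST, outage)
```

One advantage of generating the routes this way is that the change is made in the virtual service on the healthy clusters, so it does not depend on reaching the degraded cluster's API server.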
Does anyone have any better ideas?