We’ve had an Istio service mesh running in our reasonably sized retail platform microservice landscape for a while now, and we’ve been experimenting more and more with fault injection.
Note: This next section borrows heavily from Site Reliability Engineering (SRE) terminology, which we’re trying to roll out to our engineering teams.
As a logical next step, we want to enable it in our production environment as a measure against overperforming on our SLOs. However, if an incident does consume a significant part of our error budget, we would like fault injection to be turned off automatically.
We could of course build something that simply changes the VirtualService configuration based on an alert or a metric change, but to me this seems like an obvious use case if you’re doing SRE with Istio, so I was thinking it could make sense to support it within Istio itself.
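For reference, the external workaround could look roughly like the sketch below. It is only an illustration under assumptions: the Prometheus address, the function names and the idea of applying a JSON merge patch are all mine, not anything Istio provides. It queries the Prometheus HTTP API for the availability ratio and builds a patch for `spec.http` that includes the fault only while availability is above the threshold.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values for illustration; adjust to your environment.
PROMETHEUS_URL = "http://prometheus.istio-system:9090"  # assumed address
HOST = "service.namespace.svc.cluster.local"
WINDOW = "7d"
THRESHOLD = 0.995  # 99% SLO with more than 50% of the error budget left


def availability_query(host, window):
    """Build the PromQL ratio of non-5xx requests to all requests."""
    selector = f'reporter="source", destination_service=~"{host}"'
    good = f'sum(rate(istio_requests_total{{{selector}, response_code!~"5.*"}}[{window}]))'
    total = f'sum(rate(istio_requests_total{{{selector}}}[{window}]))'
    return f"{good} / {total}"


def fetch_availability(prometheus_url, query):
    """Run an instant query against the Prometheus HTTP API and return one number."""
    url = f"{prometheus_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = json.load(resp)
    # An instant vector looks like [{"metric": {...}, "value": [ts, "0.997"]}]
    result = body["data"]["result"]
    if not result:
        raise RuntimeError("query returned no samples")
    return float(result[0]["value"][1])


def build_http_routes(host, inject_fault):
    """Return the spec.http list for the VirtualService, with or without the fault."""
    route = {"route": [{"destination": {"host": host}}]}
    if inject_fault:
        route["fault"] = {"abort": {"httpStatus": 500, "percentage": {"value": 0.2}}}
    return [route]


def build_merge_patch(host, availability, threshold=THRESHOLD):
    """Build a JSON merge patch that keeps the fault only above the threshold."""
    return {"spec": {"http": build_http_routes(host, availability > threshold)}}


# Example use (requires a reachable Prometheus, so it is left commented out):
# availability = fetch_availability(PROMETHEUS_URL, availability_query(HOST, WINDOW))
# print(json.dumps(build_merge_patch(HOST, availability)))
```

The printed JSON could then be applied with something like `kubectl patch virtualservice service -n namespace --type merge -p '<json>'`, run on a schedule. Note that a merge patch replaces the whole `http` list, so this only works as-is for a VirtualService as simple as the one in this post.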
What I have in mind is something along the lines of:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service
spec:
  hosts:
  - service.namespace.svc.cluster.local
  http:
  - route:
    - destination:
        host: service.namespace.svc.cluster.local
    fault:
      abort:
        httpStatus: 500
        percentage:
          value: 0.2
      ## This is the newly suggested bit
      when:
        metric: (sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local", response_code!~"5.*"}[7d])) / sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local"}[7d]))) > 0.995
Now I’ve structured the condition using PromQL here, which I guess doesn’t really make sense, but I hope it illustrates what I’m trying to achieve.
So basically the scenario is that there’s a 99% availability SLO on the service, which leaves a 1% error budget. To prevent the service from overperforming, you inject a 0.2% error rate (which should consume 20% of the error budget) as long as more than 50% of the error budget remains, i.e. while availability over the 7-day window stays above 99.5%.
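To double-check the numbers, here is the arithmetic as a small sketch. The function names are made up for this post. Rates are written as fractions here (0.002), whereas the Istio `percentage.value` field takes a percentage (0.2).

```python
def error_budget(slo):
    """Fraction of requests allowed to fail under the SLO, e.g. 0.99 -> 0.01."""
    return 1.0 - slo


def budget_share(injected_error_rate, slo):
    """Share of the error budget used up by a steady injected error rate."""
    return injected_error_rate / error_budget(slo)


def availability_threshold(slo, min_budget_remaining):
    """Availability above which more than min_budget_remaining of the budget is left."""
    return 1.0 - error_budget(slo) * (1.0 - min_budget_remaining)


slo = 0.99
# Rounded to hide floating-point noise.
print(round(budget_share(0.002, slo), 3))          # prints 0.2
print(round(availability_threshold(slo, 0.5), 4))  # prints 0.995
```

So a 0.2% abort rate uses a fifth of the 1% budget, and "more than half the budget left" corresponds to the `> 0.995` condition in the example above.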
Would it make sense to create a feature request for this? I did see some mention in other related issues that fault injection might have to be refactored to run server-side, but I don’t think that would change the need for this feature (although it would probably affect the implementation).