Conditional Fault Injection for controlled Error Budget consumption (Edit: through metric-based route matching)

We’ve had our Istio service mesh running in our reasonably sized retail platform’s microservice landscape for a while now, and we’ve been experimenting more and more with fault injection.

Note: This next section borrows heavily from Site Reliability Engineering (SRE) terminology, which we’re trying to roll out to our engineering teams.

As a logical next step, we want to enable it in our production environment as a measure against overperforming on our SLOs. However, if an incident consumes a significant part of our error budget, we would like fault injection to turn off automatically.

We could of course build something that simply changes the VirtualService configuration based on an alert or a metric change, but to me this seems like an obvious use case if you’re doing SRE with Istio, so I was thinking it could make sense to support it within Istio itself.
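For reference, the external workaround could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the function names are hypothetical, the measured availability is assumed to be fetched elsewhere (for example from Prometheus), and actually applying the resulting manifest to the cluster is left out.

```python
from typing import Any, Dict, Optional

# Host taken from the example in this post.
HOST = "service.namespace.svc.cluster.local"


def build_http_route(host: str, fault_percent: Optional[float]) -> Dict[str, Any]:
    """Build one VirtualService HTTP route, optionally with an abort fault."""
    route: Dict[str, Any] = {"route": [{"destination": {"host": host}}]}
    if fault_percent is not None:
        route["fault"] = {
            "abort": {
                "httpStatus": 500,
                # Istio's Percent value is expressed in percent (0.2 means 0.2%).
                "percentage": {"value": fault_percent},
            }
        }
    return route


def desired_virtual_service(
    availability: float,
    threshold: float = 0.995,    # assumption: inject only above 99.5% availability
    fault_percent: float = 0.2,  # assumption: 0.2% injected error rate
    host: str = HOST,
    name: str = "service",
) -> Dict[str, Any]:
    """Return the VirtualService manifest the external controller would apply.

    `availability` is the measured success ratio over the SLO window (0.0-1.0).
    """
    inject = availability > threshold
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": [host],
            "http": [build_http_route(host, fault_percent if inject else None)],
        },
    }
```

Something would then have to run this periodically and apply the result, which is exactly the extra moving part I’d like to avoid by having the condition live in the mesh itself.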

What I’m thinking of is something along the lines of:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service
spec:
  hosts:
    - service.namespace.svc.cluster.local
  http:
  - route:
    - destination:
        host: service.namespace.svc.cluster.local
    fault:
      abort:
        httpStatus: 500
        percentage:
          value: 0.2
      # Proposed new field (does not exist in Istio today)
      when:
        metric: (sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local", response_code!~"5.*"}[7d])) / sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local"}[7d]))) > 0.995

I’ve written the condition in PromQL here, which I guess doesn’t really make sense, but I hope it illustrates what I’m trying to achieve.

So the scenario is that there’s a 99% availability SLO on the service, which leaves a 1% error budget. To prevent the service from overperforming, you inject a 0.2% error rate (which should consume 20% of the error budget) as long as more than 50% of the error budget remains, i.e. as long as the measured availability over the window is above 99.5%.
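To make the arithmetic explicit, here is a small sketch of how those numbers relate. The helper names are made up for illustration, and it assumes the budget is simply 1 minus the SLO over the measurement window.

```python
def error_budget(slo: float) -> float:
    """Allowed failure ratio for an availability SLO, e.g. 0.99 -> about 0.01."""
    return 1.0 - slo


def budget_share_of_error_rate(error_rate: float, slo: float) -> float:
    """Fraction of the error budget that a steady error rate uses up."""
    return error_rate / error_budget(slo)


def availability_threshold(slo: float, min_budget_remaining: float) -> float:
    """Availability above which at least `min_budget_remaining` of the budget is left."""
    return 1.0 - error_budget(slo) * (1.0 - min_budget_remaining)


if __name__ == "__main__":
    slo = 0.99
    # A 0.2% injected error rate against a 1% budget is roughly a 20% share.
    print(round(budget_share_of_error_rate(0.002, slo), 6))
    # Requiring more than 50% budget remaining gives roughly the 0.995 cut-off
    # used in the metric condition.
    print(round(availability_threshold(slo, 0.5), 6))
```

Up to floating-point rounding, this gives the 20% share and the 0.995 threshold used in the condition above.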

Would it make sense to create a feature request for this? I did see some mention in other related issues that fault injection might have to be refactored to go server-side, but I don’t think that would change the need for this feature (though it would probably affect the implementation).

I’m not sure if replying to my own topic is the right way to update this, let me know if I should edit my original post instead.

I was thinking about it this morning, and having a separate condition property on the HTTPFaultInjection object is probably not the right way to go about this. Fault injection is of course already conditional; those conditions are just managed by the route’s match clause. So putting a metric-based selector in the match clause makes a lot more sense, I think.

As a matter of fact, this enables something else (which is surprisingly unrelated) that I’ve been thinking about for a while now: metric-aware load balancing and metric-aware traffic mirroring (a combination of which would allow one to directly engineer e.g. resources vs. tail latency).

So that would turn the proposed solution into something like:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: service
spec:
  hosts:
    - service.namespace.svc.cluster.local
  http:
  - route:
    - destination:
        host: service.namespace.svc.cluster.local
    # Proposed: a metric-based match condition (not an existing Istio match field)
    match:
      - metric: (sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local", response_code!~"5.*"}[7d])) / sum(rate(istio_requests_total{reporter="source", destination_service=~"service.namespace.svc.cluster.local"}[7d]))) > 0.995
    fault:
      abort:
        httpStatus: 500
        percentage:
          value: 0.2

The fact that this feature enables both of these entirely separate but relevant use cases makes me quite enthusiastic about getting it done.

The biggest question marks for me are how the Istio metrics work internally (are they even available at the point where route determination is done, and is consulting them feasible from a resource point of view?) and what kind of selectors would make sense for this.

Have a nice day!