Help with circuit breaker weirdly bringing site down

We’re evaluating istio at work and we ran into a weird situation. The gist of it, as I understand it so far, is:

  • Adding a new empty node to our cluster under load somehow triggers a circuit breaker and kills 68% of our traffic
  • Circuit breaker remains active indefinitely
  • Scheduling an extra pod to the application resets the circuit breaker and everything goes back to normal

It sounds very weird, and we managed to reproduce it twice on the same day, then never again. I’ll provide as much detail as I can on our setup and what we observed when the issue happened, and I’d love some pointers on what could help us make sense of this situation.

Topology

image

We have two node groups, one for cluster-critical stuff called system-workers, with 4 nodes, and one for applications to be scheduled in, called user-workers.

The scenario

We were running a load test of our application, with:

  • 180k requests per minute
  • 31 nodes
  • 334 pods
  • CA and HPA disabled
  • istio 1.1.7

The failure

We had no pending pods and manually added a new node to the user-workers group, which triggered this failure:

image

  • Requests being processed went down from 187k rpm to 49k rpm, stabilizing at around 58k rpm.
  • Istio started reporting UO (upstream overflow) errors, though delayed by one sample (15s); not sure if that is precise or relevant
  • Response time took a hit, but not a major one

Some extra details:

  • This node had only daemonsets not tied to istio, and no regular, deployment-scheduled pods
  • The node was getting traffic through the NodePort we use for istio-ingressgateway
  • The node’s istio health-check was healthy right away (from ALB’s perspective). That alone doesn’t exclude the possibility of this NodePort being broken, but the size of the dip should: this node is 1/32nd of the fleet, yet we saw a two-thirds dip in traffic
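
For scale, here is the back-of-the-envelope arithmetic behind that last point, using the numbers from this post. It assumes the ALB spreads requests evenly across nodes, which is an assumption and not something measured here.

```python
# Numbers taken from the load test described in this post.
nodes_after_add = 32      # 31 nodes plus the newly added one
rpm_before = 187_000      # requests per minute before the node was added
rpm_stable = 58_000       # requests per minute once the dip stabilized

# Share of traffic the new node would receive under an even spread
# across nodes (assumption: the ALB balances evenly).
new_node_share = 1 / nodes_after_add

# Share of traffic that actually disappeared.
observed_drop = (rpm_before - rpm_stable) / rpm_before

print(f"new node share: {new_node_share:.1%}")
print(f"observed drop:  {observed_drop:.1%}")
```

Even if every request routed through the new node had failed, that would account for roughly 3% of traffic, while the observed drop was roughly 69%.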

The situation continued like this for 30 minutes, until we scheduled a new app_pod, and then everything came back to normal.

image

We were puzzled as to why adding a new node to a node group which runs nothing istio-related would have this effect. Looking at our logs, we observed that adding a new node triggers lots of pilot messages:

All of these push messages happen when we add nodes. We’re not sure if this is related, or relevant, but it’s the only way we noticed adding a node affects istio.

The fact that scheduling a pod changed things excludes for us the possibility of this being caused by anything external to the cluster (say, AWS). It seemed then that ticking the clock on pilot is what brought things back to a healthy state.

Loading the website during the “outage” would give us this message, more frequently than not: upstream connect error or disconnect/reset before headers. reset reason: overflow

Someone on Github said:

Overflow means "the stream was reset because of a resource overflow". Essentially, I believe you tripped the envoy circuit breakers on the upstream cluster.

These circuit breakers are built into envoy to help prevent cascading failures. The default value for the limit that you hit, maximum pending requests, is 1024, and when that limit is hit, it starts to shed load (i.e. returning reset reason overflow). See more info here
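
To make the quoted mechanism concrete, here is a toy model of a maximum-pending-requests limit. It is only a sketch of the idea described in the quote, not envoy's implementation: the class name `PendingLimit` and its methods are made up for illustration, and only the 1024 default comes from the quote.

```python
from dataclasses import dataclass


@dataclass
class PendingLimit:
    """Toy model of a max-pending-requests limit (hypothetical, not envoy code)."""
    max_pending: int = 1024   # default value quoted above
    pending: int = 0          # requests currently waiting for a connection
    overflowed: int = 0       # requests shed because the queue was full

    def enqueue(self) -> bool:
        """Try to queue one request; return False if it is shed."""
        if self.pending >= self.max_pending:
            self.overflowed += 1
            return False
        self.pending += 1
        return True

    def dequeue(self) -> None:
        """A queued request got a connection and left the queue."""
        if self.pending > 0:
            self.pending -= 1


limit = PendingLimit()
# A burst of 1500 requests while nothing drains the queue.
accepted = sum(limit.enqueue() for _ in range(1500))
print(accepted, limit.overflowed)
```

In this model the limit only keeps shedding load for as long as the queue fails to drain; as soon as one request is dequeued, the next one is accepted again.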

I don’t have proxyv2 or application logs from when the failure happened to help inspect this further.

This shifts the understanding to something like: adding a pod or a node ticks pilot which then ticks the circuit breaker in some way.

I don’t understand istio metrics related to pending requests. Nothing indicates a circuit breaker should be breaking:

We’re using the default 1024 maximum pending requests. Some metrics show absolutely no pending, one metric shows a constant 334~335 active requests (1 per pod? I guess it makes sense since we’re using http2?).

Help?

I’m puzzled as to why adding a node triggered this, and why adding a pod fixed it.

I reproduced this twice in succession: removing the 32nd node, adding it again, seeing the world burn, adding a pod, seeing it go back to normal.

I tried reproducing it again at a different time under similar conditions but could not get it to fail like this.

Some questions for folks kind enough to help:

  • Is there some metric you would like to see if you were to try to diagnose this failure?
    • I’d love to see the % of pending requests relative to the maximum, but I haven’t found how to get that reliably.
  • Is there any other variable you’d consider trying to reproduce this?
    • I have tested bigger clusters
    • I have tested more pods
    • I have tested fewer pods and more traffic (to see if underprovisioning plays a role)
  • Does this behavior make sense to you?
    • I don’t get why, if this is a circuit breaker kicking in, it would remain active indefinitely and then be reset by adding a new pod.
    • I also don’t get how adding a node without changing the fire rate of our load tester could affect the pending requests and trigger this circuit breaker.
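
On the "% of pending requests relative to the maximum" question above, one possible approach is to read envoy's per-cluster stats as text and compute the ratio by hand. The sketch below assumes the stats text comes from a proxy's stats output in the plain `name: value` format and that the proxy actually emits per-cluster stats, which depends on its stats configuration. The stat names `upstream_rq_pending_active`, `upstream_rq_pending_overflow` and `circuit_breakers.default.rq_pending_open` are envoy cluster stats; the function name `pending_usage`, the cluster name and the sample values are hypothetical.

```python
def pending_usage(stats_text: str, cluster: str, max_pending: int = 1024) -> dict:
    """Summarize pending-request pressure for one upstream cluster."""
    prefix = f"cluster.{cluster}."
    values = {}
    for line in stats_text.splitlines():
        name, sep, raw = line.rpartition(": ")
        if not sep or not name.startswith(prefix):
            continue
        try:
            values[name[len(prefix):]] = int(raw)
        except ValueError:
            continue  # skip histograms and other non-integer lines
    active = values.get("upstream_rq_pending_active", 0)
    return {
        "pending_active": active,
        "pending_pct_of_max": 100.0 * active / max_pending,
        "pending_overflow_total": values.get("upstream_rq_pending_overflow", 0),
        "breaker_open": values.get("circuit_breakers.default.rq_pending_open", 0) == 1,
    }


# Hypothetical sample, not captured from the cluster in this post.
sample = """\
cluster.outbound|80||app.default.svc.cluster.local.circuit_breakers.default.rq_pending_open: 1
cluster.outbound|80||app.default.svc.cluster.local.upstream_rq_pending_active: 1024
cluster.outbound|80||app.default.svc.cluster.local.upstream_rq_pending_overflow: 5321
"""
usage = pending_usage(sample, "outbound|80||app.default.svc.cluster.local")
print(usage)
```

Sampling this per ingress gateway pod during a repro would show whether the pending queue is really pinned at the limit or whether the overflow counter is climbing for some other reason.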

I’d super appreciate any pointers from more experienced folks on how I could make sense of this, repro it, and hopefully prevent it from happening when we go live with istio.