Debugging Pilot CPU spikes

Hello,

I noticed that Pilot has CPU spikes from time to time. The spikes coincide with bursts of push errors (timeouts) lasting 2-3 minutes. Meanwhile, EDS pushes went from 2 ops to 60+ and back to 2; the same happened for CDS, LDS, and RDS, but with lower values.

In the Pilot pods I found logs like these:

...
2019-09-02T12:04:37.188365Z    warn    ads    Failed to push, client busy sidecar~10.200.144.16~<POD1>-25524
2019-09-02T12:04:37.079611Z    warn    ads    Failed to push, client busy sidecar~10.200.134.218~<POD2>-25596
...

To investigate, I ran istioctl proxy-status and got the following output:

NAME                                                                                      CDS        LDS        EDS               RDS        PILOT                            VERSION
<POD1_SVC_A>                        SYNCED     SYNCED     SYNCED (100%)     SYNCED     istio-pilot-864c44c7f8-rn7v7     1.0-dev
<POD1_SVC_B>                        SYNCED     SYNCED     SYNCED (100%)     SYNCED     istio-pilot-864c44c7f8-bww7j     1.2.4
<POD2_SVC_B>                        SYNCED     SYNCED     SYNCED (100%)     SYNCED     istio-pilot-864c44c7f8-9bspg     1.0-dev
...
// many more pods from different services with same values as above

Everything looked OK except the VERSION column. Why are there multiple versions? Both Pilot and istio-proxy were running the official istio 1.2.4 image.

Moreover, what should I look for when Pilot has this kind of random behaviour?

I am running Istio 1.2.4, installed via Helm, on an on-premises cluster.

Thank you!

The version issue is unrelated, something we are working on fixing.

Pilot is inherently spiky. In a stable cluster there is nothing for Pilot to do, but when you change a config, spin up a new service, etc., Pilot must push the updated config to all of the proxies.

Thank you for clearing that up. Indeed, the version issue turned out to be different Istio images running under the same tag.

Following up on the spiky behaviour: I noticed that the CPU spikes almost always happen when istio-pilot reports push errors or timeouts.
Moreover, I believe I am running a lot of Pilots for my cluster size and its rate of change. The mesh has ~500 services and 680 pods, and I am running 5 Pilots, each with a 3 CPU limit and 4 GB of RAM. That works out to ~136 connected endpoints per istio-pilot instance.

I am not sure what "Failed to push, client busy sidecar…" really means.

One interesting thing I noticed while fixing the version issue mentioned above: if I delete/create pods at a rate of 1-2/s with the Pilot setup described above, the Pilots always max out on CPU and throw timeouts and errors at a constant rate of 15+.

That log line means it took longer than the push timeout (I think it was 5s in 1.2) to send config to a sidecar.

This is probably because Pilot spikes hard enough to hit its CPU limit and gets throttled, which slows it down enough to cause those timeouts.
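If throttling is the cause, you could raise (or remove) Pilot's CPU limit. A minimal sketch, assuming the 1.2 Helm chart's pilot.resources values; the numbers here are illustrative, not a recommendation:

# values override for the istio Helm chart (key layout assumed from the 1.2 chart)
pilot:
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "6"        # raise above the observed spike, or drop the limit entirely
      memory: 4Gi

Applied with helm upgrade -f; whether dropping the limit entirely is acceptable depends on how much headroom your nodes have.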

680 pods and 15 CPUs of Pilot is pretty high. You can try to limit the number of pushes Pilot makes; see istio/istio issue #14283 on GitHub ("serviceB gateway deployment has high performance impact on serviceA traffic over ingressgateway") for instructions, and the first sketch below. Also, if you are not already doing so, limit what config Pilot pushes by setting up Sidecar resources (the egress section); the second sketch below shows a mesh-wide default.
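For reference, here is a sketch of those push-throttling knobs as environment variables on the discovery container. The variable names (PILOT_DEBOUNCE_AFTER, PILOT_DEBOUNCE_MAX, PILOT_PUSH_THROTTLE) exist in 1.2, but treat the values and the quoted defaults below as illustrative:

# excerpt of the istio-pilot Deployment; only the env section matters here
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-pilot
  namespace: istio-system
spec:
  template:
    spec:
      containers:
      - name: discovery
        env:
        - name: PILOT_DEBOUNCE_AFTER    # how long to wait after a config event before pushing
          value: "1s"
        - name: PILOT_DEBOUNCE_MAX      # max time a push can be delayed while events keep arriving
          value: "10s"
        - name: PILOT_PUSH_THROTTLE     # max number of concurrent pushes
          value: "50"

And a sketch of a mesh-wide default Sidecar resource: placed in the root namespace, it limits every sidecar's egress config to its own namespace plus istio-system, which shrinks what Pilot has to compute and push on every change:

apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: istio-system    # the root namespace, so this applies mesh-wide unless overridden
spec:
  egress:
  - hosts:
    - "./*"                  # services in the workload's own namespace
    - "istio-system/*"       # plus the control-plane namespace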

Also, 1.3 has major performance improvements to Pilot, sometimes 5-10x. I would recommend you give it a shot when it comes out (probably this week). Release candidates are already available, although there was a performance regression in the last RC that will be fixed before the official release.