I noticed that Pilot had CPU spikes from time to time. These spikes happen when it had lots of push errors (timeouts) for 2-3 mins. Meanwhile, eds pushes went from 2 ops to 60+ and back to 2; same for cds, lds, rds, but with lower values.
In Pilot pods I found this type of logs:
...
2019-09-02T12:04:37.188365Z warn ads Failed to push, client busy sidecar~10.200.144.16~<POD1>-25524
2019-09-02T12:04:37.079611Z warn ads Failed to push, client busy sidecar~10.200.134.218~<POD2>-25596
...
In order to investigate, I ran istioctl proxy-status I get the following output
NAME CDS LDS EDS RDS PILOT VERSION
<POD1_SVC_A> SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-864c44c7f8-rn7v7 1.0-dev
<POD1_SVC_B> SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-864c44c7f8-bww7j 1.2.4
<POD2_SVC_B> SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-864c44c7f8-9bspg 1.0-dev
...
// many more pods from different services with same values as above
Everything looked ok, except the VERSION column. Why are there multiple versions? Both Pilot and istio-proxy had running a istio:1.2.4 official image.
Moreover, what should I look for when Pilot has this kind of random behaviour?
I am running istio 1.2.4, installed via Helm, on-premise cluster.
The version issue is unrelated, something we are working on fixing.
Pilot is inherently spiky. In a stable cluster, there is nothing for pilot to do, but when you change a config, spin up a new service, etc, pilot must push all of this config to all of the proxies.
Thank you for clearing that out. Indeed, the version issue was related to some different istio images running with the same tag.
Following up on the spiky behaviour: I notice that cpu spikes are almost always happening when istio-pilot reports push errors or timeouts.
Moreover, I belive I am running a lot of pilots for my cluster size and its rate of change. The mesh has ~500 services and 680 pods and I am running 5 pilots with 3 CPU limit and 4 GB ram. This makes ~120 connected endpoints per istio-pilot instance.
I am not sure what Failed to push, client busy sidecar really means…
One interesting fact that I noticed while fixing the issue about the version mentioned above, if I delete/create pods at a rate of 1-2/s, running the setup of pilots as previously mentioned, they always go full on CPU and they throw timeouts and errors at a constant rate of 15+.
Also 1.3 has major performance improvements to Pilot - sometimes 5-10x. I would recommend you give it a shot when it comes out (probably this week, or release candidates are available although there was a performance regression in the last RC that will be fixed before the official release)