High Amount of Pilot Push Errors

We are running Istio 1.2.3 on EKS and seeing a high rate of Pilot push errors: the error rate reaches 5-8% every 5 to 10 minutes.

I'm using this query to calculate the error percentage, and it is frequently 5-8%:

sum(rate(pilot_xds_push_errors{job="pilot"}[1m])) / sum(rate(pilot_xds_pushes{job="pilot"}[1m]))
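To make the arithmetic behind that ratio concrete, here is a minimal Python sketch. It assumes both counters are sampled at the start and end of the same window, so the ratio of the two rates equals the ratio of the two counter increases. The function name and the sample values are hypothetical, and returning 0.0 for an empty window is a choice made in this sketch, not what the query itself does.

```python
def push_error_percent(errors_start, errors_end, pushes_start, pushes_end):
    """Percent of pushes that errored between two counter samples."""
    error_delta = errors_end - errors_start
    push_delta = pushes_end - pushes_start
    if push_delta <= 0:
        # No pushes in the window, so there is no meaningful ratio.
        return 0.0
    return 100.0 * error_delta / push_delta


# Hypothetical counter samples taken one minute apart:
# 15 new errors out of 300 new pushes.
print(push_error_percent(1200, 1215, 40000, 40300))  # prints 5.0
```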

I also notice that our pushes spike as high as 900 ops and are frequently at 300 ops every 5 minutes.


When I check the Pilot logs, I frequently see entries like this:

transport: http2Server.HandleStreams failed to read frame: read tcp 172.99.99.xx:15010-> use of closed network connection
istio-pilot-aaa-bbbb discovery 2019-08-21T14:00:54.823655Z	info	transport: loopyWriter.run returning. connection error: desc = "transport is closing"
istio-pilot-aaa-bbbb discovery 2019-08-21T14:00:54.823703Z	info	ads	ADS: "172.99.99.xx::36838" sidecar~ terminated rpc error: code = Canceled desc = context canceled

What more can I do to debug this issue, and what might be causing it?

If all of the errors are like this, it should not be a concern. This happens any time a pod is terminating, or every 30 minutes (the Pilot/Envoy connection terminates every 30 minutes). It just means Pilot started sending a push but the connection closed.
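One way to check whether all of the errors really are "like this" is to filter the Pilot logs for error-looking lines that do not match the expected disconnect messages. The following is a rough Python sketch, not an official tool: the marker strings are copied from the log lines quoted above, while the function names, the keyword list, and the third sample line are made up for illustration.

```python
# Substrings taken from the log lines quoted in this thread. Treating them as
# markers of an expected disconnect is an assumption of this sketch, not an
# official Istio classification.
EXPECTED_MARKERS = (
    "use of closed network connection",
    "transport is closing",
    "context canceled",
)


def is_expected_disconnect(line):
    """True if the line contains one of the expected disconnect messages."""
    return any(marker in line for marker in EXPECTED_MARKERS)


def unexpected_lines(lines, keywords=("error", "failed")):
    """Return lines that look like errors but match no expected marker."""
    result = []
    for line in lines:
        lowered = line.lower()
        looks_like_error = any(keyword in lowered for keyword in keywords)
        if looks_like_error and not is_expected_disconnect(line):
            result.append(line.rstrip("\n"))
    return result


sample = [
    'info transport: loopyWriter.run returning. connection error: desc = "transport is closing"',
    'info ads ADS: "172.99.99.xx::36838" sidecar~ terminated rpc error: code = Canceled desc = context canceled',
    # Hypothetical line, invented only to show what would be flagged.
    "warn ads hypothetical push error: some other failure",
]

for line in unexpected_lines(sample):
    print(line)
```

With the sample above, only the hypothetical third line is printed. In practice the input could be the lines of a saved Pilot log file; anything the filter prints would be worth a closer look.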

In https://github.com/istio/istio/pull/15636 we stopped reporting this as an error, but that change is not in 1.2.

Thanks @howardjohn, but what about the high rate of push errors (5%) shown in the Pilot dashboard?

@crhuber That's what I mean: the push errors are not really errors, they are expected. That is why in https://github.com/istio/istio/pull/15636/files I changed them to no longer be reported as errors.