High Amount of Pilot Push Errors

crhuber · August 21, 2019, 2:23pm

We are running Istio 1.2.3 on EKS and seeing a high rate of Pilot push errors: the error rate reaches 5-8% every 5 to 10 minutes.

I'm using this query to calculate the error percentage, and it's frequently 5-8%:

sum(rate(pilot_xds_push_errors{job="pilot"}[1m])) / sum(rate(pilot_xds_pushes{job="pilot"}[1m]))
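For readers unfamiliar with the arithmetic behind this query, here is a minimal sketch of what the ratio means, assuming two samples of each counter taken 60 seconds apart. The function names and the sample values are made up for illustration, and the sketch ignores the counter-reset handling and extrapolation that Prometheus `rate()` performs.

```python
def rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase of a counter over a window (simplified: no reset handling)."""
    return (last - first) / window_seconds


def error_ratio(err_first: float, err_last: float,
                push_first: float, push_last: float,
                window_seconds: float = 60.0) -> float:
    """Ratio of the push-error rate to the push rate over the same window."""
    push_rate = rate(push_first, push_last, window_seconds)
    if push_rate == 0:
        # No pushes in the window, so there is nothing to divide by.
        return 0.0
    return rate(err_first, err_last, window_seconds) / push_rate


# Hypothetical samples: 6 new errors and 100 new pushes within one minute.
ratio = error_ratio(err_first=100, err_last=106, push_first=1000, push_last=1100)
print(f"{ratio:.1%}")
```

Because both rates use the same window, the window length cancels out and the result is simply new errors divided by new pushes.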

I also notice that our pushes spike as high as 900 ops and are frequently at 300 ops every 5 minutes:

sum(rate(pilot_xds_pushes{type!~".*_senderr"}[1m]))

When I check the Pilot logs, I frequently see entries like this:

transport: http2Server.HandleStreams failed to read frame: read tcp 172.99.99.xx:15010->172.99.99.99:36838: use of closed network connection
istio-pilot-aaa-bbbb discovery 2019-08-21T14:00:54.823655Z	info	transport: loopyWriter.run returning. connection error: desc = "transport is closing"
istio-pilot-aaa-bbbb discovery 2019-08-21T14:00:54.823703Z	info	ads	ADS: "172.99.99.xx::36838" sidecar~172.99.99.99~podname.svc.cluster.local-34312 terminated rpc error: code = Canceled desc = context canceled

What more can I do to debug this issue, and what might be causing it?

howardjohn · August 21, 2019, 6:01pm

If all of the errors are like this, it should not be a concern. This will happen any time a pod is terminating, or every 30 minutes (the Pilot/Envoy connection terminates every 30 minutes). It just means Pilot started sending a push but the connection closed.

In the pull request "Don't report send error for expected errors" (istio/istio #15636) we stopped reporting this as an error, but that change is not in 1.2.
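One way to check whether "all of the errors are like this" is to tally the error-looking lines in a saved Pilot log. The sketch below is only a rough heuristic: the three substrings come from the log excerpts quoted in this thread, while the function names and the rule for what counts as an error-looking line (contains "error" or "failed") are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the log excerpts quoted in this thread; they indicate
# that the connection was closed while a push was in flight.
EXPECTED_PATTERNS = (
    "use of closed network connection",
    "transport is closing",
    "context canceled",
)


def looks_like_error(line: str) -> bool:
    """Assumed heuristic: treat lines mentioning 'error' or 'failed' as error lines."""
    lowered = line.lower()
    return "error" in lowered or "failed" in lowered


def classify(line: str) -> str:
    """Label an error line as 'expected' (connection closed) or 'other'."""
    if any(pattern in line for pattern in EXPECTED_PATTERNS):
        return "expected"
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count expected versus other error lines, skipping lines that are not errors."""
    return Counter(classify(line) for line in lines if looks_like_error(line))
```

For example, `summarize(open("pilot.log"))` on a saved copy of the log (the file name is hypothetical) returns the two counts. A non-zero `other` count would point to lines worth reading individually.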

crhuber · August 22, 2019, 6:49am

Thanks @howardjohn, but what about the high rate of push errors (5%) seen in the Pilot dashboard?

howardjohn · August 22, 2019, 3:00pm

@crhuber that's what I mean: the push errors are not really errors, they are expected. That is why in https://github.com/istio/istio/pull/15636/files I made them not be reported as errors.
