We are running production workloads with Istio 1.14 and noticed that, during a specific time window, the request latency reported to the telemetry component for client-invoked traffic increased from 50-60 ms to 6-7 seconds. At the same time, we started seeing 500 (Internal Server Error) response codes from Envoy. We are trying to understand in which cases Envoy returns a 500. The only case I could find in the documentation and source code is when the response body must be buffered and exceeds the buffer limit. That is certainly not our situation: the 500s also occurred on a health-check endpoint, in addition to other endpoints, and its response body is very small.
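To make the buffer-limit case concrete, here is a minimal, hypothetical model (not Envoy's actual code) of a proxy filter that must buffer the whole upstream response body before forwarding it. The function name, the `limit` parameter and the empty 500 body are assumptions for illustration only. It shows why a tiny health-check body cannot trigger this path: the 500 is only produced when the buffered body would grow past the limit before any response headers have been sent downstream.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProxyResponse:
    status: int
    body: bytes


def buffer_upstream_response(status: int, chunks: Iterable[bytes], limit: int) -> ProxyResponse:
    """Hypothetical model of a buffering proxy filter.

    The proxy accumulates every body chunk from the upstream before sending
    anything downstream. If adding a chunk would push the buffered size past
    `limit` bytes, the upstream response is discarded and replaced with a 500.
    """
    buffered = bytearray()
    for chunk in chunks:
        if len(buffered) + len(chunk) > limit:
            # Buffer overflow before headers were forwarded: local 500 reply.
            return ProxyResponse(500, b"")
        buffered.extend(chunk)
    # Whole body fit in the buffer: forward the upstream status unchanged.
    return ProxyResponse(status, bytes(buffered))


# A small health-check body stays far below the limit and passes through.
print(buffer_upstream_response(200, [b"OK"], limit=1024).status)
# A body that outgrows the buffer is turned into a 500.
print(buffer_upstream_response(200, [b"x" * 600, b"x" * 600], limit=1024).status)
```

Under this model, a 500 on an endpoint whose body is a few bytes has to come from some other cause than the buffer limit.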
I also encountered a lot of 500s and 503s when I upgraded to Istio 1.1. I was able to attribute the 503s to the new memory limits set in Istio 1.1. Our proxies on 1.0 hover around 130-140 MB of memory usage, so enforcing a very strict memory limit gets Envoy killed, and 503s ensue.
I don't know what caused the 500s yet. Our applications behind Envoy never received those requests; if they had, we would have seen them in the logs.
What's the status of your problem? I have encountered a similar problem in Istio 1.3.0, mostly with HTTP requests, at around 30K requests. Latency increased from 50 ms to 60,000 ms, and those long-latency requests returned a 500 response code.