Istio Ingress Gateway Queuing Requests for 10+ seconds?

Hey everyone,

So we’ve recently enabled the tracing options for Istio in our clusters, and I’ve noticed that the ingress-gateway seems to be holding/queuing up requests for several seconds at a time :thinking:

For example, this request seems to have been held for 10 seconds at the ingress gateway before being passed on to the “mini-main” service in the “prod” namespace. This could’ve been a network issue, but that seems unlikely, as I’ve only observed it at hops between the ingress gateway and other services.

We were running only 4 gateways (and their CPU usage was very low). After this I tried scaling them up to 12 but saw no meaningful reduction in the p90 or p99 latency, and the behavior persisted. I’ve considered that this could instead be a problem with the tracing data, but that is generated by Istio as well.

Anyway, is there maybe some fine tuning I could do, like some undocumented request queue size I could edit? Maybe someone has had a similar issue? Alternatively, is there a way I could get metrics from the ingress gateway (like some sort of request queue size, etc.)?

Thanks!

Edit: I did find this post which might be helpful, and I realized that I could start scraping Prometheus metrics directly from the Envoy proxies in the ingress gateway. I guess monitoring the rq_total, rq_active and downstream_rq_active metrics might help in this case :thinking: will try that out tomorrow (we have Istio metrics in Prometheus but I didn’t realize until now how useful raw Envoy metrics would be).
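For anyone wanting to try the same thing, here is a rough sketch of pulling those counters out of the Prometheus text format that Envoy emits. It assumes you have already fetched the text yourself (in Istio the proxy usually serves it at /stats/prometheus on port 15090, but check your own setup). The sample input and its label names are made up for illustration, and the regex is deliberately simple: it will not cope with a "}" inside a label value.

```python
import re
from collections import defaultdict

# Metric name suffixes of interest, taken from the post above.
# "downstream_rq_active" is already covered by the "rq_active" suffix.
SUFFIXES = ("rq_total", "rq_active")

# One exposition line: metric name, optional {labels}, then the value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_metrics(text, suffixes=SUFFIXES):
    """Sum all samples per metric name for names ending in one of the suffixes."""
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, value = match.groups()
        if name.endswith(suffixes):
            totals[name] += float(value)
    return dict(totals)


# Made-up sample in the Prometheus text format (not real gateway output).
SAMPLE = """\
# TYPE envoy_cluster_upstream_rq_total counter
envoy_cluster_upstream_rq_total{cluster_name="a"} 120
envoy_cluster_upstream_rq_total{cluster_name="b"} 30
# TYPE envoy_http_downstream_rq_active gauge
envoy_http_downstream_rq_active{prefix="x"} 7
"""

if __name__ == "__main__":
    for name, total in sorted(sum_metrics(SAMPLE).items()):
        print(name, total)
```

Summing across all label sets gives one number per metric for a single gateway pod; to compare gateways you would run it once per pod.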

Also, scaling up further actually might have helped, but it might be a bit early to tell.

Yeah, I’ve managed to load the metrics into Prometheus, but they don’t seem all that helpful. As a matter of fact, the request (rq_) and connection (cx_) metrics don’t seem to reflect the connections in the cluster. I’m not sure what they are meant to reflect, really.

For example, envoy_cluster_upstream_rq_total (which should be the “total requests” at the proxy) is increasing painfully slowly at the clusters, despite having thousands of requests per minute.
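A raw counter like this is a running total per series, so it can be easier to judge as a per-second rate over a window than by eyeballing the absolute value. Below is a minimal sketch of that calculation, assuming you have (timestamp, value) samples for a single series. The function name and the sample numbers are hypothetical, and the reset handling is a simplification rather than exactly what Prometheus does.

```python
def counter_rate(samples):
    """Average per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. the pod restarted),
        # so count the new value as the increase since the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical samples taken 30 seconds apart, with a reset in between.
    print(counter_rate([(0, 100), (30, 160), (60, 30)]))
```

If the rate for one series looks far too low for the traffic you expect, it may be worth checking whether the requests are spread across many series (one per upstream cluster and per gateway pod) that need to be summed.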

So I guess these metrics don’t really help.

The original problem seems to be gone (although this is hard to confirm without a sure way to monitor pending connections), but I’m now running 25 ingress gateways and allocating 6GB of RAM for them all, which seems like overkill :thinking: