So we’ve recently enabled the tracing options for Istio in our clusters, and I’ve noticed that the ingress gateway seems to be holding/queuing requests for several seconds at a time.
For example, here this request seems to have been held for 10 seconds at the ingress gateway before being forwarded to the “mini-main” service in the “prod” namespace (this could’ve been a network issue, but that seems unlikely, since I’ve only observed it specifically at hops between the ingress gateway and other services…).
We were running only 4 gateways (and their CPU usage was very low). After this I tried scaling them up to 12, but saw no meaningful reduction in the p90/p99 latencies, and the behavior persisted. I’ve also considered that this could instead be a problem with the tracing data itself – but that data is generated by Istio as well.
Anyway, is there maybe some fine-tuning I could do, like some undocumented request queue size I could edit? Has anyone had a similar issue? Alternatively, is there a way to get metrics out of the ingress gateway (some sort of request queue size, etc.)?
Edit: I did find this post, which might be helpful, and I realized I could start scraping Prometheus metrics directly from the Envoy proxies in the ingress gateway. I’m guessing that monitoring the rq_total, rq_active and downstream_rq_active metrics might help in this case – will try that out tomorrow (we have Istio metrics in Prometheus, but didn’t realize until now how useful the raw Envoy metrics would be).
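For anyone wanting to try the same thing, here’s a rough sketch of how I plan to pull the raw Envoy stats from a gateway pod. The namespace, label selector, and metric names below are the Istio defaults – adjust for your install (e.g. if your gateway runs elsewhere or is labeled differently):

```shell
# Grab one ingress-gateway pod (default Istio install puts them in
# istio-system with the app=istio-ingressgateway label).
GW_POD=$(kubectl -n istio-system get pod -l app=istio-ingressgateway \
  -o jsonpath='{.items[0].metadata.name}')

# Dump the raw Envoy stats via the pilot-agent helper inside the pod
# and filter for request/queue-related counters and gauges.
kubectl -n istio-system exec "$GW_POD" -c istio-proxy -- \
  pilot-agent request GET stats | \
  grep -E 'downstream_rq_active|upstream_rq_active|upstream_rq_pending|rq_total'
```

The same stats are also exposed in Prometheus format on each proxy’s port 15090 at `/stats/prometheus`, which is what a Prometheus scrape job would target; the `upstream_rq_pending_*` family in particular should show whether requests are actually queueing in the gateway.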
Also, scaling up further might actually have helped, but it’s a bit early to tell.