I was using the Prometheus Operator with Istio 1.0, and Prometheus was getting metrics from Envoy via the statsd collector. When I upgraded to 1.1 I added statsd explicitly, since it is no longer added by default, but I don’t see the metrics anymore.
Are those metrics no longer available via the statsd collector? Is there any other requirement I am missing with the upgrade?
Envoy statsd export is disabled by default. You should be able to enable it via helm options. However, you can have Prometheus directly scrape the exposed envoy endpoint instead (which is what the default istio prometheus config does).
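Roughly, the stock config does something like the following (a sketch from memory, so the job name and relabeling may differ slightly in your version; the sidecar serves Prometheus-formatted stats on port 15090 at /stats/prometheus):
- job_name: 'envoy-stats'
  metrics_path: /stats/prometheus
  kubernetes_sd_configs:
  - role: pod
  relabel_configs:
  # keep only the sidecar's stats port (named http-envoy-prom)
  - source_labels: [__meta_kubernetes_pod_container_port_name]
    action: keep
    regex: '.*-envoy-prom'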
I do have statsd enabled. I tried the following approaches so far.
- Stock Prometheus + Grafana. I do see the envoy endpoints active in Prometheus targets.
- Prometheus operator + Service Monitor for Istio mixer & statsd -> The setup I had working with Istio 1.0.
- Prometheus operator + Service Monitor for Istio mixer & statsd + an additional scrape config for Prometheus (from the stock Prometheus config), based on this issue.
I am still not getting server-side metrics like Server Request Volume, Response Size by Source, etc., in Grafana.
Has anyone verified that server-side metrics from Envoy are being pushed to Prometheus and shown in Grafana?
I have verified that the resources in this config suggested in Istio docs are present in my cluster.
@adityats the stats you are missing come from the Prometheus adapter for Mixer (the istio-telemetry service). Those are distinct from statsd entirely. I suspect that something is not correct with your Prometheus Operator config. FWIW, I’m currently working on a proper set of CRs for the operator to help in this scenario.
We have verified (and continue to verify) that server-side metrics are generated via Mixer and scraped by Prometheus (there are a number of e2e and stability tests for this use case).
I suggest looking through https://istio.io/help/ops/telemetry/missing-metrics/ to troubleshoot further.
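In the meantime, something like the following ServiceMonitor should cover the Mixer-generated (istio_*) metrics. This is only a sketch: the label selector is an assumption, so match it against whatever labels your istio-telemetry Service actually carries, and prometheus (42422) is the port name I’d expect on that Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-telemetry
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio-mixer-type: telemetry   # assumption -- verify against your Service's labels
  namespaceSelector:
    matchNames:
    - istio-system
  endpoints:
  - port: prometheus                # the endpoint serving the istio_* metrics
    interval: 15s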
@douglas-reid I figure I am missing the metrics from Envoy itself. I followed https://istio.io/help/ops/telemetry/envoy-stats/ to add the collection keys cluster_inbound, cluster_outbound, and listener that I had on my older cluster. I am able to see those metrics via
$ kubectl exec -it $POD -c istio-proxy -- sh -c 'curl localhost:15000/stats'
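For reference, I set the extra prefixes through the sidecar.istio.io/statsInclusionPrefixes annotation from that doc, roughly like this on the deployment’s pod template (paraphrasing; Envoy’s stat names use dots, which show up as underscores once scraped):
template:
  metadata:
    annotations:
      sidecar.istio.io/statsInclusionPrefixes: "cluster.inbound,cluster.outbound,listener"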
I also edited the default Prometheus spec, which was filtering out some of the Envoy metrics. The drop rules I removed are marked with a leading - below (I kept only the last one):
metric_relabel_configs:
# Exclude some of the envoy metrics that have massive cardinality
# This list may need to be pruned further moving forward, as informed
# by performance and scalability testing.
- - source_labels: [ cluster_name ]
-   regex: '(outbound|inbound|prometheus_stats).*'
-   action: drop
- - source_labels: [ tcp_prefix ]
-   regex: '(outbound|inbound|prometheus_stats).*'
-   action: drop
- - source_labels: [ listener_address ]
-   regex: '(.+)'
-   action: drop
- - source_labels: [ http_conn_manager_listener_prefix ]
-   regex: '(.+)'
-   action: drop
- - source_labels: [ http_conn_manager_prefix ]
-   regex: '(.+)'
-   action: drop
- - source_labels: [ __name__ ]
-   regex: 'envoy_tls.*'
-   action: drop
- - source_labels: [ __name__ ]
-   regex: 'envoy_tcp_downstream.*'
-   action: drop
- - source_labels: [ __name__ ]
-   regex: 'envoy_http_(stats|admin).*'
-   action: drop
- source_labels: [ __name__ ]
  regex: 'envoy_cluster_(lb|retry|bind|internal|max|original).*'
  action: drop
The metrics are still not showing up in Grafana. I will continue to look into this; just wanted to give you a heads up on where I am.
Also, I want to clarify that the metrics I am looking for are not Istio metrics but Envoy metrics for the containers I have deployed in the cluster. Those are the ones excluded by default, according to the docs I linked.
I think we have a misunderstanding here.
Your issue, as I understand it, is that some of the charts in grafana are not rendering any data. I believe you specifically called out Response Size By Source. That metric is an Istio metric.
In fact, almost all of the metrics displayed in the Istio grafana dashboards are Istio metrics (as opposed to envoy metrics).
No amount of exporting envoy stats or customizing the inclusion prefixes, etc., will help with your issue.
Instead, you need to troubleshoot why you aren’t scraping the Istio generated metrics from Mixer. I suspect that it is an issue with your operator configuration. Please see the troubleshooting guide I linked in my previous response.
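One quick check: port-forward the telemetry deployment and curl its metrics endpoint directly, e.g.
$ kubectl -n istio-system port-forward deploy/istio-telemetry 42422:42422
$ curl -s localhost:42422/metrics | grep istio_requests_total
(42422 should be the port named prometheus on the istio-telemetry service; double-check against your Service spec.) If the istio_* series are present there, generation is fine and the problem is purely on the scraping side.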
Hope that helps.
I do get all client-side metrics in Grafana: Client Request Volume, Client Success Rate, etc. The corresponding server-side metrics are what don’t show up.
When I try port forwarding to istio-telemetry, I get the following error:
404 page not found
Is there anything on the network side that could be causing this?
Metrics are missing only for services in the default namespace; services created in a different namespace are showing metrics. This indicates the issue isn’t caused by the Envoy stats whitelist changes I made.
I don’t see any configuration that could drop metrics from the default namespace. I have verified that the deployment spec is the same in both namespaces.
Are your services in the default namespace sidecar-injected?
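(A quick way to confirm: kubectl -n default get pods should show 2/2 containers for injected pods, and kubectl get namespace default -L istio-injection will show whether automatic injection is enabled for that namespace.)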
Yes, they are. I am trying to recreate the setup from scratch again. On the first attempt I am facing this issue: https://github.com/istio/istio/issues/9504
Will update later today if I can get past that and test the metrics again.
Update:
I didn’t face the envoy/pilot issues this time.
Regarding metrics, I am still seeing missing metrics for default namespace.
Compare nginx.default and httpbin.other-system metrics below. The spec is the same except for image and name.
Are the proxies in the default namespace having issues communicating with the istio-telemetry service? Are there routing rules or something else blocking the comms? Are you using Sidecar resources, perhaps?
Can you look in the proxy logs for your nginx service to see if there are REPORT failures, etc.?
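(Something like kubectl -n default logs <your-nginx-pod> -c istio-proxy | grep -i report should surface them.)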
There is one report failure in the logs.
[2019-04-19 21:16:56.591][28][warning][filter] [src/istio/mixerclient/report_batch.cc:106] Mixer Report failed with: UNAVAILABLE:upstream connect error or disconnect/reset before headers. reset reason: connection failure
I checked for TLS errors. I am running in PERMISSIVE mode with ControlPlaneAuthPolicy set to NONE.
istioctl authn tls-check nginx-8564d7d7d-l8dxb.default istio-telemetry.istio-system.svc.cluster.local
HOST:PORT STATUS SERVER CLIENT AUTHN POLICY DESTINATION RULE
istio-telemetry.istio-system.svc.cluster.local:9091 OK HTTP/mTLS mTLS default/ default/default
Update:
Destination rule was interfering with comms between Envoy and Mixer.
kubectl get DestinationRule
NAME HOST AGE
api-server kubernetes.default.svc.cluster.local 3h
default *.local 3h
Changed it to the following
kubectl get DestinationRule
NAME HOST AGE
api-server kubernetes.default.svc.cluster.local 3h
default *.default.svc.cluster.local 3h
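For completeness, the edited default rule now looks roughly like this; the only field I changed was the host, and the trafficPolicy is whatever the install generated (I believe ISTIO_MUTUAL, so treat that part as an assumption):
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: default
spec:
  host: "*.default.svc.cluster.local"   # was "*.local", which also matched istio-telemetry.istio-system
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL                # unchanged from what the install generated (assumption)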
Was there a change between 1.0 and 1.1? I have another setup (1.0) with the same destination rule and metrics work in that cluster.
@adityats I think this is worth filing an issue for and tagging it as networking-related. The behavior of DestinationRules for the default namespace is probably not being exercised by the e2e tests, and I bet that the impact of this change is not immediately clear.
@douglas-reid Created https://github.com/istio/istio/issues/13524
Please tag it appropriately.
Thanks.