Stackdriver not reflecting DestinationRule weight

#1

We are using Istio in production, and weighting Destination Rules to release newer versions in a canary-like fashion. We are also using the Stackdriver adapter. We updated the weight of a new version to 100% and set all other versions to 0%.

Viewing the logs of each version, we were able to confirm that no traffic was being routed to the old versions and that all traffic was correctly served by the new 100% version. But when reviewing Stackdriver, the metrics did not seem to reflect this (for any metric, really). We made this distribution change at approximately 2PM, and the old versions' metrics in Stackdriver didn’t trail off until ~7PM.

The following Stackdriver graph shows Istio Server Request Count (istio.io/service/server/request_count) for each version, filtered by source=istio-ingressgateway & destination=(our service name).

We’re curious about these observations, and have struggled to come up with a good rationale for why the graph suggests a five-hour trail-off that still looks like staggered traffic (not a straight drop to 0), even though we confirmed via application logging that traffic was routed as expected.

#2

A few questions:

  • Is it possible that there is a time sync issue with the various pods involved?
  • Are there a bunch of errors in the mixer logs related to Stackdriver by any chance?
  • Are you using any custom handler config for the Stackdriver adapter?

IIRC, the Stackdriver adapter buffers timeseries data for ~1m by default – and then retries on errors. But I would expect that this would lead to resolution relatively quickly.
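To make the buffer-and-retry behaviour concrete, here is a toy Python model. It is only a sketch of the general pattern and not the adapter's actual code: the class `BufferedReporter`, the helper `make_flaky_send`, and the 60-second interval are all hypothetical names and values chosen for illustration. It shows how a sample recorded at one time can reach the backend several flush cycles later if sends keep failing.

```python
from collections import deque


class BufferedReporter:
    """Toy buffering reporter (hypothetical; not the Stackdriver adapter's code)."""

    def __init__(self, send, flush_interval=60.0):
        self.send = send                    # callable that raises on failure
        self.flush_interval = flush_interval
        self.pending = deque()              # samples waiting to be sent
        self.last_flush = 0.0

    def record(self, now, sample):
        """Buffer a sample; flush if the interval has elapsed."""
        self.pending.append(sample)
        if now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        """Try to send everything buffered; keep it for a later retry on error."""
        self.last_flush = now
        batch = list(self.pending)
        if not batch:
            return True
        try:
            self.send(batch)
        except Exception:
            return False                    # batch stays in self.pending
        self.pending.clear()
        return True


def make_flaky_send(failures):
    """Return a send function that fails `failures` times, plus the delivered list."""
    delivered = []
    state = {"left": failures}

    def send(batch):
        if state["left"] > 0:
            state["left"] -= 1
            raise RuntimeError("backend unavailable")
        delivered.extend(batch)

    return send, delivered


send, delivered = make_flaky_send(failures=2)
reporter = BufferedReporter(send, flush_interval=60.0)
reporter.record(0.0, "request to old version")  # buffered, interval not yet elapsed
results = [reporter.flush(t) for t in (60.0, 120.0, 180.0)]
print(results, delivered)
```

In this model the sample recorded at t=0 is only delivered at t=180, after two failed flushes. That kind of lag is measured in minutes, though, which is why buffering alone would not be expected to explain a multi-hour tail.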

I guess it is also possible that your istio-telemetry service was so overloaded that it took 5 hours to work through the backlog, but that seems quite extreme.
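As a rough sanity check on the backlog idea, the arithmetic can be sketched as below. The function `drain_hours` and every number passed to it are made up for illustration; they are not measurements from this cluster.

```python
def drain_hours(backlog, arrival_per_s, drain_per_s):
    """Hours needed to clear `backlog` queued reports, or None if it never drains."""
    net = drain_per_s - arrival_per_s   # reports cleared per second, net of new arrivals
    if net <= 0:
        return None                     # the queue grows or stays flat
    return backlog / net / 3600.0


# Hypothetical numbers: 1.8M queued reports, 900/s arriving, 1000/s processed.
print(drain_hours(1_800_000, 900, 1000))
```

With these assumed values the backlog takes 5 hours to clear, which would require telemetry to run close to saturation for the whole period.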

#3

@douglas-reid We were the ones who reached out regarding a flood of Mixer errors around the Stackdriver adapter. Oddly enough, those errors stopped flooding around 5AM the morning after this deployment… after about 2 weeks of constant errors.

We’ll look into the other aspects you mentioned, namely time sync and anything that might be out of the norm with the SD adapter.

And thanks for the details of the buffer. We’ll continue to track as the Mixer errors have also crept into our Development environment…but I’m not sure telemetry is overloaded. Its utilization seems very low.

#4

@kamboyer fwiw, there is someone working on eliminating those REPORT timeout errors in the client now. I don’t fully know the progress of that work, but it should be coming along.

#5

This is another example we’ve observed. This particular event was shifting all traffic to a new cluster… and it appears to have taken nearly a day for Istio to stop reporting to Stackdriver after all traffic was moved away from the original cluster (orange).

FYI: These lines are Istio Request Count metrics grouped by cluster_name