503s to one of two subsets despite all pods being healthy

We ran into a problem with our cluster running Istio 1.4.2 on GKE.

Setup:

  • Istio 1.4.2
  • Two subsets (stable and master).
  • One k8s service.
  • One virtual service: requests carrying a cookie are routed to one subset (master); if the cookie is not set, they go to the other (stable).
  • Two pods in each subset.
  • Two pilots running.
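
For readers trying to reproduce the setup, a VirtualService along the following lines would implement the cookie routing described above. It is only a sketch, not the resource from this cluster: the cookie name (subset=master), the external hostname, and the gateway name are all assumptions. The destination host, subset names, and port match the destination rule and endpoint listings in this post.

```shell
# Sketch only: writes a hypothetical VirtualService manifest to a local file.
# The cookie name (subset=master), the external hostname, and the gateway name
# are assumptions; they do not come from the original configuration.
cat > portal-virtualservice.yaml <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: portal
  namespace: portal
spec:
  hosts:
    - portal.example.com        # hypothetical external hostname
  gateways:
    - portal-gateway            # hypothetical Gateway resource
  http:
    # Requests whose Cookie header contains subset=master go to master.
    - match:
        - headers:
            cookie:
              regex: '^(.*;\s*)?subset=master(;.*)?$'
      route:
        - destination:
            host: portal
            subset: master
            port:
              number: 5000
    # Everything else falls through to stable.
    - route:
        - destination:
            host: portal
            subset: stable
            port:
              number: 5000
EOF
```

The file could then be applied with kubectl apply -f portal-virtualservice.yaml. With this shape, every request without the cookie depends on the stable subset cluster having endpoints, which is why an empty stable cluster produces 503s for all default traffic.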

Symptom: all requests to stable receive a 503 with response_flags UH (no healthy upstream). Prometheus stats (envoy_cluster_membership_healthy) show no healthy members in the cluster corresponding to subset stable, while the cluster corresponding to the entire service shows 4. The EDS output retrieved with istioctl from one of the ingress gateways shows no rows for stable, but does show rows for master.

Recovery: Restarting one of the two running pilots resolved the problem. We first tried restarting one of the ingress gateways, which did not fix it. The pilot I restarted had more activity in its logs (various EDS/ADS push events) than the other, which is why I chose it.

I’m interested in understanding what could have gone wrong here, how to prevent it in the future, and whether this is a bug. Any thoughts or suggestions are greatly appreciated.

Relevant Information:

Timeline:

  • At around 16:24 GMT the node which was running both the pilot I restarted and the two stable pods rebooted.
  • When the node came back up the stable pods were started at around 16:25.
  • From the time the node rebooted until 16:44, both ingress gateways reported the stable cluster as having no members. This is quite a bit longer than expected. The pods were definitely up during this time, as the cluster containing all pods for the service showed 4 members in Prometheus.
  • From 16:44 onwards, one of the two ingress gateways was reporting the stable cluster as being down. Restarting that gateway fixed the problem. This may or may not be related.
  • At about 17:30 GMT, one of the ingress gateways reported the stable cluster as having no members in the envoy_cluster_membership_healthy stat.
  • At about 17:36 GMT the other ingress gateway reported the same.
  • The system stayed in this state until I restarted pilot.

Endpoints before restarting pilot:

istioctl proxy-config endpoints -n istio-system istio-ingressgateway-78b57d75d-kgfp2 | grep portal
10.16.0.20:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.20:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound|5000|master|portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound|5000|master|portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local

Endpoints after restarting pilot. Note how stable is now present, but we still have the same set of IPs:

istioctl proxy-config endpoints -n istio-system istio-ingressgateway-78b57d75d-kgfp2 | grep portal
10.16.0.20:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.20:5000        HEALTHY     OK                outbound_.5000_.stable_.portal.portal.svc.cluster.local
10.16.0.20:5000        HEALTHY     OK                outbound|5000|stable|portal.portal.svc.cluster.local
10.16.0.20:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound_.5000_.stable_.portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound|5000|stable|portal.portal.svc.cluster.local
10.16.0.30:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound|5000|master|portal.portal.svc.cluster.local
10.16.2.31:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound_.5000_._.portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound|5000|master|portal.portal.svc.cluster.local
10.16.4.27:5000        HEALTHY     OK                outbound|5000||portal.portal.svc.cluster.local
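
The comparison made by eye on the two listings above (which subset clusters actually have endpoints) can also be scripted. The following is a minimal sketch, assuming the outbound|port|subset|host cluster-name format shown in the listings; subset_counts is a hypothetical helper name, not part of istioctl.

```shell
# subset_counts HOST
# Reads `istioctl proxy-config endpoints` output on stdin and prints, for the
# given service host, how many endpoint rows each subset cluster has.
# Rows for the subset-less cluster are reported as "(no subset)".
subset_counts() {
  awk -F'|' -v host="$1" '
    NF == 4 {
      sub(/[ \t\r]+$/, "", $4)            # drop trailing whitespace
      if ($4 != host) next
      key = ($3 == "" ? "(no subset)" : $3)
      count[key]++
    }
    END { for (k in count) print k, count[k] }
  ' | LC_ALL=C sort
}

# Example (needs cluster access):
#   istioctl proxy-config endpoints -n istio-system istio-ingressgateway-78b57d75d-kgfp2 \
#     | subset_counts portal.portal.svc.cluster.local
```

Fed the first listing, this prints "(no subset) 4" and "master 2" with no stable line; fed the second, it also prints "stable 2". A subset that is defined in the destination rule but missing from the output is the symptom described in this post.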

The destination rule we are using to build the subsets:

---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: portal
  namespace: portal
spec:
  host: portal
  subsets:
    - name: stable
      labels:
        ver: stable
    - name: master
      labels:
        ver: master
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 1500ms

Thanks!

Kyle

Hi Kyle, can you report this in Istio issues so that I can cc some relevant folks?
When you report the issue, please specify your settings (using galley, how many other services/pods in the system, etc.). This issue in isolation does not seem reproducible, but I wager there are other environmental factors contributing to it.

Done! Filed as istio/istio issue #20367, “503s to one of two subsets despite all pods being healthy”.

I’m not quite sure what you mean by “using galley” – do you mean collect information using galley (if so, how do I do this?), or do you mean to ask whether we’re using galley at all?