We ran into a problem with our cluster running Istio 1.4.2 on GKE.
Setup:
- Istio 1.4.2
- Two subsets (stable and master).
- One k8s service.
- One virtual service: requests carrying a cookie are routed to one subset (master); requests without the cookie go to the other (stable).
- Two pods in each subset.
- Two pilots running.
Symptom: all requests to stable receive a 503 with response_flags UH. Prometheus stats (envoy_cluster_membership_healthy) show no entries active in the cluster corresponding to subset stable. The cluster corresponding to the entire service shows 4 entries. EDS retrieved from istioctl for one of the ingresses shows no rows for stable, but rows for master.
Recovery: Restarting one of the two running pilots resolved the problem. We first tried restarting one of the ingress gateways, which did not fix the problem. The pilot I restarted had more in the logs (various EDS/ADS push events) than the other, which is why I chose it.
I’m interested in understanding what could have gone wrong here, whether this is a bug, and how to prevent it in the future. Any thoughts or suggestions are greatly appreciated.
Relevant Information:
Timeline:
- At around 16:24 GMT the node which was running both the pilot I restarted and the two stable pods rebooted.
- When the node came back up the stable pods were started at around 16:25.
- From the time the node rebooted until 16:44, both ingress gateways reported the stable cluster as having no members. This is quite a bit longer than expected. The pods were definitely up during this time, as the cluster containing all pods for the service showed 4 members in Prometheus.
- From 16:44 onwards, one of the two ingress gateways was reporting the stable cluster as being down. Restarting that gateway fixed the problem. This may or may not be related.
- At about 17:30 GMT, one of the ingress gateways reported the stable cluster as having no members in the envoy_cluster_membership_healthy stat.
- At about 17:36 GMT the other ingress gateway reported the same.
- The system stayed in this state until I restarted pilot.
Endpoints before restarting pilot:
istioctl proxy-config endpoints -n istio-system istio-ingressgateway-78b57d75d-kgfp2 | grep portal
10.16.0.20:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.20:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound|5000|master|portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound|5000|master|portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
Endpoints after restarting pilot. Note how stable is now present, but we still have the same set of IPs:
istioctl proxy-config endpoints -n istio-system istio-ingressgateway-78b57d75d-kgfp2 | grep portal
10.16.0.20:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.20:5000 HEALTHY OK outbound_.5000_.stable_.portal.portal.svc.cluster.local
10.16.0.20:5000 HEALTHY OK outbound|5000|stable|portal.portal.svc.cluster.local
10.16.0.20:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound_.5000_.stable_.portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound|5000|stable|portal.portal.svc.cluster.local
10.16.0.30:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound|5000|master|portal.portal.svc.cluster.local
10.16.2.31:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound_.5000_._.portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound_.5000_.master_.portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound|5000|master|portal.portal.svc.cluster.local
10.16.4.27:5000 HEALTHY OK outbound|5000||portal.portal.svc.cluster.local
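The two listings above differ only in which subset clusters have rows, so the comparison can be automated. The following is a minimal sketch, not part of our actual tooling: it assumes the `istioctl proxy-config endpoints` output has the column layout shown above (address, status, outlier check, cluster name) and that subset clusters are named `outbound|<port>|<subset>|<host>`. The function names `endpoints_by_subset` and `missing_subsets` are hypothetical.

```python
import re
from collections import defaultdict

# Matches pipe-delimited cluster names such as
# outbound|5000|stable|portal.portal.svc.cluster.local
# The underscore-delimited variants in the listing are ignored,
# since they repeat the same information.
CLUSTER_RE = re.compile(r"^outbound\|(\d+)\|([^|]*)\|(.+)$")


def endpoints_by_subset(output):
    """Group HEALTHY endpoint addresses by subset name.

    The empty string "" is the key for the whole-service cluster
    (the one with no subset in its name).
    """
    subsets = defaultdict(set)
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank line or something that is not an endpoint row
        address, status, cluster = fields[0], fields[1], fields[-1]
        match = CLUSTER_RE.match(cluster)
        if match and status == "HEALTHY":
            subsets[match.group(2)].add(address)
    return dict(subsets)


def missing_subsets(output, expected):
    """Return the expected subset names that have no healthy endpoints."""
    found = endpoints_by_subset(output)
    return [name for name in expected if not found.get(name)]
```

Fed the text of the first listing with `expected=["stable", "master"]`, `missing_subsets` would flag `stable`, while the whole-service entry (key `""`) still contains all four addresses. Running something like this periodically against each ingress gateway could serve as an alert for the state we ended up in.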
The destination rule we are using to build the subsets:
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: portal
  namespace: portal
spec:
  host: portal
  subsets:
  - name: stable
    labels:
      ver: stable
  - name: master
    labels:
      ver: master
  trafficPolicy:
    connectionPool:
      tcp:
        connectTimeout: 1500ms
Thanks!
Kyle