istioctl proxy-status keeps reporting STALE across services in the mesh, causing sporadic 404s for requests through the ingressgateway’s blackhole:80 route

Hello,
We have been seeing an issue with the istioctl proxy-status in our cluster alternating between SYNCED and STALE state across services in the mesh since earlier today, resulting in sporadic 404s for calls through the ingressgateway.

When inspecting the routes on the ingressgateway, the configuration alternates between a single blackhole:80 route (which returns the 404s) and the full set of 191 valid routes.

We’re running Istio 1.1.3 on EKS, and currently have 4 replicas of Pilot. There are 100 services in the cluster, across 186 pods running on 22 worker nodes in AWS.

We are also trying to scale Pilot to see if there is any impact (CPU and memory are somewhat higher than baseline right now, but not at capacity yet).

While it appears to be an issue with the ingress gateway’s sidecar not being able to reach the Pilot discovery container consistently to resolve routes for the services, it is not clear why this would happen. We do see warnings like these in the ingress gateway sidecar container’s logs:

[2019-05-22 20:23:47.935][19][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 14, no healthy upstream
[2019-05-22 20:23:47.935][19][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:49] Unable to establish new stream
[2019-05-22 20:24:01.436][19][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13,
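
As a quick triage step (a sketch of our own, not part of istioctl), the gRPC status codes in those warnings can be tallied to see how often the config stream to Pilot is dropping. In the standard gRPC code table, 14 is UNAVAILABLE (matching “no healthy upstream”) and 13 is INTERNAL:

```python
import re
from collections import Counter

# Sample lines copied from the ingressgateway sidecar logs above
# (bazel paths abbreviated).
LOGS = """\
[2019-05-22 20:23:47.935][19][warning][config] [.../grpc_stream.h:86] gRPC config stream closed: 14, no healthy upstream
[2019-05-22 20:23:47.935][19][warning][config] [.../grpc_stream.h:49] Unable to establish new stream
[2019-05-22 20:24:01.436][19][warning][config] [.../grpc_stream.h:86] gRPC config stream closed: 13,
"""

# Standard gRPC status codes relevant here.
GRPC_CODES = {13: "INTERNAL", 14: "UNAVAILABLE"}

def tally_stream_closures(logs: str) -> Counter:
    """Count 'gRPC config stream closed: <code>' occurrences per status code."""
    counts = Counter()
    for match in re.finditer(r"gRPC config stream closed: (\d+)", logs):
        counts[int(match.group(1))] += 1
    return counts

if __name__ == "__main__":
    for code, n in sorted(tally_stream_closures(LOGS).items()):
        print(f"code {code} ({GRPC_CODES.get(code, 'UNKNOWN')}): {n}")
```

Running this against a longer window of sidecar logs (e.g. piped in from kubectl logs) shows whether the closures cluster around the times the routes flip to blackhole.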

Routes reported by ingressgateway when proxy-status is SYNCED…

istioctl proxy-config route istio-ingressgateway-747ff57cc5-lzd94 -n istio-system
NOTE: This output only contains routes loaded via RDS.
NAME        VIRTUAL HOSTS
http.80     191
            1

when proxy-status is STALE…

istioctl proxy-config route istio-ingressgateway-747ff57cc5-lzd94 -n istio-system
NOTE: This output only contains routes loaded via RDS.
NAME        VIRTUAL HOSTS
http.80     1
            1

istioctl proxy-config route istio-ingressgateway-747ff57cc5-lzd94 -n istio-system -o json

[
    {
        "name": "http.80",
        "virtualHosts": [
            {
                "name": "blackhole:80",
                "domains": [
                    "*"
                ],
                "routes": [
                    {
                        "match": {
                            "prefix": "/"
                        },
                        "directResponse": {
                            "status": 404
                        },
                        "perFilterConfig": {
                            "mixer": {
                                "disable_check_calls": true
                            }
                        }
                    }
                ]
            }
        ],
        "validateClusters": false
    },
    {
        "virtualHosts": [
            {
                "name": "backend",
                "domains": [
                    "*"
                ],
                "routes": [
                    {
                        "match": {
                            "prefix": "/stats/prometheus"
                        },
                        "route": {
                            "cluster": "prometheus_stats"
                        }
                    }
                ]
            }
        ]
    }
]
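
For anyone wanting to watch for this state automatically, here is a small sketch (our own helper, not an istioctl feature) that flags the blackholed condition from the JSON dump above: the gateway is serving 404s whenever the http.80 route’s only virtual host is blackhole:80.

```python
import json

def is_blackholed(route_dump: list, route_name: str = "http.80") -> bool:
    """Return True if the named RDS route has only the blackhole virtual host."""
    for route in route_dump:
        if route.get("name") == route_name:
            vhosts = [vh["name"] for vh in route.get("virtualHosts", [])]
            return vhosts == ["blackhole:80"]
    return False  # route not present at all -> not in the blackholed state

# Abbreviated version of the dump shown above.
DUMP = json.loads("""
[
  {"name": "http.80",
   "virtualHosts": [{"name": "blackhole:80", "domains": ["*"],
                     "routes": [{"match": {"prefix": "/"},
                                 "directResponse": {"status": 404}}]}],
   "validateClusters": false},
  {"virtualHosts": [{"name": "backend", "domains": ["*"],
                     "routes": [{"match": {"prefix": "/stats/prometheus"},
                                 "route": {"cluster": "prometheus_stats"}}]}]}
]
""")

print(is_blackholed(DUMP))  # True while the gateway is in the STALE state
```

Feeding it the output of istioctl proxy-config route <pod> -n istio-system -o json on a cron gives a crude alert for the flapping.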

We have also looked at the possibility of a bad route/ServiceEntry/VirtualService in the cluster, but nothing has jumped out yet.

Would appreciate any help/pointers to troubleshoot this issue further.

thanks,
Aish

I work with Aish,

The ingress gateways keep ping-ponging between being properly configured and being configured with only the blackhole route.

It almost seems load-related: it occurs more frequently when Pilot is consuming more CPU. I am speculating whether, under high load conditions, https://github.com/istio/istio/blob/790c86828a989f317ae7baed53340c900c9ea669/pilot/pkg/networking/core/v1alpha3/gateway.go#L278 evaluates to true, but I don’t have enough evidence at this point.

Just wanted to add that this issue looks quite similar to what we are observing. Ref: https://github.com/istio/istio/issues/13822. CC: @mandarjog

We went from 1.0.2 to 1.1.3 and are seeing these sync issues intermittently. The issue is definitely not just the tool misreporting the status, because our requests fail with 404s when this happens, causing a high error rate across services in the mesh. Pilot has also been working harder since the upgrade.

Thanks,
Aish

We were able to get past the proxy sync issues/404s by removing a couple of unused Gateway configurations that were applied to the mesh, both of which bound the same overlapping host entry ('*') to port 80.

1st gateway:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: istio-autogenerated-k8s-ingress
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP2

2nd gateway:

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: global-gateway
  namespace: default
spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP

Note that the protocols differed between the two (HTTP2 vs. HTTP), so we are wondering whether that is relevant in this particular case.
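
To catch overlaps like this in the future, here is a sketch (our own script, not an Istio tool) that scans Gateway specs, such as those from kubectl get gateways --all-namespaces -o json, for the same host bound to the same port by more than one Gateway. The sample data mirrors our two gateways:

```python
from collections import defaultdict

# Minimal representations of the two Gateway resources shown above.
GATEWAYS = [
    {"metadata": {"name": "istio-autogenerated-k8s-ingress",
                  "namespace": "istio-system"},
     "spec": {"servers": [{"hosts": ["*"],
                           "port": {"number": 80, "protocol": "HTTP2"}}]}},
    {"metadata": {"name": "global-gateway", "namespace": "default"},
     "spec": {"servers": [{"hosts": ["*"],
                           "port": {"number": 80, "protocol": "HTTP"}}]}},
]

def find_overlaps(gateways):
    """Map (host, port) -> list of 'namespace/name' Gateways binding it,
    keeping only combinations claimed by more than one Gateway."""
    bindings = defaultdict(list)
    for gw in gateways:
        ident = f"{gw['metadata']['namespace']}/{gw['metadata']['name']}"
        for server in gw["spec"].get("servers", []):
            port = server["port"]["number"]
            for host in server.get("hosts", []):
                bindings[(host, port)].append(ident)
    return {k: v for k, v in bindings.items() if len(v) > 1}

for (host, port), names in find_overlaps(GATEWAYS).items():
    print(f"host {host!r} on port {port} bound by: {', '.join(names)}")
```

A non-empty result points at exactly the kind of conflicting wildcard bindings we had to remove.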

Also, the behavior looks to be the same as what is described in this Istio PR - https://github.com/istio/istio/pull/14080 - which was fixed in 1.1.7. Would appreciate any thoughts from the Istio dev community on this.

thanks,
Aish