Egress Gateway 404s traffic after initial start under load for some pods, eventually returns 200

Hello all, I have a slightly curious question about using the egress gateway in Google Cloud. We recently configured an egress gateway with one custom Lua filter to capture external traffic to a key dependency, and things seemed to work well in our dev and staging environments. Unfortunately, we hadn't load tested well before going to prod, and there a lot of traffic came back as 404.

The nice thing is that the filter we used worked, so we could see that a few pods 404'ed traffic for up to 45 minutes before eventually returning to normal, while other pods worked flawlessly during that time. The only explanation I can think of is that the load triggered a scaling event, and a pod or two failed to load and apply all of the Gateway, VirtualService, and DestinationRule resources we had set. Then, when the load dropped, the rules were applied and life was back to normal.

Does anyone have any experience with this? For reference, we're using 1.8.2 and installing from a Helm chart instead of using the Google Cloud Istio switch.

Our setup looks like the following:

kind: Gateway
apiVersion: networking.istio.io/v1alpha3
metadata:
  creationTimestamp: '2021-04-20T19:59:45Z'
  generation: 22
  name: egress-gateway
  namespace: istio-gateway
  resourceVersion: '235919503'
  selfLink: >-
    /apis/networking.istio.io/v1alpha3/namespaces/istio-gateway/gateways/egress-gateway
  uid: 2bde9a47-67f2-4ba1-97e7-b2d7f27f0daa
spec:
  selector:
    istio: egressgateway
  servers:
    - hosts:
        - edition.cnn.com
      port:
        name: https-port-for-tls-origination
        number: 443
        protocol: HTTPS
      tls:
        mode: ISTIO_MUTUAL
---
kind: VirtualService
apiVersion: networking.istio.io/v1alpha3
metadata:
  generation: 5
  name: cnn-egress-routing-ee9wrx7a
  namespace: istio-gateway
spec:
  gateways:
    - egress-gateway
    - mesh
  hosts:
    - edition.cnn.com
  http:
    - match:
        - gateways:
            - mesh
          port: 80
      route:
        - destination:
            host: istio-egressgateway.istio-gateway.svc.cluster.local
            port:
              number: 443
            subset: cnn-internal-to-gateway
          weight: 100
    - match:
        - gateways:
            - egress-gateway
          port: 443
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: '5xx,reset'
      route:
        - destination:
            host: edition.cnn.com
            port:
              number: 443
            subset: cnn-gateway-to-endpoint
          weight: 100
  tls:
    - match:
        - gateways:
            - mesh
          port: 443
          sniHosts:
            - edition.cnn.com
      route:
        - destination:
            host: edition.cnn.com
            port:
              number: 443
            subset: cnn-direct
          weight: 100
---
kind: DestinationRule
apiVersion: networking.istio.io/v1alpha3
metadata:
  name: egress-gateway-dr-haaktks3
  namespace: istio-gateway
spec:
  host: istio-egressgateway.istio-gateway.svc.cluster.local
  subsets:
    - name: cnn-internal-to-gateway
      trafficPolicy:
        loadBalancer:
          simple: ROUND_ROBIN
        portLevelSettings:
          - port:
              number: 443
            tls:
              mode: ISTIO_MUTUAL
              sni: edition.cnn.com
---
kind: DestinationRule
apiVersion: networking.istio.io/v1alpha3
metadata:
  creationTimestamp: '2021-05-14T14:03:53Z'
  generation: 3
  name: cnn-api-dr-yuyf5amk
  namespace: istio-gateway
spec:
  host: edition.cnn.com
  subsets:
    - name: cnn-gateway-to-endpoint
      trafficPolicy:
        loadBalancer:
          simple: ROUND_ROBIN
        portLevelSettings:
          - port:
              number: 443
            tls:
              mode: SIMPLE
              sni: edition.cnn.com
    - name: cnn-direct
      trafficPolicy:
        loadBalancer:
          simple: ROUND_ROBIN
        portLevelSettings:
          - port:
              number: 443
            tls:
              mode: DISABLE
              sni: edition.cnn.com
---
kind: ServiceEntry
apiVersion: networking.istio.io/v1alpha3
metadata:
  creationTimestamp: '2021-05-14T14:03:52Z'
  name: cnn-api-endpoint-oldkmufs
  namespace: istio-gateway
spec:
  hosts:
    - edition.cnn.com
  ports:
    - name: cnn-https
      number: 443
      protocol: HTTPS
    - name: cnn-http
      number: 80
      protocol: HTTP
  resolution: DNS

One more interesting thing: I noticed that an example egress gateway pod disconnects from and reconnects to the Istio service, and almost immediately afterwards we get this log line in istiod:
warn constructed http route config for route https.443.https-port-for-tls-origination.egress-gateway.istio-gateway on port 443 with no vhosts; Setting up a default 404 vhost
After about 33 minutes, the gateway reconnects to the Istio service again and all is well, with no such message in the istiod logs.
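In case it helps anyone chasing the same symptom, here is a rough sketch of how one could scan saved istiod log lines for that warning and pull out the affected route and port. The function name `find_default_404_routes` is made up for illustration, and the pattern only assumes the message wording shown in the log line above; other Istio versions may word it differently.

```python
import re

# Matches the istiod warning quoted above. The wording is taken from our
# 1.8.2 logs and is an assumption for any other version.
NO_VHOSTS = re.compile(
    r"constructed http route config for route (?P<route>\S+) "
    r"on port (?P<port>\d+) with no vhosts"
)


def find_default_404_routes(log_lines):
    """Return (route_name, port) for every 'no vhosts' warning found."""
    hits = []
    for line in log_lines:
        match = NO_VHOSTS.search(line)
        if match:
            hits.append((match.group("route"), int(match.group("port"))))
    return hits


if __name__ == "__main__":
    sample = [
        "warn constructed http route config for route "
        "https.443.https-port-for-tls-origination.egress-gateway.istio-gateway "
        "on port 443 with no vhosts; Setting up a default 404 vhost",
    ]
    for route, port in find_default_404_routes(sample):
        print(f"default 404 vhost on port {port}: {route}")
```

Any route that shows up here is one where the gateway had a listener but no VirtualService hosts bound to it when the config was built.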

And I think I've solved my own problem here. Looking at the creation timestamps for our actual service, the Istio VirtualService was created before the Gateway object. We know from testing that if there's no Gateway, we get a guaranteed 404. This happened because we're using Pulumi, a tool similar to Terraform, where multiple resources can easily end up being created out of order.
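The timestamp comparison can be sketched roughly like this. It is only an illustration: the helper names are made up, the manifests are assumed to be already loaded as plain dicts (for example from `kubectl get ... -o json`), gateway references are assumed to be short names in the same namespace, and the VirtualService timestamp in the sample is hypothetical rather than our real one.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a Kubernetes creationTimestamp such as '2021-04-20T19:59:45Z'."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def created_before_gateway(virtual_services, gateways):
    """Return (vs_name, gateway_name) pairs where the VirtualService was
    created before the Gateway it references, or the Gateway is missing."""
    gw_created = {
        gw["metadata"]["name"]: parse_ts(gw["metadata"]["creationTimestamp"])
        for gw in gateways
    }
    out_of_order = []
    for vs in virtual_services:
        vs_created = parse_ts(vs["metadata"]["creationTimestamp"])
        for gw_name in vs["spec"].get("gateways", []):
            if gw_name == "mesh":  # reserved name for sidecars, not a Gateway object
                continue
            if gw_name not in gw_created or vs_created < gw_created[gw_name]:
                out_of_order.append((vs["metadata"]["name"], gw_name))
    return out_of_order


if __name__ == "__main__":
    gateways = [{"metadata": {"name": "egress-gateway",
                              "creationTimestamp": "2021-04-20T19:59:45Z"}}]
    virtual_services = [{
        "metadata": {"name": "cnn-egress-routing-ee9wrx7a",
                     # hypothetical timestamp, for illustration only
                     "creationTimestamp": "2021-04-20T19:59:40Z"},
        "spec": {"gateways": ["egress-gateway", "mesh"]},
    }]
    print(created_before_gateway(virtual_services, gateways))
```

On the Pulumi side, one way to force the order is an explicit `dependsOn` resource option on the VirtualService pointing at the Gateway, so the Gateway always exists first.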

Good lesson, folks: double-check resource ordering if you use infrastructure as code at all.