Too many HTTPS external services on the same port cause Envoy to drop connections

I’m using Istio 1.4.5 with outboundTrafficPolicy=REGISTRY_ONLY and several HTTPS external services exposed as ServiceEntries on the standard port 443.

e.g. an external service

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

All these services are then accessed via Envoy listener 0.0.0.0_443.

My understanding is that Pilot then pushes filter chain updates to Envoy for that listener in a non-deterministic order. I verified this by comparing two Envoy config dumps for the same proxy, where the only difference was the order of the external services in the filter chain.

This causes the listener to be drained and all of its connections to be dropped.

I’d like to know whether my understanding is correct and what the recommended way to overcome this problem is.

Is having a VirtualService for each ServiceEntry, routing traffic from a different port, the recommended approach?

e.g. routing the alternative port 7000 to the standard port 443:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
    - number: 7000
      name: https-alt
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  tls:
    - match:
        - port: 7000
          sniHosts:
            - example.net
      route:
        - destination:
            host: example.net
            port:
              number: 443

I ended up aggregating all the external services running on port 443 into a single ServiceEntry. It doesn’t look great, but I haven’t had any connections dropped (Envoy listeners drained) since.
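
For illustration, this is roughly what that aggregation looks like (the hostnames below are placeholders, not the ones I actually use):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-https
  namespace: external
spec:
  hosts:
    # all external HTTPS hosts collected into one entry, matched by SNI
    - example.net
    - api.provider-one.example
    - www.provider-two.example
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-external
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL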

Would anyone be able to provide additional input on the best way forward here?

@douglas-reid - tagging you as I saw you working on a few related cases, if you don’t mind having a look.

I found a discussion of what I believe is a related issue: https://github.com/istio/istio/issues/20477

The gist is that if detecting the protocol for a TLS request against all the registered ServiceEntries takes more than 100ms, the request fails with a 502 error. How long protocol detection takes seems to grow with the number of TLS ServiceEntries in play. I suspect that by aggregating all your hosts into a single ServiceEntry, you brought the evaluation time to under 100ms.

Setting global.proxy.protocolDetectionTimeout=0s is supposed to mitigate the issue, but there is some discussion in the linked issue suggesting that it doesn’t always work. You could also write an EnvoyFilter to patch Envoy’s listener configuration directly.
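
Something along these lines, for example (an untested sketch; the name, match, and scoping are assumptions you’d need to verify against your Istio and Envoy versions):

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: disable-protocol-detection-timeout
  namespace: istio-system
spec:
  configPatches:
    - applyTo: LISTENER
      match:
        context: SIDECAR_OUTBOUND
        listener:
          portNumber: 443
      patch:
        operation: MERGE
        value:
          # 0s disables the listener filter (TLS/protocol inspection) timeout
          listener_filters_timeout: 0s
          continue_on_listener_filters_timeout: true

Placing it in the root namespace (istio-system by default) should apply it mesh-wide, if I understand the EnvoyFilter semantics correctly; a workloadSelector could narrow it down.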

Alternatively, I found that setting spec.ports[].protocol: TCP on the ServiceEntry also fixes the issue, but that effectively allows any HTTPS traffic out of your cluster, which defeats the purpose of REGISTRY_ONLY.
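
i.e. something like this (just a sketch of that protocol change, reusing the simulator example from above):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: tcp-simulator
      # TCP instead of TLS: no SNI inspection is performed for this entry
      protocol: TCP
  resolution: DNS
  location: MESH_EXTERNAL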

Thanks for your response. I’m already running with protocolDetectionTimeout=0s.

The problem is actually that when Pilot pushes xDS configuration to Envoy, it replaces the 0.0.0.0_443 listener even though there were no changes. Somewhat related to https://github.com/istio/istio/issues/11971.

From the Envoy logs:

[external/envoy/source/server/listener_impl.cc:273] warm complete. updating active listener: name=0.0.0.0_443, hash=6539101047026873500, address=0.0.0.0:443
[external/envoy/source/server/listener_impl.cc:273] draining listener: name=0.0.0.0_443, hash=15072743390866864361, address=0.0.0.0:443
[external/envoy/source/server/lds_api.cc:63] lds: add/update listener '0.0.0.0_443'

So Envoy sees a different hash, which is why it creates a new listener. I can’t understand why it would generate a new hash.

I have seen a few other issues where LDS updates were not sorted beforehand, causing the listeners to be drained:

RBAC: https://github.com/istio/istio/issues/17347
Consul: https://github.com/istio/istio/issues/11971

I’m wondering if there is Kubernetes-specific code that doesn’t sort the filters (hosts) generated from the ServiceEntries?

We are running into the same issue: multiple HTTPS ServiceEntries causing redundant, frequent listener updates for 0.0.0.0_443. The only change in the configuration is the ordering of server names, which should not cause a listener update, IMO.

Any guidance on a solution would be greatly appreciated. The above-mentioned solutions seem like workarounds or aren’t applicable in our current environment.

There’s a change in Istio 1.6 that seems to address this issue.

I haven’t upgraded Istio to that version yet; please let us know if you have.

And all the fixes suggested here are indeed workarounds, as there is no proper fix available.