Too many HTTPS external services on same port cause Envoy to drop connections

I’m using Istio 1.4.5 with outboundTrafficPolicy=REGISTRY_ONLY and several HTTPS external services exposed as ServiceEntry on standard port 443.

e.g. an external service

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

All these services are then accessed via Envoy listener 0.0.0.0_443.

My understanding is that Pilot then pushes filter chain updates to Envoy for that listener in a non-deterministic order. I verified this by comparing two Envoy config dumps for the same proxy, where the only difference was the order of the external services in the filter chain.

Each such update causes the listener to be drained and all of its connections to be dropped.

I’d like to know if my understanding is correct and what’s the recommended way to overcome this problem.

Is having a VirtualService for each ServiceEntry, routing traffic from a different port, the recommended approach?

e.g. routing alternative port 7000 to standard port 443:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
    - number: 7000
      name: https-alt
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  tls:
    - match:
        - port: 7000
          sniHosts:
            - example.net
      route:
        - destination:
            host: example.net
            port:
              number: 443

I ended up aggregating all external services running on port 443 into a single ServiceEntry. It doesn’t look great, but I haven’t had any connections dropped (Envoy listeners drained) since.
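For reference, a single ServiceEntry can list multiple hosts, so the aggregated workaround looks roughly like this (the host names and resource name here are placeholders, not my actual services):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-https
  namespace: external
spec:
  hosts:
    - example.net
    - other-service.example.com
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-external
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

The downside is that every external host now lives in one shared resource, so unrelated teams end up editing the same object.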

Would anyone be able to provide additional input on what’s the best way moving forward here?

@douglas-reid - tagging you as saw you working on a few cases related if you don’t mind having a look.

I found a discussion of what I believe is a related issue: https://github.com/istio/istio/issues/20477

The gist is that if detecting the protocol for a TLS request against all the registered ServiceEntries takes more than 100ms, the request fails with a 502 error. How long protocol detection takes seems to grow with the number of TLS ServiceEntries in play. I suspect that by aggregating all your hosts into a single ServiceEntry, you brought the evaluation time to under 100ms.

Setting global.proxy.protocolDetectionTimeout=0s is supposed to mitigate the issue, but there is some discussion in the linked issue suggesting that it doesn’t work. As a workaround, you could write an EnvoyFilter to patch Envoy’s configuration directly.
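For completeness, an EnvoyFilter along these lines should disable the listener-filter (protocol detection) timeout on outbound sidecar listeners by merging the Envoy Listener fields listener_filters_timeout and continue_on_listener_filters_timeout. I haven’t verified this exact patch on 1.4.5, so treat it as a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: disable-listener-filter-timeout
  namespace: istio-system
spec:
  configPatches:
    - applyTo: LISTENER
      match:
        context: SIDECAR_OUTBOUND
      patch:
        operation: MERGE
        value:
          # 0s disables the timeout entirely
          listener_filters_timeout: 0s
          continue_on_listener_filters_timeout: true

Applying it in istio-system (with no workloadSelector) makes it mesh-wide; scope it to a namespace or workload if you want to test it incrementally.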

Alternatively, I found that setting protocol: TCP on the ServiceEntry port (spec.ports[].protocol) also fixes the issue, but that effectively allows any HTTPS traffic out of your cluster, which defeats the purpose of REGISTRY_ONLY.
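The TCP variant is just the original ServiceEntry with the port protocol (and port name prefix) changed, e.g.:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: tcp-simulator
      protocol: TCP
  resolution: DNS
  location: MESH_EXTERNAL

With protocol: TCP, Envoy no longer needs SNI-based filter chain matching on the shared listener, which is why it sidesteps the problem, but it also means traffic is no longer restricted to the listed hosts.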