Too many HTTPS external services on same port cause Envoy to drop connections

I’m using Istio 1.4.5 with outboundTrafficPolicy=REGISTRY_ONLY and several HTTPS external services exposed as ServiceEntry on standard port 443.

e.g. an external service

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

All these services are then accessed via Envoy listener 0.0.0.0_443.

My understanding is that Pilot then pushes filter chain updates to Envoy for that listener in a non-deterministic order. I verified this by comparing two Envoy config dumps for the same proxy, where the only difference was the order of the external services in the filter chain.

Each such update causes the listener to be drained and all of its connections to be dropped.

I’d like to know if my understanding is correct and what’s the recommended way to overcome this problem.

Is having a VirtualService for each ServiceEntry, routing traffic from a different port, the recommended approach?

e.g. routing alternative port 7000 to standard port 443:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-simulator
      protocol: TLS
    - number: 7000
      name: https-alt
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  tls:
    - match:
        - port: 7000
          sniHosts:
            - example.net
      route:
        - destination:
            host: example.net
            port:
              number: 443

I ended up aggregating all external services running on port 443 into a single ServiceEntry. It doesn’t look great, but I haven’t had any connections dropped (Envoy listeners drained) since.
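For reference, a single ServiceEntry can list multiple hosts, so the aggregated workaround looks roughly like this (the host names and resource name here are placeholders, not my actual services):

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-https
  namespace: external
spec:
  hosts:
    - example.net
    - other-service.example.com
  exportTo:
    - "*"
  ports:
    - number: 443
      name: https-external
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

The downside is that every external host now lives in one shared resource, so unrelated teams end up editing the same object.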

Would anyone be able to provide additional input on what’s the best way moving forward here?

@douglas-reid - tagging you as saw you working on a few cases related if you don’t mind having a look.

I found a discussion of what I believe is a related issue: https://github.com/istio/istio/issues/20477

The gist is that if detecting the protocol for a TLS request against all the registered ServiceEntries takes more than 100ms, the request fails with a 502 error. How long protocol detection takes seems to grow with the number of TLS ServiceEntries in play. I suspect that by aggregating all your hosts into a single ServiceEntry, you brought the evaluation time to under 100ms.

Setting global.proxy.protocolDetectionTimeout=0s is supposed to mitigate the issue, but there is some discussion in the linked issue suggesting that it doesn’t work. As a workaround, you could write an EnvoyFilter to patch Envoy’s configuration directly.
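For completeness, an EnvoyFilter along these lines should disable the listener-filter (protocol detection) timeout on outbound sidecar listeners by merging the Envoy Listener fields listener_filters_timeout and continue_on_listener_filters_timeout. I haven’t verified this exact patch on 1.4.5, so treat it as a sketch:

apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: disable-listener-filter-timeout
  namespace: istio-system
spec:
  configPatches:
    - applyTo: LISTENER
      match:
        context: SIDECAR_OUTBOUND
      patch:
        operation: MERGE
        value:
          # 0s disables the timeout entirely
          listener_filters_timeout: 0s
          continue_on_listener_filters_timeout: true

Applying it in istio-system (with no workloadSelector) makes it mesh-wide; scope it to a namespace or workload if you want to test it incrementally.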

Alternatively, I found that setting protocol: TCP on the ServiceEntry port (spec.ports[].protocol) also fixes the issue, but that effectively allows any HTTPS traffic out of your cluster, which defeats the purpose of REGISTRY_ONLY.
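The TCP variant is just the original ServiceEntry with the port protocol (and port name prefix) changed, e.g.:

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: simulator
  namespace: external
spec:
  hosts:
    - example.net
  exportTo:
    - "*"
  ports:
    - number: 443
      name: tcp-simulator
      protocol: TCP
  resolution: DNS
  location: MESH_EXTERNAL

With protocol: TCP, Envoy no longer needs SNI-based filter chain matching on the shared listener, which is why it sidesteps the problem, but it also means traffic is no longer restricted to the listed hosts.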