All of these services are then accessed via the 0.0.0.0_443 Envoy listener.
My understanding is that Pilot pushes filter chain updates to Envoy for that listener in a non-deterministic order. I verified this by comparing two Envoy config dumps for the same proxy, where the only difference was the order of the external services in the listener's filter chain.
Each such update causes the listener to be drained and all of its connections to be dropped.
I’d like to know if my understanding is correct and what’s the recommended way to overcome this problem.
Is having a VirtualService for each ServiceEntry, routing traffic from a different port, the recommended approach? e.g. routing an alternative port such as 7000 to the standard 443, as in the sketch below.
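To illustrate what I mean, here is a rough sketch only; the host api.example.com and port 7000 are placeholders, not my real config:

```yaml
# Sketch: register the external host on an alternative port and
# route it to the real 443 with a TLS VirtualService match.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-api
spec:
  hosts:
  - api.example.com          # placeholder host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 7000             # alternative port applications would call
    name: tls-alt
    protocol: TLS
  - number: 443              # actual port the external service listens on
    name: tls
    protocol: TLS
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: external-api-port-rewrite
spec:
  hosts:
  - api.example.com
  tls:
  - match:
    - port: 7000
      sniHosts:
      - api.example.com
    route:
    - destination:
        host: api.example.com
        port:
          number: 443        # rewrite to the standard HTTPS port
```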
I ended up aggregating all external services running on port 443 into a single ServiceEntry. It doesn't look great, but I haven't had any connections dropped (Envoy listeners drained) since.
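For anyone else hitting this, a trimmed-down version of what the aggregated entry looks like (the hosts below are placeholders):

```yaml
# Sketch: one ServiceEntry carrying every external HTTPS host,
# so the 0.0.0.0_443 listener is built from a single resource.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-https-services
spec:
  hosts:
  - api.example.com          # placeholder hosts
  - auth.example.com
  - storage.example.com
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tls
    protocol: TLS
```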
Would anyone be able to provide additional input on the best way forward here?
@douglas-reid - tagging you as I saw you working on a few related cases, if you don't mind having a look.
The gist is that if detecting the protocol for a TLS request against all the registered ServiceEntries takes more than 100ms, the request fails with a 502 error. How long protocol detection takes seems to grow with the number of TLS ServiceEntries in play. I suspect that by aggregating all your hosts into a single ServiceEntry, you brought the evaluation time to under 100ms.
Setting global.proxy.protocolDetectionTimeout=0s is supposed to mitigate the issue, but there is some discussion in the linked issue suggesting that it doesn't work. You could also write an EnvoyFilter to patch Envoy's configuration directly.
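Depending on how you install Istio, the same setting can also be expressed through the mesh config; a sketch of the IstioOperator form (resource name is arbitrary):

```yaml
# Sketch: protocolDetectionTimeout in meshConfig corresponds to the
# global.proxy.protocolDetectionTimeout Helm value.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-controlplane
spec:
  meshConfig:
    protocolDetectionTimeout: 0s   # disable the protocol sniffing timeout
```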
Alternatively, I found that setting the ServiceEntry spec.ports.protocol to TCP also fixes the issue, but that effectively allows any HTTPS traffic out of your cluster, which defeats the purpose of REGISTRY_ONLY.
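In other words, something like the following (host is a placeholder), at the cost of losing SNI-based enforcement since the traffic is treated as opaque TCP:

```yaml
# Sketch: marking port 443 as TCP skips protocol/SNI inspection entirely.
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: external-api-tcp
spec:
  hosts:
  - api.example.com          # placeholder host
  location: MESH_EXTERNAL
  resolution: DNS
  ports:
  - number: 443
    name: tcp-https          # TCP instead of TLS/HTTPS
    protocol: TCP
```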
Thanks for your response. I’m already running with protocolDetectionTimeout=0s.
The problem is actually that when Pilot pushes xDS configuration to Envoy, it replaces the 0.0.0.0_443 listener even though nothing has changed. This is somewhat related to https://github.com/istio/istio/issues/11971.
We are running into the same issue: multiple HTTPS ServiceEntries cause redundant, frequent listener updates for 0.0.0.0_443. The only change in the configuration is the ordering of server names, which should not trigger a listener update, IMO.
Any guidance on a solution would be greatly appreciated. The above-mentioned solutions seem like workarounds or aren't applicable in our current environment.