Ingress gateway SNI proxy sending RST after Client Hello

We’re attempting to set up multicluster (replicated control plane), but are hitting an issue with the SNI proxy (tcp/15443) in the ingress gateway.

The client istio-proxy connects to the ingress and sends a TLS Client Hello (with SNI), and the ingress ACKs the Client Hello. But instead of then sending back a Server Hello with a certificate, the ingress issues a TCP RST. The client proxy retries twice more before failing and returning a 503.
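
For what it’s worth, the failure should be reproducible from outside the mesh with something along these lines (node IP/NodePort are ours, and the SNI value is the one seen in the packet capture further down):

# hand-craft a Client Hello with the mesh-style SNI against the NodePort
openssl s_client -connect 172.25.22.20:30163 -servername outbound_.8000_._.httpbin.bar.global </dev/null

If the gateway is resetting on that SNI, this should also die before a Server Hello arrives.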

k8s: v1.18.3
Istio: v1.7.0
Rancher: v2.4.5

We’re following https://istio.io/latest/docs/setup/install/multicluster/gateways (including the sample cacerts). Manifests were installed using istioctl against values-istio-multicluster-gateways.yaml.
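
For reference, the install itself was just the stock command; the path to the values file below is simply where it sits in our copy of the release, so treat it as a placeholder:

istioctl install -f <istio-release-dir>/values-istio-multicluster-gateways.yaml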

External access is via NodePort
Port: tls 15443/TCP
TargetPort: 15443/TCP
NodePort: tls 30163/TCP
Endpoints: 10.42.0.72:15443

Ingress appears to be pulling CA certificates from SDS correctly (and I confirmed the data decodes back to the sample cacerts):
2020-09-05T01:18:57.866787Z info GET https://10.43.0.1:443/api/v1/namespaces/istio-system/secrets?limit=500&resourceVersion=0 200 OK in 4 milliseconds
2020-09-05T01:18:57.866803Z info Response Headers:
2020-09-05T01:18:57.866806Z info Audit-Id: 677db825-6cd6-4e9b-83f5-e7a4b4d4312f
2020-09-05T01:18:57.866809Z info Cache-Control: no-cache, private
2020-09-05T01:18:57.866811Z info Content-Type: application/json
2020-09-05T01:18:57.866813Z info Date: Sat, 05 Sep 2020 01:18:57 GMT
2020-09-05T01:18:57.867035Z info Response Body: {"kind":"SecretList",

2020-09-05T01:18:57.868317Z warn secretfetcher failed load server cert/key pair from secret cacerts: server cert or private key is empty
2020-09-05T01:18:57.868896Z info GET https://10.43.0.1:443/api/v1/namespaces/istio-system/secrets?allowWatchBookmarks=true&resourceVersion=2497786&timeout=5m19s&timeoutSeconds=319&watch=true 200 OK in 0 milliseconds
2020-09-05T01:18:57.868905Z info Response Headers:
2020-09-05T01:18:57.868908Z info Cache-Control: no-cache, private
2020-09-05T01:18:57.868910Z info Content-Type: application/json
2020-09-05T01:18:57.868912Z info Date: Sat, 05 Sep 2020 01:18:57 GMT
2020-09-05T01:18:57.961725Z debug caches populated
2020-09-05T01:18:57.962040Z info sds SDS gRPC server for workload UDS starts, listening on "./etc/istio/proxy/SDS"
2020-09-05T01:18:57.962120Z info sds SDS gRPC server for gateway controller starts, listening on "./var/run/ingress_gateway/sds"
2020-09-05T01:18:57.962162Z info Starting proxy agent
2020-09-05T01:18:57.962127Z info sds Start SDS grpc server
2020-09-05T01:18:57.962202Z info Opening status port 15020
2020-09-05T01:18:57.962294Z info Received new config, creating new Envoy epoch 0
2020-09-05T01:18:57.962351Z info Epoch 0 starting
2020-09-05T01:18:57.962367Z info sds Start SDS grpc server for ingress gateway proxy
2020-09-05T01:18:57.967501Z info Envoy command: [-c etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45 --parent-shutdown-time-s 60 --service-cluster istio-ingressgateway --service-node router~10.42.0.72~istio-ingressgateway-85d8dd7994-28gbc.istio-system~istio-system.svc.cluster.local --local-address-ip-version v4 --log-format-prefix-with-location 0 --log-format %Y-%m-%dT%T.%fZ %l envoy %n %v -l warning --component-log-level misc:error]
2020-09-05T01:18:58.004271Z warning envoy runtime Unable to use runtime singleton for feature envoy.reloadable_features.activate_fds_next_event_loop
2020-09-05T01:18:58.031737Z warning envoy config StreamAggregatedResources gRPC config stream closed: 14, no healthy upstream
2020-09-05T01:18:58.031767Z warning envoy config Unable to establish new stream
2020-09-05T01:18:58.039132Z info sds resource:default new connection
2020-09-05T01:18:58.039201Z info sds Skipping waiting for gateway secret
2020-09-05T01:18:58.219285Z info cache Root cert has changed, start rotating root cert for SDS clients
2020-09-05T01:18:58.219308Z info cache GenerateSecret default
2020-09-05T01:18:58.219764Z info sds resource:default pushed key/cert pair to proxy
2020-09-05T01:18:58.221865Z warning envoy main there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
2020-09-05T01:18:58.604333Z info sds resource:ROOTCA new connection
2020-09-05T01:18:58.604417Z info sds Skipping waiting for gateway secret
2020-09-05T01:18:58.604442Z info cache Loaded root cert from certificate ROOTCA
2020-09-05T01:18:58.604601Z info sds resource:ROOTCA pushed root cert to proxy
2020-09-05T01:18:59.039710Z info Envoy proxy is ready
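
In case it helps, a quick way to double-check what is actually in the cacerts secret is something like the following (assuming the standard plug-in CA key names):

# decode the plugged-in CA cert and compare subject/issuer against the sample certs
kubectl -n istio-system get secret cacerts -o jsonpath='{.data.ca-cert\.pem}' | base64 -d | openssl x509 -noout -subject -issuer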

The client proxy logs the 503 for the request routed to one of the cluster nodes (172.25.22.20):
[2020-09-05T01:54:45.307Z] "HEAD /headers HTTP/1.1" 503 UF,URX "-" "-" 0 0 18 - "-" "curl/7.69.1" "3f4dd68f-c346-4ade-a69f-7a7771a5d7f8" "httpbin.bar.global:8000" "172.25.22.20:30163" outbound|8000||httpbin.bar.global - 240.0.0.2:8000 10.42.2.52:59116 - default
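
The endpoint the client resolves for the .global host can be confirmed with something like the following (pod name is a placeholder); it should list 172.25.22.20:30163, matching the access log above:

istioctl proxy-config endpoints <client-pod>.<namespace> --cluster "outbound|8000||httpbin.bar.global"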

The ingress shows the connection arriving, but logs it with the NR (no route configured) response flag:
2020-09-05T01:54:43.037907Z trace envoy connection [C1096] write ready
[2020-09-05T01:54:45.309Z] "- - -" 0 NR "-" "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.42.0.72:15443 172.25.22.20:40152 - -
[2020-09-05T01:54:45.322Z] "- - -" 0 NR "-" "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.42.0.72:15443 172.25.22.20:11885 - -
[2020-09-05T01:54:45.325Z] "- - -" 0 NR "-" "-" 0 0 0 - "-" "-" "-" "-" "-" - - 10.42.0.72:15443 172.25.22.20:58423 - -
2020-09-05T01:54:53.038229Z trace envoy http [C4] message complete

Wireshark shows the SNI being set as outbound_.8000_._.httpbin.bar.global
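
In case anyone wants to reproduce the capture, something along these lines on the node carrying the ingress works, with the pcap then opened in Wireshark:

sudo tcpdump -i any -nn 'tcp port 15443 or tcp port 30163' -w sni-15443.pcap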

The ingress shows what look like the correct clusters:
SERVICE FQDN PORT SUBSET DIRECTION TYPE DESTINATION RULE
BlackHoleCluster - - - STATIC
agent - - - STATIC
default-http-backend.ingress-nginx.svc.cluster.local 80 - outbound EDS
httpbin.bar.svc.cluster.local 8000 - outbound EDS
httpbin.unite-core-stage1.svc.cluster.local 8000 - outbound EDS
istio-egressgateway.istio-system.svc.cluster.local 80 - outbound EDS
istio-egressgateway.istio-system.svc.cluster.local 443 - outbound EDS
istio-egressgateway.istio-system.svc.cluster.local 15443 - outbound EDS
istio-ingressgateway.istio-system.svc.cluster.local 80 - outbound EDS
istio-ingressgateway.istio-system.svc.cluster.local 443 - outbound EDS
istio-ingressgateway.istio-system.svc.cluster.local 15021 - outbound EDS
istio-ingressgateway.istio-system.svc.cluster.local 15443 - outbound EDS
istiocoredns.istio-system.svc.cluster.local 53 - outbound EDS
istiod.istio-system.svc.cluster.local 443 - outbound EDS
istiod.istio-system.svc.cluster.local 853 - outbound EDS
istiod.istio-system.svc.cluster.local 15010 - outbound EDS
istiod.istio-system.svc.cluster.local 15012 - outbound EDS
istiod.istio-system.svc.cluster.local 15014 - outbound EDS
kube-dns.kube-system.svc.cluster.local 53 - outbound EDS
kube-dns.kube-system.svc.cluster.local 9153 - outbound EDS
kubernetes.default.svc.cluster.local 443 - outbound EDS
metrics-server.kube-system.svc.cluster.local 443 - outbound EDS
outbound_.15010_._.istiod.istio-system.svc.cluster.local - - - EDS
outbound_.15012_._.istiod.istio-system.svc.cluster.local - - - EDS
outbound_.15014_._.istiod.istio-system.svc.cluster.local - - - EDS
outbound_.15021_._.istio-ingressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.15443_._.istio-egressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.15443_._.istio-ingressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.443_._.istio-egressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.443_._.istio-ingressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.443_._.istiod.istio-system.svc.cluster.local - - - EDS
outbound_.443_._.kubernetes.default.svc.cluster.local - - - EDS
outbound_.443_._.metrics-server.kube-system.svc.cluster.local - - - EDS
outbound_.53_._.istiocoredns.istio-system.svc.cluster.local - - - EDS
outbound_.53_._.kube-dns.kube-system.svc.cluster.local - - - EDS
outbound_.8000_._.httpbin.bar.svc.cluster.local - - - EDS
outbound_.8000_._.httpbin.unite-core-stage1.svc.cluster.local - - - EDS
outbound_.80_._.default-http-backend.ingress-nginx.svc.cluster.local - - - EDS
outbound_.80_._.istio-egressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.80_._.istio-ingressgateway.istio-system.svc.cluster.local - - - EDS
outbound_.80_._.sleep.unite-core-stage1.svc.cluster.local - - - EDS
outbound_.853_._.istiod.istio-system.svc.cluster.local - - - EDS
outbound_.9153_._.kube-dns.kube-system.svc.cluster.local - - - EDS
prometheus_stats - - - STATIC
sds-grpc - - - STATIC
sleep.unite-core-stage1.svc.cluster.local 80 - outbound EDS
xds-grpc - - - STRICT_DNS
zipkin - - - STRICT_DNS
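
(The listing above is, roughly, the istioctl proxy-config clusters output for the ingress pod, i.e.:)

istioctl proxy-config clusters istio-ingressgateway-85d8dd7994-28gbc.istio-system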

I must be missing something obvious here. I would guess one of the following (a quick check for 1 and 3 is sketched after the list):

  1. The SNI gateway is not actually running in TLS mode
  2. The client is meant to be mapping the .global hostname to svc.cluster.local when populating the SNI field in the Client Hello
  3. The ingress is meant to have a route for the .global path (I’ve been assuming it’s doing automatic translation)
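
To poke at 1 and 3, the raw 15443 listener on the ingress can be dumped and eyeballed for whether it is terminating TLS or passing the SNI through:

istioctl proxy-config listeners istio-ingressgateway-85d8dd7994-28gbc.istio-system --port 15443 -o json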

Any ideas on how to troubleshoot this further?

-Neil

Hi!
Have you managed to fix your problem?
I am having similar issues after upgrading my clusters from Istio 1.5.x to 1.7.3. Right before the upgrade, everything was running smoothly with our own CAs.

I have managed to run ksniff on the cluster, and the behaviour is exactly the same as yours: after the Client Hello we see a TCP ACK from the ingress gateway, immediately followed by a RST,ACK.
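
For reference, the capture was taken with something along the lines of (pod name is a placeholder):

kubectl sniff <istio-ingressgateway-pod> -n istio-system -f "tcp port 15443" -o /tmp/sni.pcap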

Thanks

Our issue was a mistake within the Rancher-provided helm chart for Istio (which was itself caused by an incorrect documentation update on the Istio side for v1.5). The end result was that the environment variable ISTIO_META_ROUTER_MODE was being set to ‘standard’ instead of ‘sni-dnat’ inside the ingress.

The behavior of the ingress is that it will reset the socket if it doesn’t have a listener for the SNI being presented. I haven’t looked through the code, but assume ‘sni-dnat’ is responsible for translating the synthetic .global address into .svc.cluster.local before the listener and route are evaluated.

We added gateways.istio-ingressgateway.env.ISTIO_META_ROUTER_MODE=sni-dnat to the Helm values to fix it.
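
A quick way to confirm the variable actually made it into the gateway (assuming the default deployment name):

kubectl -n istio-system get deploy istio-ingressgateway -o yaml | grep -A1 ISTIO_META_ROUTER_MODE
# expect: value: sni-dnat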

Unfortunately, that doesn’t seem to be the case for us…
I’ve checked the environment variables in the pod and they appear to be correct.
Do you have any other clues as to where the issue might be?

We even tried purging all Istio configuration and cleanly reinstalling 1.7.3, but the issue persists…

I wish Istio were more debug-friendly than it actually is…

Thank You

@blackenz Have you managed to solve your problem with the SNI proxy port (15443)?

@ngarratt Can you share the listener config for the SNI port (15443) on the Istio gateway? I am running into the exact same issue on v1.6.9.