Hi,
After upgrading a Kubernetes v1.17 multicluster setup from Istio 1.5.1 to 1.7.3, I started seeing problems with raw TCP services running across the clusters. I followed the upgrade steps recommended by Istio, but raw TCP services seem to be failing.
We are running two simple services on cluster2:
- a Python server running SimpleHTTPServer;
- a netcat listener.
The services are configured as follows:
kind: Service
apiVersion: v1
metadata:
  name: python-multicluster-test
  namespace: default
  labels:
    app: python-multicluster-test
spec:
  ports:
    - name: http-python
      protocol: TCP
      port: 8000
      targetPort: 8000
  selector:
    app: python-multicluster-test
  type: ClusterIP
  sessionAffinity: None
---
kind: Service
apiVersion: v1
metadata:
  name: istio-multicluster-test
  namespace: default
spec:
  ports:
    - name: tcp-nc
      protocol: TCP
      port: 7000
      targetPort: 7000
  selector:
    app: istio-multicluster-test
  type: ClusterIP
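For reference, the workloads behind these services are plain single-container Deployments. A minimal sketch of the Python one is below; the image and command are assumptions (any HTTP server listening on 8000 behaves the same), and the netcat listener is analogous on port 7000:

# Hypothetical Deployment backing python-multicluster-test
# (image/command assumed; the actual workload only needs to listen on 8000)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-multicluster-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: python-multicluster-test
  template:
    metadata:
      labels:
        app: python-multicluster-test
    spec:
      containers:
        - name: python
          image: python:3.8-alpine
          command: ["python", "-m", "http.server", "8000"]
          ports:
            - containerPort: 8000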
Both clusters are provisioned with the following IstioOperator CR:
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-controlplane
  namespace: istio-system
spec:
  addonComponents:
    ingressGateways:
      enabled: true
    istiocoredns:
      enabled: true
    kiali:
      enabled: true
    prometheus:
      enabled: true
  meshConfig:
    accessLogEncoding: JSON
  profile: default
  values:
    gateways:
      istio-ingressgateway:
        sds:
          enabled: true
    global:
      controlPlaneSecurityEnabled: true
      hub: docker.io/istio
      multiCluster:
        enabled: true
      podDNSSearchNamespaces:
        - global
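Since *.global resolution in this setup relies on istiocoredns plus the global pod DNS search namespace, both clusters also carry the usual stub-domain configuration. A minimal sketch, assuming a kube-dns ConfigMap and a placeholder istiocoredns ClusterIP:

# Sketch of the kube-dns stub domain pointing .global at istiocoredns
# (10.0.0.10 is a placeholder; substitute the actual ClusterIP of the istiocoredns service)
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"global": ["10.0.0.10"]}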
On cluster1 I have the following ServiceEntries:
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: istio-multicluster-test-se
  namespace: default
spec:
  addresses:
    - 240.0.0.87
  endpoints:
    - address: <REDACTED>
      ports:
        netcat: 31912
  hosts:
    - istio-multicluster-test.default.global
  location: MESH_INTERNAL
  ports:
    - name: netcat
      number: 7000
      protocol: TCP
  resolution: DNS
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: python-multicluster-test-se
  namespace: default
spec:
  addresses:
    - 240.0.0.88
  endpoints:
    - address: <REDACTED>
      ports:
        http1: 31912
  hosts:
    - python-multicluster-test.default.global
  location: MESH_INTERNAL
  ports:
    - name: http1
      number: 8000
      protocol: TCP
  resolution: DNS
Following the bug reported in Replicated control planes regression bug · Issue #27909 · istio/istio · GitHub, I had to edit the multicluster ingress gateway EnvoyFilter to enable the rewrite that transforms the *.global domain into *.svc.cluster.local.
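For anyone hitting the same issue: the edit amounts to re-applying the tcp_cluster_rewrite network filter on the 15443 listener. A rough sketch of what the patched istio-multicluster-ingressgateway EnvoyFilter looks like on 1.7 follows; the exact filter names and @type string are from memory of the generated config, so treat them as assumptions and compare against your own cluster:

# Sketch of the patched EnvoyFilter; exact filter/@type strings may differ per Istio version
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: istio-multicluster-ingressgateway
  namespace: istio-system
spec:
  configPatches:
    - applyTo: NETWORK_FILTER
      match:
        context: GATEWAY
        listener:
          portNumber: 15443
          filterChain:
            filter:
              name: envoy.filters.network.tcp_proxy  # 1.7 uses the new-style name; the old match was the regression
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.network.tcp_cluster_rewrite
          typed_config:
            '@type': type.googleapis.com/istio.envoy.config.filter.network.tcp_cluster_rewrite.v2alpha1.TcpClusterRewrite
            cluster_pattern: \.global$
            cluster_replacement: .svc.cluster.local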
When trying to connect from cluster1 to cluster2's istio-multicluster-test-se, we are getting TCP RSTs. The ingress gateway logs show the following behavior:
2020-10-27T16:27:20.652671Z debug envoy filter tls inspector: new connection accepted
2020-10-27T16:27:20.652741Z trace envoy filter tls inspector: recv: -1
2020-10-27T16:27:20.653023Z trace envoy filter tls inspector: recv: 6596
2020-10-27T16:27:20.653113Z debug envoy filter tls:onServerName(), requestedServerName: outbound_.7000_..istio-multicluster-test.default.global
2020-10-27T16:27:20.653140Z trace envoy filter tls inspector: done: true
2020-10-27T16:27:20.653264Z debug envoy filter [C8190] new tcp proxy session
2020-10-27T16:27:20.653292Z trace envoy connection [C8190] readDisable: disable=true disable_count=0 state=0 buffer_length=0
2020-10-27T16:27:20.653315Z trace envoy filter [C8190] sni_cluster: new connection with server name outbound.7000_..istio-multicluster-test.default.global
2020-10-27T16:27:20.653331Z trace envoy filter [C8190] tcp_cluster_rewrite: new connection with server name outbound.7000_..istio-multicluster-test.default.global
2020-10-27T16:27:20.653357Z trace envoy filter [C8190] tcp_cluster_rewrite: final tcp proxy cluster name outbound.7000_..istio-multicluster-test.default.svc.cluster.local
2020-10-27T16:27:20.653421Z debug envoy filter [C8190] Creating connection to cluster outbound.7000_..istio-multicluster-test.default.svc.cluster.local
2020-10-27T16:27:20.653465Z debug envoy pool creating a new connection
2020-10-27T16:27:20.653534Z debug envoy pool [C8191] connecting
2020-10-27T16:27:20.653563Z debug envoy connection [C8191] connecting to 10.44.0.17:7000
2020-10-27T16:27:20.653729Z debug envoy connection [C8191] connection in progress
2020-10-27T16:27:20.653778Z debug envoy pool queueing request due to no available connections
2020-10-27T16:27:20.653791Z debug envoy conn_handler [C8190] new connection
2020-10-27T16:27:20.653800Z trace envoy main item added to deferred deletion list (size=1)
2020-10-27T16:27:20.653809Z trace envoy main clearing deferred deletion list (size=1)
2020-10-27T16:27:20.653828Z trace envoy connection [C8190] socket event: 2
2020-10-27T16:27:20.653834Z trace envoy connection [C8190] write ready
2020-10-27T16:27:20.655909Z trace envoy connection [C8191] socket event: 2
2020-10-27T16:27:20.655952Z trace envoy connection [C8191] write ready
2020-10-27T16:27:20.655966Z debug envoy connection [C8191] connected
2020-10-27T16:27:20.655979Z debug envoy pool [C8191] assigning connection
2020-10-27T16:27:20.655995Z trace envoy connection [C8190] readDisable: disable=false disable_count=1 state=0 buffer_length=0
2020-10-27T16:27:20.656019Z debug envoy filter TCP:onUpstreamEvent(), requestedServerName: outbound.7000_._.istio-multicluster-test.default.global
2020-10-27T16:27:20.656047Z trace envoy connection [C8190] socket event: 3
2020-10-27T16:27:20.656054Z trace envoy connection [C8190] write ready
2020-10-27T16:27:20.656061Z trace envoy connection [C8190] read ready. dispatch_buffered_data=false
2020-10-27T16:27:20.656100Z trace envoy connection [C8190] read returns: 6596
2020-10-27T16:27:20.656135Z trace envoy connection [C8190] read error: Resource temporarily unavailable
2020-10-27T16:27:20.656161Z trace envoy filter [C8190] downstream connection received 6596 bytes, end_stream=false
2020-10-27T16:27:20.656171Z trace envoy filter Alpn Protocol Not Found. Expected istio-peer-exchange, Got
At first I thought there might be something wrong with the ALPN filter, but when I tried to connect to the service python-multicluster-test-se, the logs were almost identical; in this case, though, the connection isn't dropped and I get a correct response in cluster1. The following logs show this behavior:
2020-10-27T16:29:07.557939Z debug envoy filter tls inspector: new connection accepted
2020-10-27T16:29:07.558023Z trace envoy filter tls inspector: recv: 6597
2020-10-27T16:29:07.558125Z debug envoy filter tls:onServerName(), requestedServerName: outbound_.8000_..python-multicluster-test.default.global
2020-10-27T16:29:07.558264Z debug envoy filter [C8245] new tcp proxy session
2020-10-27T16:29:07.558292Z trace envoy connection [C8245] readDisable: disable=true disable_count=0 state=0 buffer_length=0
2020-10-27T16:29:07.558316Z trace envoy filter [C8245] sni_cluster: new connection with server name outbound.8000_..python-multicluster-test.default.global
2020-10-27T16:29:07.558334Z trace envoy filter [C8245] tcp_cluster_rewrite: new connection with server name outbound.8000_..python-multicluster-test.default.global
2020-10-27T16:29:07.558361Z trace envoy filter [C8245] tcp_cluster_rewrite: final tcp proxy cluster name outbound.8000_..python-multicluster-test.default.svc.cluster.local
2020-10-27T16:29:07.558409Z debug envoy filter [C8245] Creating connection to cluster outbound.8000_..python-multicluster-test.default.svc.cluster.local
2020-10-27T16:29:07.558459Z debug envoy pool creating a new connection
2020-10-27T16:29:07.558542Z debug envoy pool [C8246] connecting
2020-10-27T16:29:07.558559Z debug envoy connection [C8246] connecting to 10.36.0.5:8000
2020-10-27T16:29:07.558879Z debug envoy connection [C8246] connection in progress
2020-10-27T16:29:07.558972Z debug envoy pool queueing request due to no available connections
2020-10-27T16:29:07.558987Z debug envoy conn_handler [C8245] new connection
2020-10-27T16:29:07.559037Z trace envoy connection [C8245] socket event: 2
2020-10-27T16:29:07.559046Z trace envoy connection [C8245] write ready
2020-10-27T16:29:07.559053Z trace envoy connection [C8246] socket event: 2
2020-10-27T16:29:07.559059Z trace envoy connection [C8246] write ready
2020-10-27T16:29:07.559069Z debug envoy connection [C8246] connected
2020-10-27T16:29:07.559079Z debug envoy pool [C8246] assigning connection
2020-10-27T16:29:07.559094Z trace envoy connection [C8245] readDisable: disable=false disable_count=1 state=0 buffer_length=0
2020-10-27T16:29:07.559113Z debug envoy filter TCP:onUpstreamEvent(), requestedServerName: outbound.8000_._.python-multicluster-test.default.global
2020-10-27T16:29:07.559136Z trace envoy connection [C8245] socket event: 3
2020-10-27T16:29:07.559143Z trace envoy connection [C8245] write ready
2020-10-27T16:29:07.559149Z trace envoy connection [C8245] read ready. dispatch_buffered_data=false
2020-10-27T16:29:07.559177Z trace envoy connection [C8245] read returns: 6597
2020-10-27T16:29:07.559201Z trace envoy connection [C8245] read error: Resource temporarily unavailable
2020-10-27T16:29:07.559216Z trace envoy filter [C8245] downstream connection received 6597 bytes, end_stream=false
2020-10-27T16:29:07.559230Z trace envoy filter Alpn Protocol Not Found. Expected istio-peer-exchange, Got
The cluster configurations extracted from the config_dump on the ingress gateway are the following:
{
  "version_info": "2020-10-27T11:20:25Z/13",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "outbound|7000||istio-multicluster-test.default.svc.cluster.local",
    "type": "EDS",
    "eds_cluster_config": {
      "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
      },
      "service_name": "outbound|7000||istio-multicluster-test.default.svc.cluster.local"
    },
    "connect_timeout": "10s",
    "circuit_breakers": {
      "thresholds": [
        {
          "max_connections": 4294967295,
          "max_pending_requests": 4294967295,
          "max_requests": 4294967295,
          "max_retries": 4294967295
        }
      ]
    },
    "filters": [
      {
        "name": "istio.metadata_exchange",
        "typed_config": {
          "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
          "type_url": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
          "value": {
            "protocol": "istio-peer-exchange"
          }
        }
      }
    ],
    "transport_socket_matches": [
      {
        "name": "tlsMode-istio",
        "match": {
          "tlsMode": "istio"
        },
        "transport_socket": {
          "name": "envoy.transport_sockets.tls",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
            "common_tls_context": {
              "alpn_protocols": [
                "istio-peer-exchange",
                "istio"
              ],
              "tls_certificate_sds_secret_configs": [
                {
                  "name": "default",
                  "sds_config": {
                    "api_config_source": {
                      "api_type": "GRPC",
                      "grpc_services": [
                        {
                          "envoy_grpc": {
                            "cluster_name": "sds-grpc"
                          }
                        }
                      ],
                      "transport_api_version": "V3"
                    },
                    "initial_fetch_timeout": "0s",
                    "resource_api_version": "V3"
                  }
                }
              ],
              "combined_validation_context": {
                "default_validation_context": {
                  "match_subject_alt_names": [
                    {
                      "exact": "spiffe://cluster.local/ns/default/sa/default"
                    }
                  ]
                },
                "validation_context_sds_secret_config": {
                  "name": "ROOTCA",
                  "sds_config": {
                    "api_config_source": {
                      "api_type": "GRPC",
                      "grpc_services": [
                        {
                          "envoy_grpc": {
                            "cluster_name": "sds-grpc"
                          }
                        }
                      ],
                      "transport_api_version": "V3"
                    },
                    "initial_fetch_timeout": "0s",
                    "resource_api_version": "V3"
                  }
                }
              }
            },
            "sni": "outbound_.7000_._.istio-multicluster-test.default.svc.cluster.local"
          }
        }
      },
      {
        "name": "tlsMode-disabled",
        "match": {},
        "transport_socket": {
          "name": "envoy.transport_sockets.raw_buffer"
        }
      }
    ]
  },
  "last_updated": "2020-10-27T11:55:13.237Z"
},
{
  "version_info": "2020-10-27T15:58:01Z/17",
  "cluster": {
    "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
    "name": "outbound|8000||python-multicluster-test.default.svc.cluster.local",
    "type": "EDS",
    "eds_cluster_config": {
      "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
      },
      "service_name": "outbound|8000||python-multicluster-test.default.svc.cluster.local"
    },
    "connect_timeout": "10s",
    "circuit_breakers": {
      "thresholds": [
        {
          "max_connections": 4294967295,
          "max_pending_requests": 4294967295,
          "max_requests": 4294967295,
          "max_retries": 4294967295
        }
      ]
    },
    "filters": [
      {
        "name": "istio.metadata_exchange",
        "typed_config": {
          "@type": "type.googleapis.com/udpa.type.v1.TypedStruct",
          "type_url": "type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange",
          "value": {
            "protocol": "istio-peer-exchange"
          }
        }
      }
    ],
    "transport_socket_matches": [
      {
        "name": "tlsMode-istio",
        "match": {
          "tlsMode": "istio"
        },
        "transport_socket": {
          "name": "envoy.transport_sockets.tls",
          "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
            "common_tls_context": {
              "alpn_protocols": [
                "istio-peer-exchange",
                "istio"
              ],
              "tls_certificate_sds_secret_configs": [
                {
                  "name": "default",
                  "sds_config": {
                    "api_config_source": {
                      "api_type": "GRPC",
                      "grpc_services": [
                        {
                          "envoy_grpc": {
                            "cluster_name": "sds-grpc"
                          }
                        }
                      ],
                      "transport_api_version": "V3"
                    },
                    "initial_fetch_timeout": "0s",
                    "resource_api_version": "V3"
                  }
                }
              ],
              "combined_validation_context": {
                "default_validation_context": {
                  "match_subject_alt_names": [
                    {
                      "exact": "spiffe://cluster.local/ns/default/sa/default"
                    }
                  ]
                },
                "validation_context_sds_secret_config": {
                  "name": "ROOTCA",
                  "sds_config": {
                    "api_config_source": {
                      "api_type": "GRPC",
                      "grpc_services": [
                        {
                          "envoy_grpc": {
                            "cluster_name": "sds-grpc"
                          }
                        }
                      ],
                      "transport_api_version": "V3"
                    },
                    "initial_fetch_timeout": "0s",
                    "resource_api_version": "V3"
                  }
                }
              }
            },
            "sni": "outbound_.8000_._.python-multicluster-test.default.svc.cluster.local"
          }
        }
      },
      {
        "name": "tlsMode-disabled",
        "match": {},
        "transport_socket": {
          "name": "envoy.transport_sockets.raw_buffer"
        }
      }
    ]
  },
  "last_updated": "2020-10-27T15:58:01.465Z"
},
Was there any regression regarding TCP multicluster services? At this point I am stuck and running out of ideas… I have even purged the old version and tried a clean install of 1.7.3.
Has anyone else experienced similar problems?
Thank you