I’m running Istio version 1.0.5 (also tried 1.0.6). I used Helm for the installation with the following settings:
helm template install/kubernetes/helm/istio \
--name istio --namespace istio-system \
--set tracing.enabled=true \
--set servicegraph.enabled=false \
--set global.meshExpansion=true \
--set grafana.enabled=true \
--set global.mtls.enabled=true \
--set global.k8sIngressHttps=true \
--set sidecarInjectorWebhook.enabled=true \
--set certmanager.enabled=true \
--set certmanager.tag=v0.5.2 \
--set kiali.enabled=true \
> istio.yaml
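(For completeness: I then create the namespace, apply the rendered file, and check that the control-plane pods come up. Nothing special, roughly:)

kubectl create namespace istio-system
kubectl apply -f istio.yaml
kubectl get pods -n istio-system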
The problem I’m facing: communication from outside the mesh works just fine, and the frontend (see below) is reachable. But service-to-service communication inside the mesh does not work; every request to a valid DNS entry results in a 404.
# envoy proxy log
[2019-03-07T14:16:34.918Z] "GET /example-backend/infoHTTP/1.1" 404 NR 0 0 0 - "-" "curl/7.52.1" "ad7593ed-7743-49a0-a673-fd29a2e7663d" "example-backend:9292" "-" - - 10.2.200.109:9292 10.1.1.4:50460
My services are set up like this:
Backend Services:
apiVersion: v1
kind: Service
metadata:
  name: example-backend
  labels:
    app: example-backend
spec:
  type: ClusterIP
  ports:
  - port: 9292
    name: http
  selector:
    app: example-backend
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: example-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-backend
      track: stable
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: example-backend
        version: 0.0.1
        track: stable
    spec:
      containers:
      - name: example-backend
        image: eu.gcr.io/freska-staging/example-backend:0.0.1
        imagePullPolicy: Always
        ports:
        - containerPort: 9292
        env:
          [...]
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-backend
  labels:
    app: example-backend
    access: internal
spec:
  hosts:
  - example-backend.default.svc.cluster.local
  gateways:
  - mesh
  http:
  - match:
    - uri:
        prefix: /v2/example-backend/
    - uri:
        prefix: /example-backend/info
    route:
    - destination:
        host: example-backend
        subset: stable
        port:
          number: 9292
      weight: 100
    - destination:
        host: example-backend
        subset: canary
        port:
          number: 9292
      weight: 0
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: example-backend-canary-traffic
  labels:
    app: example-backend
spec:
  host: example-backend
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    loadBalancer:
      simple: LEAST_CONN
  subsets:
  - name: stable
    labels:
      track: stable
  - name: canary
    labels:
      track: canary
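(The pods do get the istio-proxy sidecar injected, as the access log above shows. A quick way I double-check that, with the jsonpath just being a convenience:)

kubectl get pods -l app=example-backend \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
# each pod should list both example-backend and istio-proxy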
Frontend Service:
The basic setup is identical, but the VirtualService is slightly different:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - "frontend.example.io"
  - customer-accounts
  gateways:
  - public-api-gateway
  - mesh
  http:
  - match:
    - uri:
        prefix: /welcome
    - uri:
        prefix: /products
    route:
    - destination:
        host: frontend
        subset: stable
        port:
          number: 9292
      weight: 100
    - destination:
        host: frontend
        subset: canary
        port:
          number: 9292
      weight: 0
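(The public-api-gateway referenced above isn't included here. It is a standard Gateway along these lines; the host and certificate paths in this sketch are placeholders, not our real values:)

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-api-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
    hosts:
    - "frontend.example.io"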
I must admit the “GET /example-backend/infoHTTP/1.1” is surprising; I expected a space before HTTP/1.1.
I have a “4 point” debugging strategy.
Point 1 is the app itself. Verify that your service really exposes the URL you are trying to reach:
kubectl exec <any-example-backend-pod> -c istio-proxy -- curl localhost:9292/example-backend/info
Point 2 is the sidecar of the app. Normally I would check here next. You have global.mtls.enabled=true. You can simulate an mTLS request coming into example-backend like this:
PODNAME=...
SERVICE=example-backend
NS=default
SVCPORT=9292
URLPATH=/example-backend/info    # use URLPATH rather than PATH so the shell's own PATH isn't clobbered
PODIP=$(kubectl get pod $PODNAME -o jsonpath='{.status.podIP}')
kubectl exec $PODNAME -c istio-proxy -- curl -v https://$SERVICE.$NS.svc.cluster.local:$SVCPORT$URLPATH --resolve "$SERVICE.$NS.svc.cluster.local:$SVCPORT:$PODIP" --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem --insecure
If this works, go on to point 3 by running the same test from the sidecar of frontend. If that also works, go on to point 4 by installing curl or a similar tool in the application container of frontend and trying to contact the backend over plain HTTP.
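For point 4, something along these lines usually does the trick (the package manager and the app container name depend on your frontend image, so adjust as needed; note the plain http://, the sidecar takes care of mTLS on the way out):

kubectl exec -it <any-frontend-pod> -c frontend -- sh
# inside the container:
apk add --no-cache curl    # Alpine; or: apt-get update && apt-get install -y curl
curl -v http://example-backend:9292/example-backend/info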
Thanks for the quick reply and the detailed instructions!
The missing space is just a stupid copy/paste mistake, I think.
I went through the 4-step process you suggested; this is what I found:
- Works. The service does expose the endpoint as expected.
- Also works.
- Here I'm not sure. The combination of the frontend PODNAME with the backend PODIP works; using the frontend's PODNAME and PODIP fails. I will add both outputs below. The service called “customer-accounts” is our frontend, whereas “customer-notifications” is the backend service.
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.2.16 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
* Trying 10.1.2.16...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.2.16) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=#1300
* start date: Wed, 27 Feb 2019 13:58:40 GMT
* expire date: Tue, 28 May 2019 13:58:40 GMT
* issuer: O=k8s.cluster.local
* compression: NULL
* ALPN, server accepted to use http/1.1
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< content-type: text/plain
< x-cascade: pass
< content-length: 40
< x-envoy-upstream-service-time: 7
< date: Fri, 08 Mar 2019 07:19:11 GMT
< server: envoy
< x-envoy-decorator-operation: customer-accounts.customer-team.svc.cluster.local:9292/*
<
{ [40 bytes data]
100 40 100 40 0 0 84 0 --:--:-- --:--:-- --:--:-- 84
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.0.4 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
* Trying 10.1.0.4...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.0.4) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=#1300
* start date: Wed, 27 Feb 2019 13:58:40 GMT
* expire date: Tue, 28 May 2019 13:58:40 GMT
* issuer: O=k8s.cluster.local
* compression: NULL
* ALPN, server accepted to use http/1.1
> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 418
< x-envoy-upstream-service-time: 1
< date: Fri, 08 Mar 2019 07:21:02 GMT
< server: envoy
< x-envoy-decorator-operation: customer-notifications.customer-team.svc.cluster.local:9292/*
<
{ [418 bytes data]
100 418 100 418 0 0 3531 0 --:--:-- --:--:-- --:--:-- 3512
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
{"name":"customer-notifications","version":"0.0.1","environment":"staging" [.....] }%
The only thing that stands out to me is the non-matching CN. Can this be the root cause? And how would I change that?
- Fails.
This is not a problem that I have seen.
I suspect the problem is not your
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
because I have seen that warning on working systems. The --insecure option of curl should let curl skip that.
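If you want to double-check the identity in that cert anyway: Istio puts it in the SAN (a spiffe:// URI), not in the CN, which is why the CN looks empty. Something like the following should show it (run from a machine that has openssl; the paths are the standard Citadel cert mounts):

kubectl exec $PODNAME -c istio-proxy -- cat /etc/certs/cert-chain.pem \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
# expect something like: URI:spiffe://cluster.local/ns/customer-team/sa/<service-account>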
Your subject: O=#1300 is bothering me because I have never seen that.
The response of 404 indicates that TLS / HTTP negotiation all worked. If there had been any security problems the connection would have been dropped.
The lack of a “GET /” line for the non-working example bothers me.
I haven’t seen any of these things and don’t know how to solve them.
Anyone?
This morning I set up a fresh cluster and deployed the two services. At first things looked promising, but then it stopped working again. I found that istioctl proxy-status shows that the Envoy proxy of the service that is not able to communicate (customer-accounts) is not fully synced.
PROXY CDS LDS EDS RDS PILOT VERSION
customer-notifications-stable-6b7ccc8cd8-fb782.customer-team SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-notifications-stable-6b7ccc8cd8-sbx68.customer-team SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-accounts-stable-75f59dbf46-t8jd4.customer-team SYNCED SYNCED SYNCED (100%) STALE istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-accounts-stable-75f59dbf46-vlwnt.customer-team SYNCED SYNCED SYNCED (100%) STALE istio-pilot-868cdfb5f7-fk5vb 1.0.2
istio-egressgateway-88887488d-65fw4.istio-system SYNCED SYNCED SYNCED (100%) NOT SENT istio-pilot-868cdfb5f7-fk5vb 1.0.2
istio-ingressgateway-58c77897cc-crcr5.istio-system SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
Does someone know why this might be?
It started happening the second time I deployed that particular service.
It turns out that restarting Pilot temporarily fixes the issue, but newly created pods' proxies don't seem to get synced.
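For reference, "restarting Pilot" means deleting the Pilot pod so the deployment recreates it. For anyone digging further, the per-proxy route config can also be dumped and diffed roughly like this (the exact istioctl proxy-config syntax varies a bit between versions; pod names are taken from the table above):

# force a fresh Pilot pod (label from the default Helm install)
kubectl delete pod -n istio-system -l istio=pilot

# dump the RDS config of a stale and a synced proxy and compare
istioctl proxy-config route customer-accounts-stable-75f59dbf46-t8jd4.customer-team -o json > accounts-routes.json
istioctl proxy-config route customer-notifications-stable-6b7ccc8cd8-fb782.customer-team -o json > notifications-routes.json
diff accounts-routes.json notifications-routes.json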
I am having the same issue: some pods just went into the "STALE" state, and restarting Pilot didn't help. I am stuck with this and no calls are coming into my environment.
Is this a rare problem, or was Pilot simply not tested for large workloads with more than a few microservices?
We are suddenly seeing the same issue attempting to resolve internal, standard K8s services by their names in our cluster. We are using 1.1 in a split-horizon EDS installation. I'm not sure what approach I should be taking to diagnose this exactly, as it seems to be an issue with Istio simply failing to distribute proxy configuration, which is a vital artery of Istio's entire function. What precisely should users do in this case? Has a root cause been found for this issue? I expect turbulence in the more advanced Istio features we use (e.g. cross-cluster resolution), but failing to resolve a standard K8s service in the same cluster seems like a non-starter.
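Concretely: is checking Pilot's own logs for push or validation errors the right first step, e.g. something like the following (container and label names from a default install), or is there a more systematic way to debug config distribution?

kubectl logs -n istio-system -l istio=pilot -c discovery --tail=300 | grep -iE "error|reject|conflict"
istioctl proxy-status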
Hi @ed.snible,
Thanks for those very useful debugging steps. My knowledge of Istio is limited, but I have established that in my case step 2 is the issue. I am running Istio 1.1.0.
Point 2 is the sidecar of the app. Normally I would check here next.
Do you have any suggestions on where to look to fix this particular issue?
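In particular, would istioctl authn tls-check be the right tool to see whether the client sidecar and the destination disagree about mTLS (which I understand is a common reason for step 2 to fail), e.g.:

istioctl authn tls-check <client-pod-name>.<namespace> <service>.<namespace>.svc.cluster.local
# a STATUS of CONFLICT on the destination would point to a policy/DestinationRule mismatch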
Any help appreciated
Jason