Internal requests failing with 404

I’m running Istio 1.0.5 (I also tried 1.0.6). I used Helm for the installation with the following settings:

helm template install/kubernetes/helm/istio \
   --name istio --namespace istio-system \
   --set tracing.enabled=true \
   --set servicegraph.enabled=false \
   --set global.meshExpansion=true \
   --set grafana.enabled=true \
   --set global.mtls.enabled=true \
   --set global.k8sIngressHttps=true \
   --set sidecarInjectorWebhook.enabled=true \
   --set certmanager.enabled=true \
   --set certmanager.tag=v0.5.2 \
   --set kiali.enabled=true \
> istio.yaml
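
(The rendered manifest is then applied in the usual way, roughly:)

# create the namespace and apply everything the template rendered
kubectl create namespace istio-system
kubectl apply -f istio.yaml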

The problem I’m facing is that communication from outside the mesh works just fine: the frontend (see below) is reachable. But service-to-service communication inside the mesh does not work; every request to a valid DNS entry results in a 404.

# envoy proxy log
2019-03-07T14:16:34.918Z] "GET /example-backend/infoHTTP/1.1" 404 NR 0 0 0 - "-" "curl/7.52.1" "ad7593ed-7743-49a0-a673-fd29a2e7663d" "example-backend:9292" "-" - - 10.2.200.109:9292 10.1.1.4:50460
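
(That line is from the calling pod’s sidecar access log; it can be tailed with something like the following, where the pod name is a placeholder:)

# tail the Envoy access log of the caller's sidecar
kubectl logs <calling-pod> -c istio-proxy --tail=50 -f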

My services are set up like this:

Backend Services:
apiVersion: v1
kind: Service
metadata:
  name: example-backend
  labels:
    app: example-backend
spec:
  type: ClusterIP
  ports:
  - port: 9292
    name: http
  selector:
    app: example-backend

---

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: example-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-backend
      track: stable
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: example-backend
        version: 0.0.1
        track: stable
    spec:
      containers:
      - name: example-backend
        image: eu.gcr.io/freska-staging/example-backend:0.0.1
        imagePullPolicy: Always
        ports:
        - containerPort: 9292
        env:
        [...]
---

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-backend
  labels:
    app: example-backend
    access: internal
spec:
  hosts:
    - example-backend.default.svc.cluster.local
  gateways:
    - mesh
  http:
  - match:
      - uri:
          prefix: /v2/example-backend/
      - uri:
          prefix: /example-backend/info
    route:
      - destination:
          host: example-backend
          subset: stable
          port:
            number: 9292
        weight: 100
      - destination:
          host: example-backend
          subset: canary
          port:
            number: 9292
        weight: 0

---

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: example-backend-canary-traffic
  labels:
    app: example-backend
spec:
  host: example-backend
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    loadBalancer:
      simple: LEAST_CONN
  subsets:
    - name: stable
      labels:
        track: stable
    - name: canary
      labels:
        track: canary

Frontend Service:
The basic setup is identical, but the virtual service is slightly different

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
    - "frontend.example.io"
    - customer-accounts
  gateways:
    - public-api-gateway
    - mesh
  http:
  - match:
      - uri:
          prefix: /welcome
      - uri:
          prefix: /products
    route:
      - destination:
          host: frontend
          subset: stable
          port:
            number: 9292
        weight: 100
      - destination:
          host: frontend
          subset: canary
          port:
            number: 9292
        weight: 0

I must admit the “GET /example-backend/infoHTTP/1.1” is surprising; I expected a space before HTTP/1.1.

I have a “4 point” debugging strategy.

Point 1 is the app itself. Verify that your service really exposes the URL you are trying to reach by running kubectl exec <any-example-backend-pod> -c istio-proxy -- curl localhost:9292/example-backend/info.
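
A concrete form of that check (the pod is picked by the app label from your Service definition, the path is the one from your log):

# pick one backend pod via its app label and curl the path from inside its sidecar;
# a 200 here proves the application itself serves /example-backend/info
POD=$(kubectl get pod -l app=example-backend -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -c istio-proxy -- curl -s -o /dev/null -w '%{http_code}\n' localhost:9292/example-backend/info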

Point 2 is the sidecar of the app. Normally I would check here next. You have global.mtls.enabled=true. You can simulate an mTLS request coming into example-backend like this:

PODNAME=...
SERVICE=example-backend
NS=default
SVCPORT=9292
URLPATH=/example-backend/info   # don't name this variable PATH, or the shell can no longer find kubectl/curl
PODIP=$(kubectl get pod $PODNAME -o jsonpath='{.status.podIP}')
kubectl exec $PODNAME -c istio-proxy -- curl -v \
  https://$SERVICE.$NS.svc.cluster.local:$SVCPORT$URLPATH \
  --resolve "$SERVICE.$NS.svc.cluster.local:$SVCPORT:$PODIP" \
  --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem \
  --cacert /etc/certs/root-cert.pem --insecure

If this works, go on to point 3 by running the same test from the sidecar of a frontend pod.

If that also works, go on to point 4 by installing curl or a similar tool in the application container of frontend and trying to contact the backend over plain HTTP, roughly as sketched below.
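
A rough sketch of that last check; the pod and container names are placeholders, and it assumes curl is (or can be) installed in the application image:

# run plain HTTP from the frontend's application container (not the sidecar);
# the frontend's sidecar should transparently upgrade the call to mTLS
kubectl exec <frontend-pod> -c <app-container> -- \
  curl -v http://example-backend.default.svc.cluster.local:9292/example-backend/info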


Thanks for the quick reply and the detailed instructions! :)
The missing space is just a stupid copy/paste mistake, I think.

I went through the 4-step process you suggested; this is what I found:

  1. Works. The service does expose the endpoint as expected.
  2. Also works.
  3. Here I’m not sure. The combination of frontend PODNAME and backend PODIP works, but frontend PODNAME with the frontend’s own PODIP fails. I’ll add both outputs below (the failing one first, then the working one). The service called “customer-accounts” is our frontend, whereas “customer-notifications” is the backend service.
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.2.16 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
*   Trying 10.1.2.16...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.2.16) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification SKIPPED
* 	 server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* 	 common name:  (does not match 'customer-notifications.customer-team.svc.cluster.local')
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: O=#1300
* 	 start date: Wed, 27 Feb 2019 13:58:40 GMT
* 	 expire date: Tue, 28 May 2019 13:58:40 GMT
* 	 issuer: O=k8s.cluster.local
* 	 compression: NULL
* ALPN, server accepted to use http/1.1
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< content-type: text/plain
< x-cascade: pass
< content-length: 40
< x-envoy-upstream-service-time: 7
< date: Fri, 08 Mar 2019 07:19:11 GMT
< server: envoy
< x-envoy-decorator-operation: customer-accounts.customer-team.svc.cluster.local:9292/*
<
{ [40 bytes data]
100    40  100    40    0     0     84      0 --:--:-- --:--:-- --:--:--    84
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.0.4 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
*   Trying 10.1.0.4...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.0.4) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* 	 server certificate verification SKIPPED
* 	 server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* 	 common name:  (does not match 'customer-notifications.customer-team.svc.cluster.local')
* 	 server certificate expiration date OK
* 	 server certificate activation date OK
* 	 certificate public key: RSA
* 	 certificate version: #3
* 	 subject: O=#1300
* 	 start date: Wed, 27 Feb 2019 13:58:40 GMT
* 	 expire date: Tue, 28 May 2019 13:58:40 GMT
* 	 issuer: O=k8s.cluster.local
* 	 compression: NULL
* ALPN, server accepted to use http/1.1
> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 418
< x-envoy-upstream-service-time: 1
< date: Fri, 08 Mar 2019 07:21:02 GMT
< server: envoy
< x-envoy-decorator-operation: customer-notifications.customer-team.svc.cluster.local:9292/*
<
{ [418 bytes data]
100   418  100   418    0     0   3531      0 --:--:-- --:--:-- --:--:--  3512
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
{"name":"customer-notifications","version":"0.0.1","environment":"staging" [.....] }%

The only thing I see is the non-matching CN. Can this be the root cause? And how would I change that?

  4. Fails.

This is not a problem that I have seen.

I suspect the problem is not your

* error fetching CN from cert:The requested data were not available.
* 	 common name:  (does not match 'customer-notifications.customer-team.svc.cluster.local')

because I have seen that warning on working systems. The --insecure option of curl should let curl skip that.
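
If you want to double-check what identity the sidecar certificate actually carries, you can dump it; Istio puts the workload’s SPIFFE identity in a SAN rather than the CN, which is why the CN shows up empty. Roughly (pod name is a placeholder):

# print the first cert of the sidecar's chain and look at the Subject Alternative Name
kubectl exec <any-pod-with-sidecar> -c istio-proxy -- cat /etc/certs/cert-chain.pem \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'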

Your subject: O=#1300 is bothering me because I have never seen that.
The response of 404 indicates that TLS / HTTP negotiation all worked. If there had been any security problems the connection would have been dropped.
The lack of a “GET /” line for the non-working example bothers me.

I haven’t seen any of these things and don’t know how to solve them.

Anyone?

This morning I set up a fresh cluster and deployed the two services. At first things looked promising, but then it stopped working again. I found that istioctl proxy-status shows that the Envoy proxy of the service that is not able to communicate (customer-accounts) is not fully synced:

PROXY                                                                  CDS        LDS        EDS               RDS          PILOT                            VERSION
customer-notifications-stable-6b7ccc8cd8-fb782.customer-team           SYNCED     SYNCED     SYNCED (100%)     SYNCED       istio-pilot-868cdfb5f7-fk5vb     1.0.2
customer-notifications-stable-6b7ccc8cd8-sbx68.customer-team           SYNCED     SYNCED     SYNCED (100%)     SYNCED       istio-pilot-868cdfb5f7-fk5vb     1.0.2
customer-accounts-stable-75f59dbf46-t8jd4.customer-team                SYNCED     SYNCED     SYNCED (100%)     STALE        istio-pilot-868cdfb5f7-fk5vb     1.0.2
customer-accounts-stable-75f59dbf46-vlwnt.customer-team                SYNCED     SYNCED     SYNCED (100%)     STALE        istio-pilot-868cdfb5f7-fk5vb     1.0.2
istio-egressgateway-88887488d-65fw4.istio-system                       SYNCED     SYNCED     SYNCED (100%)     NOT SENT     istio-pilot-868cdfb5f7-fk5vb     1.0.2
istio-ingressgateway-58c77897cc-crcr5.istio-system                     SYNCED     SYNCED     SYNCED (100%)     SYNCED       istio-pilot-868cdfb5f7-fk5vb     1.0.2

Does anyone know why this might be?
It started happening the second time I deployed that particular service.

It turns out that restarting Pilot temporarily fixes the issue, but the proxies of newly created pods don’t seem to get synced.
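
For anyone else narrowing this down, a few checks that might help (pod and namespace are placeholders; this assumes istioctl 1.0+ and the default istio-pilot deployment name):

# diff what Pilot last pushed against what the stale proxy actually holds
istioctl proxy-status <stale-pod>.<namespace>

# inspect the routes the stale proxy currently has (RDS is the column stuck at STALE)
istioctl proxy-config routes <stale-pod> -n <namespace>

# check Pilot for push errors or rejected configuration
kubectl logs deploy/istio-pilot -c discovery -n istio-system | grep -iE 'error|reject'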

I am having the same issue, where some pods just went to a “STALE” state and restarting Pilot didn’t help. I am stuck with this, and no calls are coming into my environment.

Is this a rare problem, or was Pilot simply not tested for large workloads with more than a few microservices?

We are suddenly seeing the same issue when attempting to resolve internal, standard K8S services by their names in our cluster. We are using 1.1 in a split-horizon EDS installation. I’m not sure exactly what approach I should take to diagnose this, as it seems to be an issue with Istio simply failing to distribute proxy configuration, which is a vital artery of Istio’s entire function. What precisely should users do in this case? Has a root cause been found for this issue? I expect turbulence in the more advanced Istio features we use (i.e. cross-cluster resolution), but failing to resolve a standard K8S service in the same cluster seems like a non-starter.

Hi @ed.snible

Thanks for those very useful debugging steps.
My knowledge of Istio is limited.
I have established that, in my case, step 2 is the issue.

I am running Istio 1.1.0.

Point 2 is the sidecar of the app. Normally I would check here next.

Do you have any suggestions on where to look to fix this particular issue?
Any help appreciated.

Jason