I’m running Istio version 1.0.5 (also tried 1.0.6). I used Helm for the installation with the following settings:
helm template install/kubernetes/helm/istio \
--name istio --namespace istio-system \
--set tracing.enabled=true \
--set servicegraph.enabled=false \
--set global.meshExpansion=true \
--set grafana.enabled=true \
--set global.mtls.enabled=true \
--set global.k8sIngressHttps=true \
--set sidecarInjectorWebhook.enabled=true \
--set certmanager.enabled=true \
--set certmanager.tag=v0.5.2 \
--set kiali.enabled=true \
> istio.yaml
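(For completeness: I then create the namespace, apply the rendered file, and check that the control-plane pods come up. Nothing special, roughly:)

kubectl create namespace istio-system
kubectl apply -f istio.yaml
kubectl get pods -n istio-system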
The problem I’m facing: communication from outside the mesh works just fine, and the frontend (see below) is reachable. But service-to-service communication inside the mesh does not work; every request to a valid DNS entry results in a 404.
# envoy proxy log
[2019-03-07T14:16:34.918Z] "GET /example-backend/infoHTTP/1.1" 404 NR 0 0 0 - "-" "curl/7.52.1" "ad7593ed-7743-49a0-a673-fd29a2e7663d" "example-backend:9292" "-" - - 10.2.200.109:9292 10.1.1.4:50460
My services are set up like this:
Backend Services:
apiVersion: v1
kind: Service
metadata:
  name: example-backend
  labels:
    app: example-backend
spec:
  type: ClusterIP
  ports:
  - port: 9292
    name: http
  selector:
    app: example-backend
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: example-backend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-backend
      track: stable
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: example-backend
        version: 0.0.1
        track: stable
    spec:
      containers:
      - name: example-backend
        image: eu.gcr.io/freska-staging/example-backend:0.0.1
        imagePullPolicy: Always
        ports:
        - containerPort: 9292
        env:
          [...]
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: example-backend
  labels:
    app: example-backend
    access: internal
spec:
  hosts:
  - example-backend.default.svc.cluster.local
  gateways:
  - mesh
  http:
  - match:
    - uri:
        prefix: /v2/example-backend/
    - uri:
        prefix: /example-backend/info
    route:
    - destination:
        host: example-backend
        subset: stable
        port:
          number: 9292
      weight: 100
    - destination:
        host: example-backend
        subset: canary
        port:
          number: 9292
      weight: 0
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: example-backend-canary-traffic
  labels:
    app: example-backend
spec:
  host: example-backend
  trafficPolicy:
    tls:
      mode: ISTIO_MUTUAL
    loadBalancer:
      simple: LEAST_CONN
  subsets:
  - name: stable
    labels:
      track: stable
  - name: canary
    labels:
      track: canary
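(The pods do get the istio-proxy sidecar injected, as the access log above shows. A quick way I double-check that, with the jsonpath just being a convenience:)

kubectl get pods -l app=example-backend \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.containers[*].name}{"\n"}{end}'
# each pod should list both example-backend and istio-proxy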
Frontend Service:
The basic setup is identical, but the VirtualService is slightly different:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: frontend
spec:
  hosts:
  - "frontend.example.io"
  - customer-accounts
  gateways:
  - public-api-gateway
  - mesh
  http:
  - match:
    - uri:
        prefix: /welcome
    - uri:
        prefix: /products
    route:
    - destination:
        host: frontend
        subset: stable
        port:
          number: 9292
      weight: 100
    - destination:
        host: frontend
        subset: canary
        port:
          number: 9292
      weight: 0
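(The public-api-gateway referenced above isn't included here. It is a standard Gateway along these lines; the host and certificate paths in this sketch are placeholders, not our real values:)

apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: public-api-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
      privateKey: /etc/istio/ingressgateway-certs/tls.key
    hosts:
    - "frontend.example.io"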
I must admit the “GET /example-backend/infoHTTP/1.1” is surprising; I expected a space before HTTP/1.1.
I have a “4 point” debugging strategy.
Point 1 is the app itself. Verify that your service really exposes the URL you are trying to reach:
kubectl exec <any-example-backend-pod> -c istio-proxy -- curl localhost:9292/example-backend/info
Point 2 is the sidecar of the app. Normally I would check here next. You have global.mtls.enabled=true. You can simulate an mTLS request coming into example-backend like this:
PODNAME=...
SERVICE=example-backend
NS=default
SVCPORT=9292
URLPATH=/example-backend/info    # use URLPATH rather than PATH so the shell's own PATH isn't clobbered
PODIP=$(kubectl get pod $PODNAME -o jsonpath='{.status.podIP}')
kubectl exec $PODNAME -c istio-proxy -- curl -v https://$SERVICE.$NS.svc.cluster.local:$SVCPORT$URLPATH --resolve "$SERVICE.$NS.svc.cluster.local:$SVCPORT:$PODIP" --key /etc/certs/key.pem --cert /etc/certs/cert-chain.pem --cacert /etc/certs/root-cert.pem --insecure
If this works, go on to point 3 by running the same test from the sidecar of frontend. If that also works, go on to point 4 by installing curl or a similar tool in the application container of frontend and trying to contact the backend over plain HTTP.
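For point 4, something along these lines usually does the trick (the package manager and the app container name depend on your frontend image, so adjust as needed; note the plain http://, the sidecar takes care of mTLS on the way out):

kubectl exec -it <any-frontend-pod> -c frontend -- sh
# inside the container:
apk add --no-cache curl    # Alpine; or: apt-get update && apt-get install -y curl
curl -v http://example-backend:9292/example-backend/info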
Thanks for the quick reply and the detailed instructions!
The missing space is just a stupid copy/paste mistake, I think.
I went through the 4-step process you suggested; this is what I found:
- Works. The service does expose the endpoint as expected.
- Also works.
- Here I'm not sure. The combination of the frontend PODNAME with the backend PODIP works; using the frontend's PODNAME and PODIP fails. I will add both outputs below. The service called “customer-accounts” is our frontend, whereas “customer-notifications” is the backend service.
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.2.16 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
* Trying 10.1.2.16...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.2.16) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=#1300
* start date: Wed, 27 Feb 2019 13:58:40 GMT
* expire date: Tue, 28 May 2019 13:58:40 GMT
* issuer: O=k8s.cluster.local
* compression: NULL
* ALPN, server accepted to use http/1.1
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< content-type: text/plain
< x-cascade: pass
< content-length: 40
< x-envoy-upstream-service-time: 7
< date: Fri, 08 Mar 2019 07:19:11 GMT
< server: envoy
< x-envoy-decorator-operation: customer-accounts.customer-team.svc.cluster.local:9292/*
<
{ [40 bytes data]
100 40 100 40 0 0 84 0 --:--:-- --:--:-- --:--:-- 84
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
* Added customer-notifications.customer-team.svc.cluster.local:9292:10.1.0.4 to DNS cache
* Hostname customer-notifications.customer-team.svc.cluster.local was found in DNS cache
* Trying 10.1.0.4...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Connected to customer-notifications.customer-team.svc.cluster.local (10.1.0.4) port 9292 (#0)
* found 1 certificates in /etc/certs/root-cert.pem
* found 592 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: O=#1300
* start date: Wed, 27 Feb 2019 13:58:40 GMT
* expire date: Tue, 28 May 2019 13:58:40 GMT
* issuer: O=k8s.cluster.local
* compression: NULL
* ALPN, server accepted to use http/1.1
> GET //customer-notifications/info HTTP/1.1
> Host: customer-notifications.customer-team.svc.cluster.local:9292
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json
< x-content-type-options: nosniff
< content-length: 418
< x-envoy-upstream-service-time: 1
< date: Fri, 08 Mar 2019 07:21:02 GMT
< server: envoy
< x-envoy-decorator-operation: customer-notifications.customer-team.svc.cluster.local:9292/*
<
{ [418 bytes data]
100 418 100 418 0 0 3531 0 --:--:-- --:--:-- --:--:-- 3512
* Connection #0 to host customer-notifications.customer-team.svc.cluster.local left intact
{"name":"customer-notifications","version":"0.0.1","environment":"staging" [.....] }%
The only thing that stands out to me is the non-matching CN. Can this be the root cause? And how would I change that?
- Fails.
This is not a problem that I have seen.
I suspect the problem is not your
* error fetching CN from cert:The requested data were not available.
* common name: (does not match 'customer-notifications.customer-team.svc.cluster.local')
because I have seen that warning on working systems. The --insecure option of curl should let curl skip that.
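If you want to double-check the identity in that cert anyway: Istio puts it in the SAN (a spiffe:// URI), not in the CN, which is why the CN looks empty. Something like the following should show it (run from a machine that has openssl; the paths are the standard Citadel cert mounts):

kubectl exec $PODNAME -c istio-proxy -- cat /etc/certs/cert-chain.pem \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"
# expect something like: URI:spiffe://cluster.local/ns/customer-team/sa/<service-account>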
Your subject: O=#1300 is bothering me because I have never seen that.
The response of 404 indicates that TLS / HTTP negotiation all worked. If there had been any security problems the connection would have been dropped.
The lack of a “GET /” line for the non-working example bothers me.
I haven’t seen any of these things and don’t know how to solve them.
Anyone?
This morning I set up a fresh cluster and deployed the two services. At first things looked promising, but then it stopped working again. I found that istioctl proxy-status shows that the Envoy proxy of the service that is not able to communicate (customer-accounts) is not fully synced.
PROXY CDS LDS EDS RDS PILOT VERSION
customer-notifications-stable-6b7ccc8cd8-fb782.customer-team SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-notifications-stable-6b7ccc8cd8-sbx68.customer-team SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-accounts-stable-75f59dbf46-t8jd4.customer-team SYNCED SYNCED SYNCED (100%) STALE istio-pilot-868cdfb5f7-fk5vb 1.0.2
customer-accounts-stable-75f59dbf46-vlwnt.customer-team SYNCED SYNCED SYNCED (100%) STALE istio-pilot-868cdfb5f7-fk5vb 1.0.2
istio-egressgateway-88887488d-65fw4.istio-system SYNCED SYNCED SYNCED (100%) NOT SENT istio-pilot-868cdfb5f7-fk5vb 1.0.2
istio-ingressgateway-58c77897cc-crcr5.istio-system SYNCED SYNCED SYNCED (100%) SYNCED istio-pilot-868cdfb5f7-fk5vb 1.0.2
Does someone know why this might be?
It started happening the second time I deployed that particular service.
It turns out that restarting Pilot temporarily fixes the issue, but newly created pods' proxies don't seem to get synced.
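For reference, "restarting Pilot" means deleting the Pilot pod so the deployment recreates it. For anyone digging further, the per-proxy route config can also be dumped and diffed roughly like this (the exact istioctl proxy-config syntax varies a bit between versions; pod names are taken from the table above):

# force a fresh Pilot pod (label from the default Helm install)
kubectl delete pod -n istio-system -l istio=pilot

# dump the RDS config of a stale and a synced proxy and compare
istioctl proxy-config route customer-accounts-stable-75f59dbf46-t8jd4.customer-team -o json > accounts-routes.json
istioctl proxy-config route customer-notifications-stable-6b7ccc8cd8-fb782.customer-team -o json > notifications-routes.json
diff accounts-routes.json notifications-routes.json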
I am having the same issue: some pods just went into the "STALE" state, and restarting Pilot didn't help. I am stuck with this and no calls are coming into my environment.
Is this a rare problem, or was Pilot simply not tested for large workloads with more than a few microservices?
We are suddenly seeing the same issue attempting to resolve internal, standard K8s services by their names in our cluster. We are using 1.1 in a split-horizon EDS installation. I'm not sure what approach I should be taking to diagnose this exactly, as it seems to be an issue with Istio simply failing to distribute proxy configuration, which is a vital artery of Istio's entire function. What precisely should users do in this case? Has a root cause been found for this issue? I expect turbulence in the more advanced Istio features we use (e.g. cross-cluster resolution), but failing to resolve a standard K8s service in the same cluster seems like a non-starter.
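Concretely: is checking Pilot's own logs for push or validation errors the right first step, e.g. something like the following (container and label names from a default install), or is there a more systematic way to debug config distribution?

kubectl logs -n istio-system -l istio=pilot -c discovery --tail=300 | grep -iE "error|reject|conflict"
istioctl proxy-status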
Hi @ed.snible,
Thanks for those very useful debugging steps. My knowledge of Istio is limited, but I have established that in my case step 2 is the issue. I am running Istio 1.1.0.
Point 2 is the sidecar of the app. Normally I would check here next.
Do you have any suggestions on where to look to fix this particular issue?
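In particular, would istioctl authn tls-check be the right tool to see whether the client sidecar and the destination disagree about mTLS (which I understand is a common reason for step 2 to fail), e.g.:

istioctl authn tls-check <client-pod-name>.<namespace> <service>.<namespace>.svc.cluster.local
# a STATUS of CONFLICT on the destination would point to a policy/DestinationRule mismatch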
Any help appreciated
Jason