Gateways return 503 after adding x-datadog-trace-id to logs

I’m deploying Istio 1.6.14 on GKE 1.16 using an IstioOperator spec. After updating the spec with a custom accessLogFormat, the Istio gateways started returning 503s with the NR (no route configured) response flag for roughly 80% of requests. These same requests returned 200s both before this deploy and after it was reverted, so there was no change to the requests themselves.

Here is the access log format I used:

  accessLogFormat: |
      "accessLogFormat": "{\"traceID\": \"%REQ(x-datadog-trace-id)%\",\"protocol\": \"%PROTOCOL%\",\"upstream_service_time\": \"%REQ(x-envoy-upstream-service-time)%\",\"upstream_local_address\": \"%UPSTREAM_LOCAL_ADDRESS%\",\"duration\": \"%DURATION%\",\"upstream_transport_failure_reason\": \"%UPSTREAM_TRANSPORT_FAILURE_REASON%\",\"route_name\": \"%ROUTE_NAME%\",\"downstream_local_address\": \"%DOWNSTREAM_LOCAL_ADDRESS%\",\"user_agent\": \"%REQ(USER-AGENT)%\",\"response_code\": \"%RESPONSE_CODE%\",\"response_flags\": \"%RESPONSE_FLAGS%\",\"start_time\": \"%START_TIME%\",\"method\": \"%REQ(:METHOD)%\",\"request_id\": \"%REQ(X-REQUEST-ID)%\",\"upstream_host\": \"%UPSTREAM_HOST%\",\"x_forwarded_for\": \"%REQ(X-FORWARDED-FOR)%\",\"requested_server_name\": \"%REQUESTED_SERVER_NAME%\",\"bytes_received\": \"%BYTES_RECEIVED%\",\"istio_policy_status\": \"-\",\"bytes_sent\": \"%BYTES_SENT%\",\"upstream_cluster\": \"%UPSTREAM_CLUSTER%\",\"downstream_remote_address\": \"%DOWNSTREAM_REMOTE_ADDRESS%\",\"authority\": \"%REQ(:AUTHORITY)%\",\"path\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\"}"
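For what it’s worth, here’s a quick sanity check I could have run before deploying (a sketch, trimmed to three of the fields above): substitute a dummy value for every Envoy `%...%` command operator and confirm the remaining template parses as valid JSON.

```python
import json
import re

# Trimmed-down version of the format string passed to Envoy
fmt = ('{"traceID": "%REQ(x-datadog-trace-id)%",'
       '"response_code": "%RESPONSE_CODE%",'
       '"path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"}')

# Replace each %...% command operator with a placeholder,
# then confirm the result parses as JSON.
rendered = re.sub(r'%[^%]*%', 'dummy', fmt)
parsed = json.loads(rendered)
print(sorted(parsed))  # → ['path', 'response_code', 'traceID']
```

This only validates the JSON template itself, of course, not how Istio/Envoy interprets the operators.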

The only change to the format itself is the addition of the x-datadog-trace-id field; the rest of the fields should match the default gateway log format. Other changes in this deploy included enabling telemetry v1/v2, setting trace sampling to 100%, and pointing the Datadog tracer at the correct address (I was running into this issue.). I don’t think the telemetry/tracing changes are related, because reverting only the log format change resolved the issue.
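For context, the tracing-related changes looked roughly like this in the IstioOperator spec (a sketch from memory; the Datadog agent address is environment-specific):

    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      meshConfig:
        enableTracing: true
        defaultConfig:
          tracing:
            sampling: 100.0
            datadog:
              # node-local Datadog agent (environment-specific)
              address: $(HOST_IP):8126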

`istioctl ps` showed that all of the gateway proxies had up-to-date RDS configs, and the pods were not exceeding their CPU/memory limits.

So, what could have caused this? Is it possible the custom log format put the gateway proxies under too much load to route traffic properly? Any advice would be greatly appreciated. Thank you!