Gateways return 503 after adding x-datadog-trace-id to logs

I’m deploying Istio 1.6.14 on GKE 1.16 using an IstioOperator spec. After updating the spec with a custom accessLogFormat, the Istio gateways started returning 503s with the NR (no route configured) response flag for roughly 80% of requests. These same requests returned 200s both before this deploy and after it was reverted, so there was no change to the requests themselves.

Here is the access log format I used:

  accessLogFormat: |
      "accessLogFormat": "{\"traceID\": \"%REQ(x-datadog-trace-id)%\",\"protocol\": \"%PROTOCOL%\",\"upstream_service_time\": \"%REQ(x-envoy-upstream-service-time)%\",\"upstream_local_address\": \"%UPSTREAM_LOCAL_ADDRESS%\",\"duration\": \"%DURATION%\",\"upstream_transport_failure_reason\": \"%UPSTREAM_TRANSPORT_FAILURE_REASON%\",\"route_name\": \"%ROUTE_NAME%\",\"downstream_local_address\": \"%DOWNSTREAM_LOCAL_ADDRESS%\",\"user_agent\": \"%REQ(USER-AGENT)%\",\"response_code\": \"%RESPONSE_CODE%\",\"response_flags\": \"%RESPONSE_FLAGS%\",\"start_time\": \"%START_TIME%\",\"method\": \"%REQ(:METHOD)%\",\"request_id\": \"%REQ(X-REQUEST-ID)%\",\"upstream_host\": \"%UPSTREAM_HOST%\",\"x_forwarded_for\": \"%REQ(X-FORWARDED-FOR)%\",\"requested_server_name\": \"%REQUESTED_SERVER_NAME%\",\"bytes_received\": \"%BYTES_RECEIVED%\",\"istio_policy_status\": \"-\",\"bytes_sent\": \"%BYTES_SENT%\",\"upstream_cluster\": \"%UPSTREAM_CLUSTER%\",\"downstream_remote_address\": \"%DOWNSTREAM_REMOTE_ADDRESS%\",\"authority\": \"%REQ(:AUTHORITY)%\",\"path\": \"%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%\"}"
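For what it’s worth, here’s a quick sanity check I could have run before deploying (a sketch, trimmed to three of the fields above): substitute a dummy value for every Envoy `%...%` command operator and confirm the remaining template parses as valid JSON.

```python
import json
import re

# Trimmed-down version of the format string passed to Envoy
fmt = ('{"traceID": "%REQ(x-datadog-trace-id)%",'
       '"response_code": "%RESPONSE_CODE%",'
       '"path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%"}')

# Replace each %...% command operator with a placeholder,
# then confirm the result parses as JSON.
rendered = re.sub(r'%[^%]*%', 'dummy', fmt)
parsed = json.loads(rendered)
print(sorted(parsed))  # → ['path', 'response_code', 'traceID']
```

This only validates the JSON template itself, of course, not how Istio/Envoy interprets the operators.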

The only change to the format itself is the addition of the x-datadog-trace-id field; the rest of the fields should match the default gateway log format. Other changes in this deploy included enabling telemetry v1/v2, setting trace sampling to 100%, and pointing the Datadog tracer at the correct address (I was running into this issue.). I don’t think the telemetry/tracing changes are related, because reverting only the log format change resolved the issue.
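For context, the tracing-related changes looked roughly like this in the IstioOperator spec (a sketch from memory; the Datadog agent address is environment-specific):

    apiVersion: install.istio.io/v1alpha1
    kind: IstioOperator
    spec:
      meshConfig:
        enableTracing: true
        defaultConfig:
          tracing:
            sampling: 100.0
            datadog:
              # node-local Datadog agent (environment-specific)
              address: $(HOST_IP):8126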

`istioctl ps` showed that all of the gateway proxies had up-to-date RDS configs, and the pods were not exceeding their CPU/memory limits.

So, what could have caused this? Is it possible the custom log format put the gateway proxies under too much load to route traffic properly? Any advice would be greatly appreciated. Thank you!