Ingress Gateway pods Segfault

Hey all, I’m having an issue with Istio Ingress Gateway pods crashing with a segfault when load increases. The logs show error messages like this:

Epoch 0 exited with error: signal: segmentation fault (core dumped)

About 50 users are browsing through the gateway, generating moderate activity. At first, I had 20 ingress gateway pods, each with requests of 1GB memory / 500m CPU and limits of 8GB memory / 4 CPU. So there should be plenty of resources available for 50 users.

When activity picks up, large groups of Ingress Gateway pods occasionally change to “Completed” and then “CrashLoopBackOff”. It was surprising to see these pods often crash in groups, though they also crash individually sometimes. Sometimes all pods are in the CrashLoopBackOff state, breaking ingress completely. Scaling up to 40 pods makes the crash loop less frequent, but it still happens. The nodes that these pods run on each have 90% of their memory available, and the Ingress Gateway pods seem to be using only 200MB to 900MB each, according to Prometheus. So unless memory suddenly spikes, there seems to be plenty available.

I was able to get the full stack trace here. My ingress gateway has a bunch of different Lua filters used for auth on different parts of the site. I suspect the Lua filters might be the problem, but I don’t know how to confirm or debug that.

What can I try to debug this issue? Is this a familiar issue?

"[Envoy (Epoch 0)] [2020-12-14 18:20:07.323][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x8
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.323][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.323][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 5a93703db9baee125294bb50c9541a1a11d526b4/1.13.4/Clean/RELEASE/BoringSSL
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.325][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: __restore_rt [0x7f0948f3a8a0]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.334][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #1: luaL_openlibs [0x55e8d1f207b7]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.342][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #2: Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::LuaThreadLocal::LuaThreadLocal() [0x55e8d1eb6ec7]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.350][56][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #3: std::__1::__function::__func<>::operator()() [0x55e8d1eb712c]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.907][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x8
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.907][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.907][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 5a93703db9baee125294bb50c9541a1a11d526b4/1.13.4/Clean/RELEASE/BoringSSL
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.908][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #0: __restore_rt [0x7f6b6a37f8a0]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.919][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #1: luaL_openlibs [0x55e6c91497b7]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.929][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #2: Envoy::Extensions::Filters::Common::Lua::ThreadLocalState::LuaThreadLocal::LuaThreadLocal() [0x55e6c90dfec7]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.938][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #3: std::__1::__function::__func<>::operator()() [0x55e6c90e012c]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.947][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #4: std::__1::__function::__func<>::operator()() [0x55e6ca20a0a8]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.956][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #5: std::__1::__function::__func<>::operator()() [0x55e6ca20b2d8]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.965][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #6: Envoy::Event::DispatcherImpl::runPostCallbacks() [0x55e6ca276986]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.974][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #7: event_process_active_single_queue [0x55e6ca5aa806]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.984][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #8: event_base_loop [0x55e6ca5a938e]
"[Envoy (Epoch 0)] [2020-12-14 18:20:07.993][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #9: Envoy::Server::WorkerImpl::threadRoutine() [0x55e6ca26d308]
"[Envoy (Epoch 0)] [2020-12-14 18:20:08.002][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #10: Envoy::Thread::ThreadImplPosix::ThreadImplPosix()::$_0::__invoke() [0x55e6ca775ad3]
"[Envoy (Epoch 0)] [2020-12-14 18:20:08.002][47][critical][backtrace] [bazel-out/k8-opt/bin/external/envoy/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:75] #11: start_thread [0x7f6b6a3746db]

Istio version: 1.5.8
Envoy version: 1.13.4

Based on the stack trace, it does look like something in the Lua filter. Can you please post the filters here?

Also, this looks similar: https://github.com/envoyproxy/envoy/issues/10241
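
To check that reading of the trace, the frames can be grouped by the thread id that Envoy prints in the second bracketed field of each log line. The following is only a sketch: `frames_by_thread` is a hypothetical helper name, and the regular expression assumes the exact log layout shown in the trace above.

```python
import re
from collections import defaultdict

# Matches backtrace frame lines such as:
#   ...][56][critical][backtrace] [<file>:75] #1: luaL_openlibs [0x55e8d1f207b7]
FRAME_RE = re.compile(
    r"\]\[(?P<tid>\d+)\]\[critical\]\[backtrace\] \[[^\]]+\] "
    r"#(?P<index>\d+): (?P<symbol>.+) \[0x[0-9a-f]+\]\s*$"
)

def frames_by_thread(lines):
    """Return {thread_id: [symbol of frame #0, symbol of frame #1, ...]}."""
    frames = defaultdict(list)
    for line in lines:
        m = FRAME_RE.search(line)
        if m:  # header lines ("Caught Segmentation fault", "Envoy version") are skipped
            frames[int(m.group("tid"))].append(m.group("symbol"))
    return dict(frames)
```

Applied to the trace above, threads 56 and 47 both list `luaL_openlibs` as frame #1, called from the `LuaThreadLocal` constructor, and thread 47 shows that call arriving through `DispatcherImpl::runPostCallbacks()`. That suggests the fault happens while a worker thread is setting up its Lua state rather than inside `envoy_on_request` itself, although symbol names in a release build without line numbers can be approximate.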

Thank you @nick_tetrate for your reply. Thank you also for that link. Here is an example of the Lua filter that I’m using. There is a copy of this filter per app.

---
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name:  my-filter
  namespace: "istio-system"
  labels:
    app: my-app
    chart: my-chart
    release: my-release
#
spec:
  workloadSelector:
    labels:
      istio: ingressgateway
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: GATEWAY
      listener:
        filterChain:
          filter:
            name: "envoy.http_connection_manager"
            subFilter:
              name: "envoy.router"
    patch:
      operation: INSERT_BEFORE
      value: # lua filter specification
        name: envoy.lua
        typed_config:
          "@type": "type.googleapis.com/envoy.config.filter.http.lua.v2.Lua"
          inlineCode: |
          
            function dump(o)
                if type(o) == 'table' then
                    local s = '{ '
                    for k,v in pairs(o) do
                        if type(k) ~= 'number' then k = '"'..k..'"' end
                        s = s .. '['..k..'] = ' .. dump(v) .. ','
                    end
                    return s .. '} '
                else
                    return tostring(o)
                end
            end
            function split(str, pat)
                -- source: http://lua-users.org/wiki/SplitJoin
                local t = {}  -- NOTE: use {n = 0} in Lua-5.0
                local fpat = "(.-)" .. pat
                local last_end = 1
                local s, e, cap = str:find(fpat, 1)
                while s do
                    if s ~= 1 or cap ~= "" then
                        table.insert(t,cap)
                    end
                    last_end = e+1
                    s, e, cap = str:find(fpat, last_end)
                end
                if last_end <= #str then
                    cap = str:sub(last_end)
                    table.insert(t, cap)
                end
                return t
            end
            function prefix_match(str, start)
                return str:sub(1, #start) == start
            end
            function exact_match(str, start)
                return str == start
            end 
            function process(request_handle, request_tracker, auth_url_domain, auth_url_path, auth_signin_domain, auth_signin_path, auth_signin_disable_redirect)

                -- http request headers from the original request
                request_handle:logDebug(request_tracker.."  ORIGINATING HEADERS")
                for key,value in pairs(request_handle:headers()) do
                    request_handle:logDebug(string.format("%s   originating header %s :: %s", request_tracker, key, value or ""))
                end
                request_handle:logDebug(string.format("%s  DISABLE_REDIRECT %s", request_tracker, auth_signin_disable_redirect))

                -- originating request path: i.e. /my-namespace/foo/
                local originating_url_path = request_handle:headers():get(":path")
                request_handle:logDebug(string.format("%s  ORIGINATING_URL_PATH: %s", request_tracker, originating_url_path))

                -- originating_scheme: the scheme of the originating host (i.e. https)
                local originating_scheme = request_handle:headers():get(":scheme")
                local scheme_str = (originating_scheme or "https")
                request_handle:logDebug(string.format("%s  SCHEME: %s", request_tracker, scheme_str))

                -- originating_host: the domain of the originating request  i.e. foo.example.com
                local originating_host = request_handle:headers():get(":authority")
                request_handle:logDebug(string.format("%s  ORIGINATING_HOST: %s", request_tracker, originating_host))

                -- originating_method: http method of originating request  i.e. GET
                local originating_method = request_handle:headers():get(":method")
                request_handle:logDebug(string.format("%s  ORIGINATING_METHOD: %s", request_tracker, originating_method))

                request_handle:logDebug(string.format("%s  AUTH_URL_DOMAIN: %s", request_tracker, auth_url_domain))

                -- i.e. /my-namespace/oauth2/start?rd=https://foo.example.com/my-namespace/
                local auth_signin_path_args = string.format("%s?rd=%s://%s%s", auth_signin_path, scheme_str, originating_host, originating_url_path)
                -- i.e. https://my-namespace-auth-service.my-namespace.svc.cluster.local
                local auth_signin_scheme_domain = string.format("%s://%s", scheme_str, auth_signin_domain)
                -- i.e. https://my-namespace-auth-service.my-namespace.svc.cluster.local/my-namespace/oauth2/start?rd=https://foo.example.com/my-namespace/
                local auth_signin_url = string.format("%s%s", auth_signin_scheme_domain, auth_signin_path_args)
                request_handle:logDebug(string.format("%s  AUTH_SIGNIN_URL: %s", request_tracker, auth_signin_url))

                -- url and args that gets hit each request to authenticate: i.e. /my-namespace/oauth2/auth?rd=https://foo.example.com/my-namespace/
                local auth_url_path_w_args = string.format("%s?rd=%s://%s%s", auth_url_path, scheme_str, originating_host, originating_url_path)
                request_handle:logDebug(string.format("%s  AUTH_URL_PATH_W_ARGS: %s", request_tracker, auth_url_path_w_args))

                -- auth_server_domain_and_scheme: i.e. https://foo.example.com
                local auth_server_domain_and_scheme = string.format("%s://%s", scheme_str, auth_signin_domain)
                -- absolute_original_url: i.e. https://foo.example.com/my-namespace/
                local absolute_original_url = string.format("%s%s", auth_server_domain_and_scheme, originating_url_path)
                request_handle:logDebug(string.format("%s  ABSOLUTE_ORIGINAL_URL: %s", request_tracker, absolute_original_url))

                -- A table (kinda like a Python dictionary) of http headers and http request parameters
                -- Note: Values that start with a colon are not really headers; they are request parameters,
                -- but httpCall seems to call them headers
                local auth_request_headers = {
                    [":method"] = 'GET',
                    [":path"] = auth_url_path_w_args,
                    [":authority"] = auth_url_domain,
                    ["Host"] = auth_url_domain,
                    ["X-Original-URL"] = absolute_original_url,
                    ["X-Original-Method"] = originating_method,
                    ["X-Auth-Request-Redirect"] = originating_url_path,
                    ["X-Sent-From"] = "istio-ingress-gateway",
                    ["Cookie"] = request_handle:headers():get("Cookie"),
                    ["Authorization"] = request_handle:headers():get("Authorization"),
                }

                request_handle:logDebug(string.format("%s  /oauth2/auth REQUEST headers: %s", request_tracker, dump(auth_request_headers)))

                -- keep these local: httpCall yields, and globals could be overwritten by another request
                local response_headers, body = request_handle:httpCall(
                    string.format("outbound|80||%s", auth_url_domain),
                    auth_request_headers,
                    nil,
                    5000)

                request_handle:logDebug(string.format("%s  /oauth2/auth RESPONSE headers: %s", request_tracker, dump(response_headers)))

                local status = tonumber(response_headers[":status"])
                request_handle:logDebug(string.format("%s a RESPONSE status: %s", request_tracker, status))
                -- check if the /oauth2/auth endpoint returns a non-200 status code (i.e. 401 or 403)
                if status == 200 or status == 202 then
                    request_handle:logInfo(string.format("%s  AUTH SUCCESSFUL", request_tracker))
                else
                    local response_headers = {}
                    local response_body = nil
                    if string.lower(auth_signin_disable_redirect) == "true"
                    then
                        -- if this option is the string "true", do not redirect the user,
                        -- but instead return a blank 40x
                        request_handle:logInfo(string.format("%s  Auth failed.", request_tracker))
                        response_headers[":status"] = 401
                        response_headers["Content-Type"] = "application/json"
                        response_body = '{"message": "Auth Failed"}'
                    else
                        -- redirect to signin page
                        request_handle:logInfo(string.format("%s  Auth failed. Redirect to: %s", request_tracker, auth_signin_url))
                        response_headers[":status"] = 302
                        response_headers["Location"] = auth_signin_url
                    end

                    -- process the response
                    return request_handle:respond(
                        response_headers,
                        response_body
                    )
                end
            end
            function envoy_on_request(request_handle)
                -- This is a special function that Envoy calls on every http request
                -- This will filter out all requests except for those destined for a
                -- particular app (hosted on a particular url prefix).
                -- START HERE

                -- Environment-specific variables
                -- The name of the filter
                local filter_name = "my-filter-name"
                -- The path that the tenant uses to root everything under
                local tenant_path = "/my-namespace/" -- i.e. /my-namespace/
                -- The path that this app uses to root everything under. If the app uses a 
                -- subdomain instead of a path, then the base_path will likely be "/"
                local base_path = "/my-namespace/foo/"  -- i.e. /my-namespace/foo/
                -- The base_path can be matched exactly or based on a prefix.
                local base_path_match = "prefix" -- can be "exact" or "prefix"
                -- The external domain that the user sees when they visit the app. (i.e. foo.bar.example.com or foo.example.com)
                local external_domain = "foo.example.com"
                -- Disable the redirect to the auth signin page if set to the string "true"
                local auth_signin_disable_redirect = "false"
                -- The external dns name of the domain hosting the auth server (i.e. foo.example.com)
                local auth_signin_domain = "foo.example.com"
                -- The url path to the auth login (start) page (i.e. /my-namespace/oauth2/start)
                local auth_signin_path_static = "/my-namespace/oauth2/start"
                -- The internal k8s service name of the oauth service
                local auth_url_domain = "my-namespace-auth-service.my-namespace.my-namespace.svc.cluster.local"
                -- The url path to the auth verify page (i.e. /my-namespace/oauth2/auth)
                local auth_url_path_static = "/my-namespace/oauth2/auth"

                -- Get the request path
                local url = request_handle:headers():get(":path")
                local authority = request_handle:headers():get(":authority")
                local authority_minus_port = split(authority, ":")[1]

                -- Create a random number to make it easier to associate individual requests in the logs
                local request_tracker = string.format("RID:%s", tostring(math.random(1000000, 9999999)))

                request_handle:logInfo(
                    string.format("%s ########### BEGIN REQUEST %s: match(%s:%s) url(%s:%s)",
                                    request_tracker, filter_name, external_domain, base_path, authority_minus_port, url))
                
                -- Should we match the path on a prefix or exact path
                local match_function = prefix_match -- my-namespace is prefix
                if base_path_match == "exact"
                then
                    match_function = exact_match
                elseif base_path_match == "prefix"
                then
                    match_function = prefix_match
                end

                -- do NOT require login for oauth urls (i.e. /my-namespace/oauth2/* )
                local anon_url = prefix_match(url, string.format("%soauth2/", tenant_path))
                if anon_url
                then
                    request_handle:logInfo(string.format("%s  AUTH URL: %s", request_tracker, url))
                elseif (match_function(url, base_path) and authority_minus_port == external_domain)
                then
                    request_handle:logInfo(string.format("%s  PRIVATE URL: %s", request_tracker, url))
                    process(request_handle, request_tracker,
                            auth_url_domain, auth_url_path_static, 
                            auth_signin_domain, auth_signin_path_static, auth_signin_disable_redirect)
                else
                    request_handle:logInfo(string.format("%s  NO MATCH: %s", request_tracker, url))
                end
                request_handle:logInfo(string.format("%s ########### END REQUEST: %s", request_tracker, url))
            end

This is the segfault that syslog reported:

Dec 14 21:50:07 my-hostname kubelet[8243]: I1214 21:50:07.235019    8243 setters.go:73] Using node IP: "10.1.2.3"
Dec 14 21:50:12 my-hostname kernel: envoy[28875]: segfault at 8 ip 00005561c43d5a83 sp 00007fff9f79b3e0 error 4 in envoy[5561c3fbe000+1fd1000]
Dec 14 21:50:12 my-hostname kernel: envoy[29114]: segfault at 8 ip 00005561c43d5a83 sp 00007f5b2ede6560 error 4 in envoy[5561c3fbe000+1fd1000]
Dec 14 21:50:12 my-hostname kernel: Code: 81 c4 d0 00 00 00 5b 41 5e 5d c3 48 89 df e8 94 e9 fe ff eb ba 66 90 55 48 89 e5 41 57 41 56 53 50 41 89 d6 49 89 f7 48 89 fb <8b> 47 08 8b 48 20 3b 48 24 0f 83 08 01 00 00 48 8b 43 10 8b 40 f8
Dec 14 21:50:12 my-hostname kernel: envoy[21510]: segfault at 8 ip 0000561ee4b5da83 sp 00007f0d1debf560 error 4
Dec 14 21:50:12 my-hostname kernel: Code: 81 c4 d0 00 00 00 5b 41 5e 5d c3 48 89 df e8 94 e9 fe ff eb ba 66 90 55 48 89 e5 41 57 41 56 53 50 41 89 d6 49 89 f7 48 89 fb <8b> 47 08 8b 48 20 3b 48 24 0f 83 08 01 00 00 48 8b 43 10 8b 40 f8
Dec 14 21:50:12 my-hostname kernel: Code: 81 c4 d0 00 00 00 5b 41 5e 5d c3 48 89 df e8 94 e9 fe ff eb ba 66 90 55 48 89 e5 41 57 41 56 53 50 41 89 d6 49 89 f7 48 89 fb <8b> 47 08 8b 48 20 3b 48 24 0f 83 08 01 00 00 48 8b 43 10 8b 40 f8
Dec 14 21:50:13 my-hostname kernel: envoy[22027]: segfault at 8 ip 000056081f57fa83 sp 00007f2d97e84560 error 4 in envoy[56081f168000+1fd1000]
Dec 14 21:50:13 my-hostname kernel: envoy[21800]: segfault at 8 ip 000056081f57fa83 sp 00007fff94471ba0 error 4
Dec 14 21:50:13 my-hostname kernel: Code: 81 c4 d0 00 00 00 5b 41 5e 5d c3 48 89 df e8 94 e9 fe ff eb ba 66 90 55 48 89 e5 41 57 41 56 53 50 41 89 d6 49 89 f7 48 89 fb <8b> 47 08 8b 48 20 3b 48 24 0f 83 08 01 00 00 48 8b 43 10 8b 40 f8
Dec 14 21:50:13 my-hostname kernel: Code: 81 c4 d0 00 00 00 5b 41 5e 5d c3 48 89 df e8 94 e9 fe ff eb ba 66 90 55 48 89 e5 41 57 41 56 53 50 41 89 d6 49 89 f7 48 89 fb <8b> 47 08 8b 48 20 3b 48 24 0f 83 08 01 00 00 48 8b 43 10 8b 40 f8
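
The kernel lines above can be decoded a little further. The sketch below assumes x86-64 nodes and the standard Linux `segfault at ... ip ... sp ... error ...` layout, in which every number is hexadecimal; `decode_segfault` is a hypothetical helper name, not part of any tool.

```python
import re

# Kernel format: "segfault at <addr> ip <ip> sp <sp> error <code> [in <name>[<base>+<size>]]"
# The trailing "in ..." part is optional because some of the lines above omit it.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+)(?: in \S+\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)

def decode_segfault(line):
    """Decode one kernel segfault line; returns None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = m.group("base")
    return {
        "fault_addr": int(m.group("addr"), 16),
        # Offset of the faulting instruction inside the mapped binary, if the base is known.
        "offset_in_binary": ip - int(base, 16) if base else None,
        # x86 page-fault error code: bit 0 = page present, bit 1 = write, bit 2 = user mode.
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }

line = ("Dec 14 21:50:12 my-hostname kernel: envoy[28875]: segfault at 8 "
        "ip 00005561c43d5a83 sp 00007fff9f79b3e0 error 4 in envoy[5561c3fbe000+1fd1000]")
info = decode_segfault(line)
print(hex(info["offset_in_binary"]), info["access"], info["fault_addr"])  # 0x417a83 read 8
```

The processes whose lines include a load base (5561c3fbe000 and 56081f168000) both resolve to offset 0x417a83, so they appear to be crashing at the same instruction. Error code 4 is a user-mode read of an unmapped page, and a fault address of 8 usually indicates a field read through a null pointer, which matches the "suspect faulting address 0x8" in the Envoy backtrace.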

Can you update your Istio/Envoy version to see if the patch in Envoy fixes your issue? It appears the fix was merged right after 1.13.4, but it’s hard to say.

Thanks @nick_tetrate. Yeah, I’m trying to update Istio to 1.8.1 (but since a lot has changed between 1.5 and 1.8, it’s not super easy for my setup). Is there a way to update Envoy independently of Istio? If it’s possible, would that be a bad idea?

I would avoid updating Envoy directly. You could probably get the fix by migrating to Istio 1.6 or 1.7, though, which might be easier. I would also try that.