Istio healthz/ready endpoint not responding correctly, leading to failures

Context -
As discussed in https://github.com/istio/istio/issues/11130, I was facing same issues in running elasticsearch-curator with istio i.e. - the application’s (curator) outgoing communication is routed through the istio envoy proxy, but the proxy was not fully up before the application itself started, resulting in failures.

Istio version used: 1.4.6/1.5.2
To resolve this, I modified my application container’s entrypoint such that it waits for the istio envoy to be ready (by checking status of healthz/ready endpoint) and then start the curator process.

  while true; do
    curl -s http://localhost:15020/healthz/ready
    if [ $? -eq 0 ]; then
     echo "Istio envoy proxy is ready"
     break
    else
     sleep 3
    fi
  done
fi
/usr/bin/curator $@ 

This worked well for me, but intermittently I can see errors appearing again.

Issue -
As you can see from below logs, when the aplication checked the status of the istio proxy healthcheck endpoint /healthz/ready,
it responds at 01:00:11,594, the application prints the msg “Istio envoy proxy is ready” and then starts the app proccess (which fails due to n/w unreachability).

But from istio proxy logs, we see the proxy became ready only after -
2020-11-20T01:00:13.015142Z info Envoy proxy is ready

Application logs:

kubectl logs elk-curator-1605834000-8cdkt -c curator
Istio envoy proxy is ready
2020-11-20 01:00:11,594 INFO Preparing Action ID: 1, “delete_indices”
2020-11-20 01:00:11,594 INFO Creating client object and testing connection
2020-11-20 01:00:11,598 INFO Instantiating client object
2020-11-20 01:00:11,599 INFO Testing client connectivity
2020-11-20 01:00:11,900 ERROR HTTP N/A error: HTTPConnectionPool(host=‘elasticsearch’, port=9200): Max retries exceeded with url: / (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x7fd122821f50>: Failed to establish a new connection: [Errno 111] Connection refused’))
2020-11-20 01:00:11,900 CRITICAL Curator cannot proceed. Exiting.

Istio envoy proxy logs

kubectl logs elk-curator-1605834000-8cdkt -c istio-proxy | more
2020-11-20T01:00:07.620339Z info FLAG: --binaryPath=“/usr/local/bin/envoy”
2020-11-20T01:00:07.620376Z info FLAG: --concurrency=“2”
2020-11-20T01:00:07.620381Z info FLAG: --configPath=“/etc/istio/proxy”
2020-11-20T01:00:07.620385Z info FLAG: --connectTimeout=“10s”
2020-11-20T01:00:07.620388Z info FLAG: --controlPlaneAuthPolicy=“NONE”
2020-11-20T01:00:07.620393Z info FLAG: --controlPlaneBootstrap=“true”
2020-11-20T01:00:07.620395Z info FLAG: --customConfigFile=“”
2020-11-20T01:00:07.620421Z info FLAG: --datadogAgentAddress=“”
2020-11-20T01:00:07.620424Z info FLAG: --disableInternalTelemetry=“false”
2020-11-20T01:00:07.620427Z info FLAG: --discoveryAddress=“istio-pilot.istio-system:15010”
2020-11-20T01:00:07.620430Z info FLAG: --dnsRefreshRate=“300s”
2020-11-20T01:00:07.620436Z info FLAG: --domain=“sps-me.svc.cluster.local”
2020-11-20T01:00:07.620439Z info FLAG: --drainDuration=“45s”
2020-11-20T01:00:07.620442Z info FLAG: --envoyAccessLogService=“”
2020-11-20T01:00:07.620444Z info FLAG: --envoyMetricsService=“”
2020-11-20T01:00:07.620447Z info FLAG: --help=“false”
2020-11-20T01:00:07.620450Z info FLAG: --id=“”
2020-11-20T01:00:07.620453Z info FLAG: --ip=“”
2020-11-20T01:00:07.620455Z info FLAG: --lightstepAccessToken=“”
2020-11-20T01:00:07.620458Z info FLAG: --lightstepAddress=“”
2020-11-20T01:00:07.620460Z info FLAG: --lightstepCacertPath=“”
2020-11-20T01:00:07.620463Z info FLAG: --lightstepSecure=“false”
2020-11-20T01:00:07.620465Z info FLAG: --log_as_json=“false”
2020-11-20T01:00:07.620467Z info FLAG: --log_caller=“”
2020-11-20T01:00:07.620470Z info FLAG: --log_output_level=“default:info”
2020-11-20T01:00:07.620472Z info FLAG: --log_rotate=“”
2020-11-20T01:00:07.620475Z info FLAG: --log_rotate_max_age=“30”
2020-11-20T01:00:07.620478Z info FLAG: --log_rotate_max_backups=“1000”
2020-11-20T01:00:07.620481Z info FLAG: --log_rotate_max_size=“104857600”
2020-11-20T01:00:07.620483Z info FLAG: --log_stacktrace_level=“default:none”
2020-11-20T01:00:07.620489Z info FLAG: --log_target=“[stdout]”
2020-11-20T01:00:07.620492Z info FLAG: --mixerIdentity=“”
2020-11-20T01:00:07.620494Z info FLAG: --outlierLogPath=“”
2020-11-20T01:00:07.620497Z info FLAG: --parentShutdownDuration=“1m0s”
2020-11-20T01:00:07.620500Z info FLAG: --pilotIdentity=“”
2020-11-20T01:00:07.620504Z info FLAG: --proxyAdminPort=“15000”
2020-11-20T01:00:07.620507Z info FLAG: --proxyComponentLogLevel=“misc:error”
2020-11-20T01:00:07.620509Z info FLAG: --proxyLogLevel=“warning”
2020-11-20T01:00:07.620512Z info FLAG: --serviceCluster=“elk-curator-1605834000.sps-me”
2020-11-20T01:00:07.620515Z info FLAG: --serviceregistry=“Kubernetes”
2020-11-20T01:00:07.620517Z info FLAG: --statsdUdpAddress=“”
2020-11-20T01:00:07.620520Z info FLAG: --statusPort=“15020”
2020-11-20T01:00:07.620523Z info FLAG: --stsPort=“0”
2020-11-20T01:00:07.620525Z info FLAG: --templateFile=“”
2020-11-20T01:00:07.620528Z info FLAG: --tokenManagerPlugin=“GoogleTokenExchange”
2020-11-20T01:00:07.620530Z info FLAG: --trust-domain=“”
2020-11-20T01:00:07.620533Z info FLAG: --zipkinAddress=“jaeger.istio-system:9411”
2020-11-20T01:00:07.620555Z info Version istio-release-1.5.2-68d381dde45b34f82a7247f20840829f1ee56fc1-dirty-Clean
2020-11-20T01:00:07.620701Z info Obtained private IP [192.168.58.41]
2020-11-20T01:00:07.620734Z info Proxy role: &model.Proxy{ClusterID:“”, Type:“sidecar”, IPAddresses:string{“192.168.58
.41”, “192.168.58.41”}, ID:“elk-curator-1605834000-8cdkt.sps-me”, Locality:(*envoy_api_v2_core.Locality)(nil), DNSDo
main:“sps-me.svc.cluster.local”, ConfigNamespace:“”, Metadata:(*model.NodeMetadata)(nil), SidecarScope:(*model.SidecarScop
e)(nil), MergedGateway:(*model.MergedGateway)(nil), ServiceInstances:*model.ServiceInstance(nil), WorkloadLabels:labels.Colle
ction(nil), IstioVersion:(*model.IstioVersion)(nil)}
2020-11-20T01:00:07.620740Z info PilotSAN string(nil)
2020-11-20T01:00:07.620743Z info MixerSAN string(nil)
2020-11-20T01:00:07.621188Z info Effective config: binaryPath: /usr/local/bin/envoy
concurrency: 2
configPath: /etc/istio/proxy
connectTimeout: 10s
discoveryAddress: istio-pilot.istio-system:15010
drainDuration: 45s
envoyAccessLogService: {}
envoyMetricsService: {}
parentShutdownDuration: 60s
proxyAdminPort: 15000
serviceCluster: elk-curator-1605834000.sps-me
statNameLength: 189
tracing:
zipkin:
address: jaeger.istio-system:9411

2020-11-20T01:00:07.621206Z info JWT policy is third-party-jwt
2020-11-20T01:00:07.621267Z warn Missing JWT token, can’t use in process SDS ./var/run/secrets/tokens/istio-tokenstat ./
var/run/secrets/tokens/istio-token: no such file or directory
2020-11-20T01:00:07.621301Z info PilotSAN string(nil)
2020-11-20T01:00:07.621316Z info Starting proxy agent
2020-11-20T01:00:07.621368Z info Opening status port 15020

2020-11-20T01:00:07.621661Z info Received new config, creating new Envoy epoch 0
2020-11-20T01:00:07.621727Z info Epoch 0 starting
2020-11-20T01:00:07.621754Z info watching /etc/certs for changes
2020-11-20T01:00:07.655353Z warn failed to read pod labels: open ./etc/istio/pod/labels: no such file or directory
2020-11-20T01:00:07.656170Z info Envoy command: [-c /etc/istio/proxy/envoy-rev0.json --restart-epoch 0 --drain-time-s 45
–parent-shutdown-time-s 60 --service-cluster elk-curator-1605834000.sps-me --service-node sidecar~192.168.58.41~bt
el-elk-curator-1605834000-8cdkt.sps-me~sps-me.svc.cluster.local --max-obj-name-len 189 --local-address-ip-version v4
–log-format [Envoy (Epoch 0)] [%Y-%m-%d %T.%e][%t][%l][%n] %v -l warning --component-log-level misc:error --concurrency 2]
[Envoy (Epoch 0)] [2020-11-20 01:00:07.713][26][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_vir
tual_includes/grpc_stream_lib/common/config/grpc_stream.h:91] gRPC config stream closed: 14, no healthy upstream
[Envoy (Epoch 0)] [2020-11-20 01:00:07.713][26][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_vir
tual_includes/grpc_stream_lib/common/config/grpc_stream.h:54] Unable to establish new stream
2020-11-20T01:00:08.449484Z info watchFileEvents: “/etc/certs”: MODIFY|ATTRIB
2020-11-20T01:00:08.450026Z info watchFileEvents: “/etc/certs”: MODIFY|ATTRIB
2020-11-20T01:00:08.450059Z info watchFileEvents: “/etc/certs/…2020_11_20_01_00_01.129277881”: MODIFY|ATTRIB
2020-11-20T01:00:08.450097Z info watchFileEvents: “/etc/certs/…2020_11_20_01_00_01.129277881”: MODIFY|ATTRIB
2020-11-20T01:00:09.018001Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updat
es: 1 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-11-20T01:00:09.110183Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updat
es: 1 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-11-20T01:00:09.568375Z info watchFileEvents: “/etc/certs”: MODIFY|ATTRIB
2020-11-20T01:00:09.568468Z info watchFileEvents: “/etc/certs/…2020_11_20_01_00_01.129277881”: MODIFY|ATTRIB
2020-11-20T01:00:11.013811Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updat
es: 1 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-11-20T01:00:13.015142Z info Envoy proxy is ready
2020-11-20T01:00:13.764544Z info watchFileEvents: “/etc/certs”: MODIFY|ATTRIB
2020-11-20T01:00:13.764605Z info watchFileEvents: “/etc/certs/…2020_11_20_01_00_01.129277881”: MODIFY|ATTRIB
2020-11-20T01:00:18.449687Z info watchFileEvents: notifying

Why is the http://localhost:15020/healthz/ready endpoint responding even when its not actually ready? What could be the issue here?
Also, Is there a better way to check for proxy’s readiness than the one used here? Pls note: I am looking for a reliable solution for istio 1.4/1.5 as I cannot upgrade to istio 1.7 currently (so cannot use the values.global.proxy.holdApplicationUntilProxyStarts parameter).

There are a few solutions that do this. ive used this in the past https://github.com/monzo/envoy-preflight