Istio 1.4.9 - Spark worker resets connection with driver when sidecars are turned on and redirection is off

I’m running into an interesting issue when turning on sidecars for a Spark Standalone cluster and was wondering if anyone has seen something similar. My impression was that there shouldn’t be any observable difference between turning sidecars off and turning sidecars on with all traffic bypassing the proxy.

However, that doesn’t seem to be the case. With sidecars on and all traffic redirected through the proxy, everything works fine (everything also works without sidecars). If I annotate the pods so the sidecar bypasses the proxy for all traffic, the Spark worker resets the connection to the Spark driver within about one second.
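For context, the bypass is done with the standard Istio traffic-capture annotations on the pod template. A minimal sketch of what that looks like is below; the deployment name, labels, and image are placeholders, not my actual manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker            # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-worker
  template:
    metadata:
      labels:
        app: spark-worker
      annotations:
        sidecar.istio.io/inject: "true"
        # Empty include lists mean iptables captures nothing, so all
        # inbound and outbound traffic bypasses the Envoy sidecar even
        # though the sidecar container is still injected.
        traffic.sidecar.istio.io/includeOutboundIPRanges: ""
        traffic.sidecar.istio.io/includeInboundPorts: ""
    spec:
      containers:
        - name: spark-worker
          image: spark:2.4.5    # placeholder image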

The IP traffic (captured with ksniff) looks identical to the no-sidecar case right up until the Spark worker resets the connection. If anyone has thoughts, I’d really appreciate it; this has been driving me crazy for three days. Here are the Spark driver logs, which aren’t particularly helpful:

[2020-07-14 15:45:31,829] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task 20/07/14 15:45:31 WARN server.TransportChannelHandler: Exception in connection from /172.31.10.104:41978
[2020-07-14 15:45:31,830] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task java.io.IOException: Connection reset by peer
[2020-07-14 15:45:31,830] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
[2020-07-14 15:45:31,830] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
[2020-07-14 15:45:31,830] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
[2020-07-14 15:45:31,830] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at sun.nio.ch.IOUtil.read(IOUtil.java:192)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
[2020-07-14 15:45:31,831] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
[2020-07-14 15:45:31,832] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task at java.lang.Thread.run(Thread.java:748)
[2020-07-14 15:45:31,833] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task 20/07/14 15:45:31 ERROR scheduler.TaskSchedulerImpl: Lost executor 0 on 172.31.10.104: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2020-07-14 15:45:31,839] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task 20/07/14 15:45:31 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.31.10.104, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[2020-07-14 15:45:31,843] {base_task_runner.py:101} {PID:1 TID:140419165804288} INFO - Job 1140: Subtask spark_task 20/07/14 15:45:31 WARN spark.ExecutorAllocationManager: Attempted to mark unknown executor 0 idle

This definitely isn’t a memory issue, since everything works when traffic is redirected through the proxy (and I’ve also tried allocating twice the memory).

@patrick.chen were you able to get this fixed? I am facing the same issue.