I am running into an issue where an executor is launching tasks that never finish, causing the entire job to hang. These tasks normally complete in ~4 seconds, but the stuck ones have been running for an hour and a half. This does not happen consistently; I’d say it occurs around 1/3 of the time.
I have 32 c3.8xlarge nodes, so each has 32 cores, 60 GiB of memory, and 640 GB of SSD storage.
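Roughly how the executors are configured (one executor per node, using all of its cores; the exact values, especially memory, are approximate rather than the literal settings):

```scala
import org.apache.spark.SparkConf

// Approximate executor sizing: one executor per c3.8xlarge node,
// all 32 cores per executor. Memory value is illustrative.
val conf = new SparkConf()
  .setAppName("combine-job")
  .set("spark.executor.instances", "32") // one executor per node
  .set("spark.executor.cores", "32")     // all 32 cores on the node
  .set("spark.executor.memory", "48g")   // leaving headroom out of the 60 GiB
```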
I have RDDs from many sources that are processed and combined together. The tasks seem to get stuck at a consistent part of the DAG, though I haven’t checked the location enough times to be sure it’s always the same place. At the point where I observed the hang, all of the RDDs had already been combined into one (a rough sketch of the job’s shape is below).
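The sketch below only shows the structure; the real sources, paths, and transformations are placeholders, but the shape matches: several RDDs built from different inputs, then unioned and reduced into a single RDD.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("combine-job-sketch"))

// Each source is loaded and pre-processed independently.
// Paths and the word-count-style logic are placeholders.
val sources: Seq[RDD[(String, Long)]] =
  Seq("s3://bucket/source-a", "s3://bucket/source-b", "s3://bucket/source-c").map { path =>
    sc.textFile(path).map(line => (line.split("\t")(0), 1L))
  }

// All sources are combined into one RDD; the stuck tasks show up in the
// shuffle stage this produces.
val combined: RDD[(String, Long)] = sc.union(sources).reduceByKey(_ + _)

combined.count()
```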
Some odd symptoms I have noticed:
Only tasks on one executor get stuck. Some tasks from that executor did finish, but it ends up with 32 stuck tasks.
The stuck tasks have no startup time in the UI: 0 ms scheduler delay, 0 ms task deserialization time, 0 ms shuffle read time. There is only Executor Computing Time.
A thread dump of the stuck executor shows threads named Executor task launch worker-(32-63) instead of the usual Executor task launch worker-(0-31). These threads are all in state WAITING. The stage eventually reaches the point where every task except those 32 has completed, so while they are stuck, all the other tasks are still being launched and finishing normally.
This started happening when I upgraded from Spark 1.6.1 to 2.1.0; there was no problem with Spark 1.6.1.