Spark Jobs, Stages, Tasks
Every distributed computation in Spark is divided into smaller parts called jobs, stages and tasks. Knowing them is useful, especially during monitoring, because it helps to detect bottlenecks.
Job -> Stages -> Tasks. Any action is converted into a job, which in turn is divided into stages, with each stage having its own set of tasks.
Job
A job is a sequence of stages, triggered by an action such as count(), foreachRDD(), sortBy(), read() or write().
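As a minimal sketch (the SparkSession setup, application name and sample data are illustrative assumptions, not from the original text), each action call below launches at least one job, visible in the Jobs tab of the Spark UI:

import org.apache.spark.sql.SparkSession

// Sketch only: each action call below triggers at least one Spark job,
// visible under the "Jobs" tab of the Spark UI (http://localhost:4040 by default).
val spark = SparkSession.builder()
  .appName("jobs-sketch")   // illustrative name
  .master("local[2]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 1000, numSlices = 4)

numbers.count()                    // action -> one job
numbers.sortBy(identity).collect() // sortBy samples the data for range partitioning,
                                   // so it launches a job even before collect() does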
Stage
Each job is in turn composed of one or more stages, submitted for execution by the DAG scheduler. A stage is a set of operations (the tasks described later) that run the same functions, each applied to a different subset of the data depending on partitioning. Stages that are not interdependent can be submitted in parallel to improve processing throughput.
Each stage is either a shuffle map stage or a result stage. A shuffle map stage produces results that serve as input for subsequent stages. A result stage, on the other hand, produces results that are sent back to the driver (i.e. the results of Spark actions).
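For example, in a sketch reusing the SparkSession assumed in the previous snippet, a shuffle operation such as reduceByKey splits the job triggered by count() into a shuffle map stage and a result stage, both visible in the Stages tab:

// Sketch: reduceByKey introduces a shuffle boundary, so the single count() job
// is split into a shuffle map stage (local aggregation) and a result stage
// (final count returned to the driver).
val pairs = spark.sparkContext
  .parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)), numSlices = 2)

val reduced = pairs.reduceByKey(_ + _) // transformation: marks the stage boundary
reduced.count()                        // action: 1 job with 2 stages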
Task
Each stage consists of one or more tasks. A task is the smallest unit of execution, used to compute a new RDD partition. It is represented by the command sent by the driver to an executor, containing a serialized form of the computation. More precisely, the command is a closure with all the methods and variables needed to perform the computation. It is important to remember that these variables are only copies of the objects declared in the driver program and are not shared among executors (i.e. each executor operates on its own copy, which can end up with a different value). Tasks are executed on executors, and their number depends on the number of partitions: one task is needed per partition.
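The copied-closure behaviour can be sketched as follows (again reusing the assumed SparkSession; the variable and accumulator names are illustrative):

// Sketch of the copied-closure behaviour: the driver-side variable is serialized
// into each task's closure, so executors increment their own copies. On a cluster
// the driver never sees the updates; even in local mode this must not be relied on.
val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4) // 4 partitions -> 4 tasks per stage

var localCounter = 0
rdd.foreach(_ => localCounter += 1)
println(s"driver-side counter: $localCounter") // stays 0 on a cluster

// An accumulator is the supported way to send per-task updates back to the driver.
val acc = spark.sparkContext.longAccumulator("processed")
rdd.foreach(_ => acc.add(1))
println(s"accumulator value: ${acc.value}") // 100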