Spark Jobs, Stages, Tasks

by beginnershadoop · Published September 27, 2019 · Updated September 27, 2019

Every distributed computation is divided in small parts called jobs, stages and tasks. It’s useful to know them especially during monitoring because it helps to detect bottlenecks.

Job -> Stages -> Tasks . So any action is converted into Job which in turn is again divided into Stages, with each stage having its own set of Tasks.

Job

A job is a sequence of stages, triggered by an action such as .count(), foreachRdd(), sortBy(), read() or write().

Stage

Each job in its side is composed of stage(s) submitted to execution by DAG scheduler. It’s a set of operations (= tasks described later) working on the identical functions but applied on data subsets depending on partitions. Stages that are not interdependent can be submitted in parallel to improve the processing throughput.

Each stage can be: shuffle map or result type. The first type represents stages those results are the inputs for next stages. In the other side, the result stage represents stages those results are sent to the driver (= results of Spark actions).

Task

Each stage has task(s). It’s the smallest unit of execution used to compute a new RDD. It’s represented by the command sent by the driver to executor with serialized form of computation. More precisely, the command is represented as a closure with all methods and variables needed to make computation. It’s important to remember that the variables are only the copies of objects declared in driver’s program and that they’re not shared among executors (i.e. each executor will operate on different object that can lead to different values). Tasks are executed on executors and their number depend on the number of partitions – 1 task is needed for 1 partition.

Tags: Spark Spark Interview

Spark Jobs, Stages, Tasks

You may also like...

Leave a Reply Cancel reply

Recent Posts

Archives

Categories

Spark Jobs, Stages, Tasks

Job

Stage

Task

Share this:

Related

You may also like...

JDBC in Spark SQL

Network Streaming in Spark

Distribution of Executors, Cores and Memory for a Spark Application

Leave a Reply Cancel reply

Recent Posts

Archives

Categories