Barrier Execution Mode in Spark
The barrier execution mode is experimental and handles only a limited set of scenarios. See SPIP: Barrier Execution Mode and Design Doc.
In case of a task failure, instead of restarting only the failed task, Spark aborts the entire stage and re-launches all of its tasks.
Use the RDD.barrier transformation to mark the current stage as a barrier stage.
barrier(): RDDBarrier[T]
barrier
simply creates an RDDBarrier that comes with the barrier-aware mapPartitions transformation.
mapPartitions[S: ClassTag](
f: Iterator[T] => Iterator[S],
preservesPartitioning: Boolean = false): RDD[S]
mapPartitions
simply changes the regular RDD.mapPartitions transformation to create a MapPartitionsRDD with the isFromBarrier flag enabled.
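As a minimal sketch of how these two transformations are used together (the `sc` handle, the 4-partition RDD, and the `local[4]` master are illustrative assumptions, not from the text):

```scala
// Assumes a SparkContext `sc` on a local[4] master, so there are
// 4 slots and all 4 barrier tasks can be launched at the same time.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = org.apache.spark.BarrierTaskContext.get()
  ctx.barrier()   // global barrier: blocks until every task in the stage arrives
  iter.map(_ * 2)
}
doubled.collect()
```

Inside a barrier stage, tasks use BarrierTaskContext (rather than the regular TaskContext) to coordinate, e.g. via the blocking `barrier()` call shown above.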
Task
has an isBarrier flag that says whether the task belongs to a barrier stage (default: false).
Spark must launch all the tasks at the same time for a barrier stage.
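A consequence of this all-at-once launch requirement: if a barrier stage has more tasks than the cluster has slots, the job cannot run. A sketch of the failing shape (the `sc` handle and partition counts are assumptions for illustration):

```scala
// On a local[4] master there are only 4 slots, but this barrier
// stage would need 16 tasks launched simultaneously.
val rdd = sc.parallelize(1 to 100, numSlices = 16)

// Expected to fail at scheduling time (a SparkException about
// insufficient resources) rather than hang waiting for slots:
// rdd.barrier().mapPartitions(identity).collect()
```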
An RDD is in a barrier stage if at least one of its parent RDDs, or the RDD itself, is mapped from an RDDBarrier.
ShuffledRDD always has the isBarrier flag disabled (false).
MapPartitionsRDD is the only RDD that can have the isBarrier flag enabled.
RDDBarrier.mapPartitions is the only transformation that creates a MapPartitionsRDD with the isFromBarrier flag enabled.
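The propagation rules above imply that a shuffle boundary ends a barrier stage. A sketch (the `sc` handle and the specific transformations are assumptions for illustration):

```scala
// MapPartitionsRDD created with the isFromBarrier flag enabled,
// so this RDD's stage is a barrier stage.
val barrierRDD = sc.parallelize(1 to 100, numSlices = 4)
  .barrier()
  .mapPartitions(identity)

// repartition introduces a ShuffledRDD; since ShuffledRDD always has
// isBarrier disabled, the post-shuffle stage is a regular stage again.
val downstream = barrierRDD.repartition(2).map(_ + 1)
```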