Category: Spark

May 10, 2020

Apache Spark: WindowSpec & Window

WindowSpec is a window specification that defines which rows are included in a window (frame), i.e. the set of rows that are associated with the current row by some relation. WindowSpec takes the following when...

Scala / Spark

April 2, 2020

Barrier Execution Mode in Spark

The barrier execution mode is experimental and it only handles limited scenarios. See SPIP: Barrier Execution Mode and Design Doc. In case of a task failure, instead of only restarting the...

Spark

October 1, 2019

Spark Structured Streaming and Streaming Queries

Structured streaming: Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you...

Python / Scala / Spark / Spark Sql

October 1, 2019

Add constant column in spark

If we want to add a column with a default value, we can do so in Spark. In Spark 2.2 there are two ways to add a constant value in...

Hadoop / Scala / Spark

October 1, 2019

Salesforce connector in Spark

Salesforce is a customer relationship management solution that brings companies and customers together. It’s one integrated CRM platform that gives all your departments — including marketing, sales, commerce,...

Hadoop / Scala / Spark

September 30, 2019

Distribution of Executors, Cores and Memory for a Spark Application

Resource allocation is an important aspect of executing any Spark job. If not configured correctly, a Spark job can consume entire cluster resources and make other...

Scala / Spark

September 27, 2019

What are workers, executors, cores in Spark Standalone cluster?

Spark uses a master/slave architecture. As you can see in the figure, it has one central coordinator (Driver) that communicates with many distributed workers (executors). The driver and...

Scala / Spark

September 27, 2019

Difference between DataFrame, Dataset, and RDD in Spark

RDD: An RDD is a fault-tolerant collection of elements that can be operated on in parallel. DataFrame: A DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a...

Scala / Spark / spark interview

September 27, 2019

Spark Jobs, Stages, Tasks

Every distributed computation is divided into smaller parts called jobs, stages, and tasks. It’s useful to know them, especially during monitoring, because it helps to detect bottlenecks. Job -> Stages -> Tasks...

Scala / Spark / spark interview / Spark-Submit

September 27, 2019

Spark Interview Questions : Basic

1. What is SparkContext? “SparkContext” is the main entry point for Spark functionality. A “SparkContext” represents the connection to a Spark cluster, and can be used to create...
