Difference between DataFrame, Dataset, and RDD in Spark

by beginnershadoop · September 27, 2019

`RDD`

RDD is a fault-tolerant collection of elements that can be operated on in parallel.

`DataFrame`

DataFrame is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.

`Dataset`

Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.

Note:
Dataset of Rows (Dataset[Row]) in Scala/Java will often refer as DataFrames.

`Nice comparison of all of them with a code snippet.`

source

Q: Can you convert one to the other like RDD to DataFrame or vice-versa?

Yes, both are possible

1. RDD to DataFrame with .toDF()

val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row("first", 2.0, 7.0),
    Row("second", 3.5, 2.5),
    Row("third", 7.0, 5.9)
  )
)

val df = spark.createDataFrame(rowsRdd).toDF("id", "val1", "val2")

df.show()
+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+

more ways: Convert an RDD object to Dataframe in Spark

2. DataFrame/DataSet to RDD with .rdd() method

val rowsRdd: RDD[Row] = df.rdd() // DataFrame to RDD

Difference between DataFrame, Dataset, and RDD in Spark

You may also like...

Leave a Reply Cancel reply

Recent Posts

Archives

Categories

Difference between DataFrame, Dataset, and RDD in Spark

RDD

DataFrame

Dataset

Nice comparison of all of them with a code snippet.

Yes, both are possible

Share this:

Related

You may also like...

Machine Learning: Logistic Regression using Apache Spark

Spark SQL Using Hive

File Operation in scala

Leave a Reply Cancel reply

Recent Posts

Archives

Categories

`RDD`

`DataFrame`

`Dataset`

`Nice comparison of all of them with a code snippet.`