Difference between DataFrame, Dataset, and RDD in Spark
RDD
RDD
is a fault-tolerant collection of elements that can be operated on in parallel.
DataFrame
DataFrame
is a Dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimisations under the hood.
Dataset
Dataset
is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine.
Note:
Dataset of Rows (
Dataset[Row]
) in Scala/Java will often refer as DataFrames.
Nice comparison of all of them with a code snippet.
Q: Can you convert one to the other like RDD to DataFrame or vice-versa?
Yes, both are possible
1. RDD
to DataFrame
with .toDF()
val rowsRdd: RDD[Row] = sc.parallelize(
Seq(
Row("first", 2.0, 7.0),
Row("second", 3.5, 2.5),
Row("third", 7.0, 5.9)
)
)
val df = spark.createDataFrame(rowsRdd).toDF("id", "val1", "val2")
df.show()
+------+----+----+
| id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
more ways: Convert an RDD object to Dataframe in Spark
2. DataFrame
/DataSet
to RDD
with .rdd()
method
val rowsRdd: RDD[Row] = df.rdd() // DataFrame to RDD