Add a constant column in Spark
If we want to add a column with a default (constant) value, Spark makes this straightforward. In Spark 2.2 there are two ways to add a constant value to a DataFrame column:
1) Using lit
2) Using typedLit
The difference between the two is that typedLit can also handle parameterized Scala types, e.g. List, Seq, and Map. Spark 2.2 introduced typedLit to support Seq, Map, and Tuple (SPARK-19254), so the following calls are supported (Scala):
import org.apache.spark.sql.functions.typedLit

df.withColumn("some_array", typedLit(Seq(1, 2, 3)))
df.withColumn("some_struct", typedLit(("foo", 1, 0.3)))
df.withColumn("some_map", typedLit(Map("key1" -> 1, "key2" -> 2)))
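To see the difference in practice, here is a minimal sketch. It assumes Spark 2.2+ and a SparkSession named spark; the three-row df is purely illustrative:

import org.apache.spark.sql.functions.{lit, typedLit}

val df = spark.range(3).toDF("id")

// lit only accepts plain literal values; in Spark 2.x, passing a Seq fails
// at runtime with "Unsupported literal type":
// df.withColumn("some_array", lit(Seq(1, 2, 3)))

// typedLit captures the Scala type through a TypeTag, so the new column
// gets a proper array<int> type in the schema:
df.withColumn("some_array", typedLit(Seq(1, 2, 3))).printSchema()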
Spark 1.3+ (lit), 1.4+ (array, struct), 2.0+ (map):
The second argument of DataFrame.withColumn must be a Column, so you have to use a literal:
from pyspark.sql.functions import lit

df.withColumn('new_column', lit(10))
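The same rule holds in Scala, where it is enforced at compile time rather than at runtime; a quick sketch:

import org.apache.spark.sql.functions.lit

// withColumn(name, col) expects a Column, so a bare value does not compile:
// df.withColumn("new_column", 10)    // type mismatch: Int is not a Column
df.withColumn("new_column", lit(10))  // OK: lit wraps the constant in a Column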
If you need complex columns, you can build them using blocks like array:
from pyspark.sql.functions import array, create_map, lit, struct

df.withColumn("some_array", array(lit(1), lit(2), lit(3)))
df.withColumn("some_struct", struct(lit("foo"), lit(1), lit(0.3)))
df.withColumn("some_map", create_map(lit("key1"), lit(1), lit("key2"), lit(2)))
Exactly the same methods can be used in Scala.
import org.apache.spark.sql.functions.{array, lit, map, struct}

df.withColumn("new_column", lit(10))
df.withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
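To confirm what these builders produce, printSchema shows the inferred types of the constant columns (a sketch; df is any existing DataFrame):

import org.apache.spark.sql.functions.{lit, map}

df.withColumn("new_column", lit(10))
  .withColumn("map", map(lit("key1"), lit(1), lit("key2"), lit(2)))
  .printSchema()
// new_column is inferred as integer, map as map<string, int>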
To provide names for structs, use either alias on each field:
df.withColumn(
  "some_struct",
  struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
)
or cast on the whole object:
df.withColumn(
  "some_struct",
  struct(lit("foo"), lit(1), lit(0.3)).cast("struct<x: string, y: integer, z: double>")
)
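Either way, the payoff is that the fields can then be addressed by name; a quick sketch:

import org.apache.spark.sql.functions.{lit, struct}

df.withColumn(
  "some_struct",
  struct(lit("foo").alias("x"), lit(1).alias("y"), lit(0.3).alias("z"))
).select("some_struct.y")  // select the field y of the constant struct by name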
It is also possible, although slower, to use a UDF.
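For completeness, a minimal sketch of the UDF route (Scala); it works, but the constant is opaque to the Catalyst optimizer, which is why it is slower:

import org.apache.spark.sql.functions.udf

// A zero-argument UDF that always returns the constant value. Unlike lit,
// the optimizer cannot constant-fold or push down through a UDF call.
val constTen = udf(() => 10)
df.withColumn("new_column", constTen())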