User-Defined Functions (UDFs) in Spark
User-Defined Functions (UDFs) are a feature of Spark SQL for defining new Column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets. A UDF is a simple way of adding your own function to the Spark SQL language: it operates on distributed DataFrames and works row by row.
Suppose we have a requirement to convert a string column into a flag: if the column value is "yes", assign "1", otherwise "0".
def populate: String => String = (input: String) => {
  if (input.equals("yes")) "1" else "0"
}
Here, populate is an ordinary Scala function that takes a string parameter and returns a string. Now that the function is ready, the next step is to mark it as a UDF. We define a new UDF by passing the Scala function to Spark's udf function:
import org.apache.spark.sql.functions.udf

val populateValueUdf = udf(populate)
import org.apache.spark.sql.functions.col

val df = spark.createDataFrame(Seq(
  (1, "Harry", "yes"),
  (2, "Mary", "yes"),
  (3, "John", "no"),
  (4, "Brown", "no")
)).toDF("Id", "Name", "owns_car")

val result = df.withColumn("owns_car", populateValueUdf(col("owns_car")))
result.show
Result:
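Every "yes" is mapped to "1" and every "no" to "0" (the column stays a string, since the UDF returns String). The output of result.show should look like:

+--+-----+--------+
|Id| Name|owns_car|
+--+-----+--------+
| 1|Harry|       1|
| 2| Mary|       1|
| 3| John|       0|
| 4|Brown|       0|
+--+-----+--------+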
We can also create the UDF inline, without a separate named function. Here is a sample:
val checkAndPopulateUdf = udf((colValue: String, target: String) => {
  if (colValue.equals(target)) "1" else "0"
})
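This variant takes two arguments: the column value and the target string to compare against. Since a UDF's arguments must be Columns, the comparison string has to be wrapped with lit. A minimal usage sketch (inlineResult is just an illustrative name):

import org.apache.spark.sql.functions.lit

val inlineResult = df.withColumn("owns_car", checkAndPopulateUdf(col("owns_car"), lit("yes")))
inlineResult.show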
UDFs are a black box to Catalyst, Spark's query optimizer, which cannot inspect or optimize them, so don't use them unless you have no choice. The same problem can be solved without a UDF, using built-in column functions:
import org.apache.spark.sql.functions.{when, lit}

val newdf = df.withColumn("owns_car", when(col("owns_car") === "yes", 1).otherwise(lit(0)))
newdf.show
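One concrete benefit of the built-in version is null handling. The sketch below adds a hypothetical row with a null value: the when/otherwise expression maps it to 0, whereas the populate UDF would throw a NullPointerException when input.equals("yes") is called on null.

// Hypothetical extra row with a null owns_car value
val dfWithNull = spark.createDataFrame(Seq(
  (5, "Jane", null: String)
)).toDF("Id", "Name", "owns_car")

// Built-in functions are null-safe: col("owns_car") === "yes" evaluates to null,
// the when branch does not match, and otherwise(0) applies
dfWithNull.withColumn("owns_car", when(col("owns_car") === "yes", 1).otherwise(lit(0))).show

// The UDF version fails at runtime, because populate calls .equals on a null input:
// dfWithNull.withColumn("owns_car", populateValueUdf(col("owns_car"))).show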