Spark SQL Using Parquet
Today, I’m focusing on how to use the Parquet format in Spark. If you are new to this format, please read up on Parquet first to get more insight into it.
Parquet: Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
Working with Parquet is pretty straightforward because Spark provides built-in support for the Parquet format. To load a Parquet file, you only have to provide the file location; Spark will read the data automatically.
val parquetFile = sqlContext.read.parquet("resources/wiki_parquet")
After reading the data, register the resulting DataFrame as a temporary table and give the table a name.
parquetFile.registerTempTable("employee")
Then you can run SQL queries against that table as per your need, as shown below.
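For example, the query used in the complete code below selects every row from the employee table registered above and prints the result:

val allrecords = sqlContext.sql("SELECT * FROM employee")
allrecords.show()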
Complete code:
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql._
import org.apache.log4j.{ Level, Logger }

object SparkUsingParquet {
  def main(args: Array[String]) {
    // Configure and create the Spark context and SQL context
    val sparkConf = new SparkConf().setAppName("Spark SQL parquet").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // Load the Parquet file and register it as a temporary table
    val parquetFile = sqlContext.read.parquet("resources/wiki_parquet")
    parquetFile.registerTempTable("employee")

    // Query the table and show the results
    val allrecords = sqlContext.sql("SELECT * FROM employee")
    allrecords.show()
  }
}
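As a side note, the same DataFrame API can also write results back out in Parquet format. A minimal sketch follows; the output path resources/employee_output is hypothetical and not part of the original example:

// Write the query result back to disk as Parquet (output path is an assumption)
allrecords.write.parquet("resources/employee_output")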