Redshift database connection in Spark
This blog primarily focuses on how to connect to Redshift from Spark.
Amazon Redshift is a fully managed petabyte-scale data warehouse service. Redshift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools.
Before stepping into the next level, let's go over the prerequisites for running the sample program.
Prerequisites:
- Apache Spark: assumes the user has Apache Spark installed.
- Redshift credentials: the user has valid Redshift credentials.
First, download the Redshift JDBC jar from the Amazon downloads page using the following command:
wget https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC42-1.2.8.1005.jar
Or click the link and the jar will be downloaded automatically.
Now we are ready to launch spark-shell. Notice that the spark-shell command below includes a --packages option: Databricks provides the spark-redshift connector, so we pass its package coordinates there, and Amazon provides the aws-java-sdk, which lets us use Amazon services without complex coding.
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0,com.amazonaws:aws-java-sdk-s3:1.11.200,com.amazonaws:aws-java-sdk:1.11.200 --jars RedshiftJDBC42-1.2.8.1005.jar
For a Gradle build, add the following dependencies:
- the spark-redshift connector,
- the aws-java-sdk, and
- the Redshift JDBC driver jar.
dependencies {
    compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '<choose-version>'
    compile group: 'com.databricks', name: 'spark-redshift_2.11', version: '<choose-version>'
    compile files('/user/RedshiftJDBC42-1.2.8.1005.jar')
}
Code:
import org.apache.spark.sql.SparkSession

object RedshiftConnection {
  def main(args: Array[String]): Unit = {
    // provide your database name here
    val rsDbName = ""
    // provide your user name here
    val rsUser = ""
    // provide your password here
    val rsPassword = ""
    // provide your cluster endpoint here, e.g. "host:5439"
    val rsURL = "DNS:Port"
    val jdbcURL = s"""jdbc:redshift://$rsURL/$rsDbName?user=$rsUser&password=$rsPassword"""
    println(jdbcURL)

    // inside spark-shell the session already exists as `spark`;
    // in a standalone program we create it explicitly
    val sparkSession = SparkSession.builder().appName("RedshiftConnection").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    // Load the result of a query into a DataFrame
    val query = "select distinct(class) from temptable"
    val dBDF = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbcURL)
      .option("query", query)
      // note: the connector stages data through S3, so tempdir
      // should normally be an S3 path such as s3n://<bucket>/tmp
      .option("tempdir", "/tmp/maogautam")
      .load()
    dBDF.show()
  }
}
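The connector can also write a DataFrame back to Redshift. Below is a minimal sketch of a round trip: it builds the same style of JDBC URL with a small helper, reads a query result, and saves it to a new table. The host, database, credentials, target table name (class_copy), and S3 bucket are all placeholders, not values from this post, and the code assumes a Spark 2.x session.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RedshiftWrite {
  // Pure helper that assembles the Redshift JDBC URL from its parts
  def jdbcUrl(host: String, port: Int, db: String, user: String, password: String): String =
    s"jdbc:redshift://$host:$port/$db?user=$user&password=$password"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RedshiftWrite").getOrCreate()
    // placeholder endpoint and credentials
    val url = jdbcUrl("my-cluster.example.com", 5439, "dev", "admin", "secret")

    // read the same sample query as above
    val df = spark.sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", url)
      .option("query", "select distinct(class) from temptable")
      .option("tempdir", "s3n://my-bucket/tmp") // staging area in S3
      .load()

    // write the result to a new table; fails if the table already exists
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", url)
      .option("dbtable", "class_copy")
      .option("tempdir", "s3n://my-bucket/tmp")
      .mode(SaveMode.ErrorIfExists)
      .save()
  }
}
```

The write path goes through the same S3 tempdir as the read path, so the cluster needs S3 credentials either forwarded from Spark or configured on the Redshift side.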