Redshift Database Connection in Spark
This blog primarily focuses on how to connect to Redshift from Spark.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service. Redshift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools.
Before stepping into the next level, let's focus on the prerequisites for running the sample program.
Prerequisites:
Apache Spark: assumes the user has Apache Spark installed.
Redshift credentials: the user has valid Redshift credentials.
First, download the Redshift JDBC jar from the Amazon downloads page with the following command:
wget https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC42-1.2.8.1005.jar
Or click the link; it will automatically download the jar.
Now, get ready to launch spark-shell. Notice that the spark-shell command below passes package coordinates.
Databricks provides the spark-redshift connector, so we have to supply its package information.
Amazon provides the aws-java-sdk, which allows us to use Amazon services without complex coding.
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0,com.amazonaws:aws-java-sdk-s3:1.11.200,com.amazonaws:aws-java-sdk:1.11.200 --jars RedshiftJDBC42-1.2.8.1005.jar
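Note that the spark-redshift connector stages data through the location given by its tempdir option, which is normally an S3 path, so Spark also needs AWS credentials for that bucket. A minimal sketch inside spark-shell, assuming s3n-style Hadoop configuration and hypothetical key values (an IAM role is preferable in practice):

// Inside spark-shell, sc is the pre-created SparkContext.
// Hypothetical credentials; prefer IAM roles over hard-coded keys.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")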
For Gradle, add the following dependencies:
- Add the spark-redshift dependency
- Add aws-java-sdk
- Add the Redshift JDBC connector jar
dependencies {
    compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '<choose-version>'
    compile group: 'com.databricks', name: 'spark-redshift_2.11', version: '<choose-version>'
    compile files('/user/RedshiftJDBC42-1.2.8.1005.jar')
}
Code:
import org.apache.spark.sql.SparkSession

object RedshiftConnection {
  def main(args: Array[String]): Unit = {
    // Provide your database name here
    val rsDbName = ""
    // Provide your user name here
    val rsUser = ""
    // Provide your password here
    val rsPassword = ""
    // Provide your Redshift endpoint here, in the form "DNS:Port"
    val rsURL = "DNS:Port"
    val jdbcURL = s"jdbc:redshift://$rsURL/$rsDbName?user=$rsUser&password=$rsPassword"
    println(jdbcURL)

    // In spark-shell the session is pre-created as `spark`; in a standalone
    // program we build it ourselves.
    val sparkSession = SparkSession.builder().appName("RedshiftConnection").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    // Load the result of a query into a DataFrame
    val query = "select distinct(class) from temptable"
    val dBDF = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbcURL)
      .option("query", query)
      .option("tempdir", "/tmp/maogautam") // staging directory; typically an S3 path in practice
      .load()
    dBDF.show()
  }
}
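The same connector can also write a DataFrame back to Redshift. The following is a minimal sketch, assuming the jdbcURL from the code above plus a hypothetical target table and S3 staging path:

// Write the query result to a (hypothetical) Redshift table.
// The connector stages the DataFrame in tempdir and issues a COPY.
dBDF.write
  .format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "class_summary")         // hypothetical target table
  .option("tempdir", "s3n://your-bucket/tmp") // hypothetical S3 staging path
  .mode("error")                              // fail if the table already exists
  .save()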