Redshift Database Connection in Spark
This blog primarily focuses on how to connect to Redshift from Spark.
Amazon Redshift is a fully managed, petabyte-scale data warehouse service. Redshift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools.
Before stepping into the next level, let's look at the prerequisites for running the sample program.
Prerequisites:
Apache Spark: assumes the user has Apache Spark installed.
Redshift credentials: the user has valid Redshift credentials.
First, download the Redshift JDBC jar from the Amazon downloads page using the following command:
wget https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC42-1.2.8.1005.jar
Or click the link and it will download the jar automatically.
Now, get ready to launch spark-shell. Note that the spark-shell command below carries package information:
Databricks provides the spark-redshift connector, so we pass its package coordinates through --packages.
Amazon provides the aws-java-sdk, which lets us use Amazon services without complex coding.
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0,com.amazonaws:aws-java-sdk-s3:1.11.200,com.amazonaws:aws-java-sdk:1.11.200 --jars RedshiftJDBC42-1.2.8.1005.jar
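The connector stages data through the tempdir, and when that directory lives on S3 the underlying Hadoop filesystem needs AWS credentials. A minimal sketch run inside spark-shell, assuming key-based authentication (the key values below are placeholders; IAM roles can also be used):

// Make S3 credentials available to Hadoop's s3n filesystem inside spark-shell.
// The key values are placeholders, not real credentials.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")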
For Gradle, add the following dependencies:
- the spark-redshift dependency,
- the aws-java-sdk dependency, and
- the Redshift JDBC connector jar.
dependencies {
    compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '<choose-version>'
    compile group: 'com.databricks', name: 'spark-redshift_2.11', version: '<choose-version>'
    compile files('/user/RedshiftJDBC42-1.2.8.1005.jar')
}
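For context, a minimal build.gradle around that block might look like the sketch below; the Scala plugin, the scala-library and spark-sql dependencies, and their version numbers are assumptions to adjust to your environment:

// build.gradle sketch -- versions are assumptions; match them to your cluster
apply plugin: 'scala'

repositories {
    mavenCentral()
}

dependencies {
    compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.8'
    compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.1.0'
    compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '<choose-version>'
    compile group: 'com.databricks', name: 'spark-redshift_2.11', version: '<choose-version>'
    compile files('/user/RedshiftJDBC42-1.2.8.1005.jar')
}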
Code:
import org.apache.spark.sql.SparkSession

object RedshiftConnection {
  def main(args: Array[String]): Unit = {
    // provide your db name here
    val rsDbName = ""
    // provide your user name here
    val rsUser = ""
    // provide your password here
    val rsPassword = ""
    // provide your Redshift endpoint here
    val rsURL = "DNS:Port"
    val jdbcURL = s"""jdbc:redshift://$rsURL/$rsDbName?user=$rsUser&password=$rsPassword"""
    println(jdbcURL)

    // Build the SparkSession (inside spark-shell this already exists as `spark`)
    val sparkSession = SparkSession.builder().appName("RedshiftConnection").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    // Load from a query
    val query = "select distinct(class) from temptable"
    val dBDF = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbcURL)
      .option("query", query)
      .option("tempdir", "/tmp/maogautam") // staging directory; spark-redshift normally expects an S3 path (e.g. s3n://your-bucket/tmp)
      .load()
    dBDF.show()
  }
}
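To run this as a standalone application rather than in spark-shell, a spark-submit invocation along the following lines should work, reusing the same package coordinates as the spark-shell command above; the jar name redshift-connection.jar is a placeholder for whatever your build produces:

# Submit the compiled application; redshift-connection.jar is a placeholder name
spark-submit \
  --class RedshiftConnection \
  --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0,com.amazonaws:aws-java-sdk-s3:1.11.200,com.amazonaws:aws-java-sdk:1.11.200 \
  --jars RedshiftJDBC42-1.2.8.1005.jar \
  redshift-connection.jar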