Redshift database connection in Spark
This blog primarily focuses on how to connect to Redshift from Spark.
Amazon Redshift is a fully managed petabyte-scale data warehouse service. Redshift is designed for analytic workloads and connects to standard SQL-based clients and business intelligence tools.
Before stepping into the next level, let's go over the prerequisites for running the sample program.
Prerequisites:
- Apache Spark: assumes the user has Apache Spark installed.
- Redshift credentials: the user has valid Redshift credentials.
First, download the Redshift JDBC jar from the Amazon downloads page using the following command:
wget https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC42-1.2.8.1005.jar
Or click the link and the jar will be downloaded automatically.
Now we are ready to launch spark-shell. Notice that the spark-shell command below includes a --packages option: Databricks provides the spark-redshift connector, so we pass its package coordinates there, and Amazon provides the aws-java-sdk, which lets us use Amazon services without complex coding.
spark-shell --packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-redshift_2.10:1.1.0,com.amazonaws:aws-java-sdk-s3:1.11.200,com.amazonaws:aws-java-sdk:1.11.200 --jars RedshiftJDBC42-1.2.8.1005.jar
For a Gradle build, add the following dependencies:
- the spark-redshift connector,
- the aws-java-sdk, and
- the Redshift JDBC driver jar.
dependencies {
    compile group: 'com.amazonaws', name: 'aws-java-sdk', version: '<choose-version>'
    compile group: 'com.databricks', name: 'spark-redshift_2.11', version: '<choose-version>'
    compile files('/user/RedshiftJDBC42-1.2.8.1005.jar')
}
Code:
import org.apache.spark.sql.SparkSession

object RedshiftConnection {
  def main(args: Array[String]): Unit = {
    // provide your database name here
    val rsDbName = ""
    // provide your user name here
    val rsUser = ""
    // provide your password here
    val rsPassword = ""
    // provide your cluster endpoint here, e.g. "host:5439"
    val rsURL = "DNS:Port"
    val jdbcURL = s"""jdbc:redshift://$rsURL/$rsDbName?user=$rsUser&password=$rsPassword"""
    println(jdbcURL)

    // inside spark-shell the session already exists as `spark`;
    // in a standalone program we create it explicitly
    val sparkSession = SparkSession.builder().appName("RedshiftConnection").getOrCreate()
    val sqlContext = sparkSession.sqlContext

    // Load the result of a query into a DataFrame
    val query = "select distinct(class) from temptable"
    val dBDF = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", jdbcURL)
      .option("query", query)
      // note: the connector stages data through S3, so tempdir
      // should normally be an S3 path such as s3n://<bucket>/tmp
      .option("tempdir", "/tmp/maogautam")
      .load()
    dBDF.show()
  }
}
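The connector can also write a DataFrame back to Redshift. Below is a minimal sketch of a round trip: it builds the same style of JDBC URL with a small helper, reads a query result, and saves it to a new table. The host, database, credentials, target table name (class_copy), and S3 bucket are all placeholders, not values from this post, and the code assumes a Spark 2.x session.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object RedshiftWrite {
  // Pure helper that assembles the Redshift JDBC URL from its parts
  def jdbcUrl(host: String, port: Int, db: String, user: String, password: String): String =
    s"jdbc:redshift://$host:$port/$db?user=$user&password=$password"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RedshiftWrite").getOrCreate()
    // placeholder endpoint and credentials
    val url = jdbcUrl("my-cluster.example.com", 5439, "dev", "admin", "secret")

    // read the same sample query as above
    val df = spark.sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", url)
      .option("query", "select distinct(class) from temptable")
      .option("tempdir", "s3n://my-bucket/tmp") // staging area in S3
      .load()

    // write the result to a new table; fails if the table already exists
    df.write
      .format("com.databricks.spark.redshift")
      .option("url", url)
      .option("dbtable", "class_copy")
      .option("tempdir", "s3n://my-bucket/tmp")
      .mode(SaveMode.ErrorIfExists)
      .save()
  }
}
```

The write path goes through the same S3 tempdir as the read path, so the cluster needs S3 credentials either forwarded from Spark or configured on the Redshift side.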