Accessing Data Stored in Amazon S3 through Spark
Spark can access files in S3, even when running in local mode, given AWS credentials. By default, with s3a URLs, Spark will search for credentials in a few different places:
- Hadoop properties in core-site.xml:
  fs.s3a.access.key=xxxx
  fs.s3a.secret.key=xxxx
- The standard AWS environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID
- The EC2 instance profile, which picks up IAM roles
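As a quick illustration, if credentials are already available from any of the sources above, reading an s3a:// path needs no credential-related code at all (the bucket and key below are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

// No credentials are set here; they are resolved from core-site.xml, the
// standard environment variables, or the instance profile.
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("s3a default credentials")
  .getOrCreate()

val lines = spark.read.textFile("s3a://my-bucket/some/prefix/data.txt")
println(lines.count())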
However, it will not by default pick up credentials from the ~/.aws/credentials file, which is useful during local development if you are authenticating to AWS through SAML federation instead of with an IAM user.
The way to make this work is to set fs.s3a.aws.credentials.provider to com.amazonaws.auth.DefaultAWSCredentialsProviderChain, which behaves exactly like the AWS CLI: it honors the AWS environment variables as well as the credentials file, using the AWS_PROFILE environment variable to select a profile.
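A minimal sketch of this approach (the application name is arbitrary; the spark.hadoop. prefix forwards the property to the Hadoop configuration):

import org.apache.spark.sql.SparkSession

// Delegate S3A credential lookup to the AWS default provider chain, which also
// reads ~/.aws/credentials and honors AWS_PROFILE.
val spark = SparkSession
  .builder
  .appName("s3a with default credentials chain")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
  .getOrCreate()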
Alternatively, the keys can be set directly in Spark code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("spark application name")
  .config("spark.hadoop.fs.s3a.access.key", "my access key")
  .config("spark.hadoop.fs.s3a.secret.key", "my secret key")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()
You can also set the Hadoop configuration directly on the SparkContext as an alternative to the spark.hadoop.* properties:
// When setting the Hadoop configuration directly, drop the spark.hadoop. prefix.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "my access key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "my secret key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
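Whichever variant you use, a quick way to verify that S3 access works is to read something back (the bucket and key below are hypothetical placeholders):

val df = spark.read.text("s3a://my-bucket/path/to/file.txt")
df.show(5, truncate = false)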
After configuring S3 access through Spark, you might encounter the following issue:
No FileSystem for scheme: s3/s3n/s3a:
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
You have to include the AWS-related dependencies:

com.amazonaws:aws-java-sdk-pom:1.10.34
org.apache.hadoop:hadoop-aws:2.6.0
If you're using the Spark shell, pass them with --packages:
spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
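For a standalone application, you can declare the dependency in the build instead. A rough build.sbt sketch, assuming you match the hadoop-aws version to the Hadoop version bundled with your Spark distribution:

// build.sbt (sketch): hadoop-aws provides S3AFileSystem and should pull in a
// compatible AWS SDK transitively; adjust the version to match your Spark
// distribution's Hadoop version.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"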