Accessing Data Stored in Amazon S3 through Spark
Spark can access files in S3, even when running in local mode, given AWS credentials. By default, with s3a URLs, Spark will search for credentials in a few different places:
- Hadoop properties in core-site.xml:
  fs.s3a.access.key=xxxx
  fs.s3a.secret.key=xxxx
- The standard AWS environment variables AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID
- The EC2 instance profile, which picks up IAM roles
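As a quick illustration, if credentials are already available from any of the sources above, reading an s3a:// path needs no credential-related code at all (the bucket and key below are hypothetical placeholders):

import org.apache.spark.sql.SparkSession

// No credentials are set here; they are resolved from core-site.xml, the
// standard environment variables, or the instance profile.
val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("s3a default credentials")
  .getOrCreate()

val lines = spark.read.textFile("s3a://my-bucket/some/prefix/data.txt")
println(lines.count())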
However, it will not by default pick up credentials from the ~/.aws/credentials file, which is useful during local development if you are authenticating to AWS through SAML federation instead of with an IAM user.
The way to make this work is to set fs.s3a.aws.credentials.provider to com.amazonaws.auth.DefaultAWSCredentialsProviderChain, which behaves exactly like the AWS CLI: it honors the AWS environment variables as well as the credentials file, using the AWS_PROFILE environment variable to select a profile.
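A minimal sketch of this approach (the application name is arbitrary; the spark.hadoop. prefix forwards the property to the Hadoop configuration):

import org.apache.spark.sql.SparkSession

// Delegate S3A credential lookup to the AWS default provider chain, which also
// reads ~/.aws/credentials and honors AWS_PROFILE.
val spark = SparkSession
  .builder
  .appName("s3a with default credentials chain")
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
  .getOrCreate()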
Alternatively, the keys can be set directly in Spark code:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("spark application name")
  .config("spark.hadoop.fs.s3a.access.key", "my access key")
  .config("spark.hadoop.fs.s3a.secret.key", "my secret key")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .getOrCreate()
You can also set the Hadoop configuration directly on the SparkContext as an alternative to the spark.hadoop.* properties:
// When setting the Hadoop configuration directly, drop the spark.hadoop. prefix.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "my access key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "my secret key")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
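Whichever variant you use, a quick way to verify that S3 access works is to read something back (the bucket and key below are hypothetical placeholders):

val df = spark.read.text("s3a://my-bucket/path/to/file.txt")
df.show(5, truncate = false)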
After configuring S3 access through Spark, you might encounter the following issue:
No FileSystem for scheme: s3/s3n/s3a:
java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
You have to include the AWS-related dependencies:

com.amazonaws:aws-java-sdk-pom:1.10.34
org.apache.hadoop:hadoop-aws:2.6.0
If you're using the Spark shell, pass them with --packages:
spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
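For a standalone application, you can declare the dependency in the build instead. A rough build.sbt sketch, assuming you match the hadoop-aws version to the Hadoop version bundled with your Spark distribution:

// build.sbt (sketch): hadoop-aws provides S3AFileSystem and should pull in a
// compatible AWS SDK transitively; adjust the version to match your Spark
// distribution's Hadoop version.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"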