Most common issues faced by Spark developers and their solutions
1. Timeout waiting for connection from pool
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
To resolve this problem, increase the maximum number of S3A connections; the default is 15. Note that Hadoop options such as fs.s3a.connection.maximum must be prefixed with spark.hadoop. when passed through --conf, for example:
--conf "spark.hadoop.fs.s3a.connection.maximum=1000" --conf "spark.speculation=true" --conf "spark.hadoop.fs.s3a.connection.timeout=100000000"
2. Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
The default spark.sql.broadcastTimeout is 300 seconds; it is the timeout for the broadcast wait time in broadcast joins.
To overcome this problem, increase the timeout as required, for example:
--conf "spark.sql.broadcastTimeout=1200"
3. “org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]”
This is mainly caused by network timeouts, and there is a Spark configuration that helps avoid the problem:
--conf "spark.network.timeout=800"
4. “No space left on device”
This is primarily related to executor memory: when executors run low on memory, more shuffle and spill data is written to local disk, which can fill it up. Try increasing the executor memory, for example --executor-memory 20G.
There are a couple of other reasons:
- Heavy shuffle: if this is the scenario, look at the joins or repartition the data to reduce how much is shuffled, as in the sketch below.
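A minimal sketch of trimming a shuffle-heavy join, assuming hypothetical orders and customers tables and a customer_id join key:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-trim-example").getOrCreate()

// Hypothetical inputs, for illustration only
val orders = spark.read.parquet("s3a://my-bucket/orders")
val customers = spark.read.parquet("s3a://my-bucket/customers")

// Shuffle less data: keep only the columns and rows the join actually needs
val slimOrders = orders.select("customer_id", "amount").where("amount > 0")

// Repartition on the join key to spread the shuffle evenly across executors;
// 200 is an arbitrary partition count, tune it to the cluster
val joined = slimOrders
  .repartition(200, slimOrders("customer_id"))
  .join(customers, Seq("customer_id"))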
5. Spark SQL fails because “Constant pool has grown past JVM limit of 0xFFFF”
org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
Solution:
This is due to a known JVM limitation: the constant pool of a generated class cannot grow past 0xFFFF (65,535) entries.
This limitation was worked around in SPARK-18016, which is fixed in Spark 2.3.
On older Spark versions, try removing unnecessary columns from the DataFrame/Dataset, as in the sketch below.
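A minimal sketch, assuming a hypothetical very wide table and placeholder column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("trim-columns-example").getOrCreate()

// Hypothetical wide input, for illustration only
val wideDf = spark.read.parquet("s3a://my-bucket/wide-table")

// Keep only the columns the job actually uses; fewer columns means less
// generated code per projection and a smaller constant pool
val slimDf = wideDf.select("id", "event_time", "status")   // placeholder columns
slimDf.write.parquet("s3a://my-bucket/slim-table")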
6. No FileSystem for scheme: s3n/s3a
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
Solution: Include AWS dependencies
- com.amazonaws:aws-java-sdk-pom:1.10.34
- org.apache.hadoop:hadoop-aws:2.6.0
If you are accessing S3 through the spark-shell, specify the dependencies with --packages:
spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
Then set up the Spark properties accordingly:
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
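Putting it together, a minimal sketch of a session that can read s3a:// paths; the credentials are read from environment variables and the bucket path is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-read-example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // Placeholder credentials: taken from the environment rather than hard-coded
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// With the filesystem implementation and credentials in place, s3a:// paths resolve
val df = spark.read.text("s3a://my-bucket/some/file.txt")   // placeholder bucket/path
df.show(5)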