Most common issues faced by Spark developers and their solutions
1. Timeout waiting for connection from pool
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
To resolve this problem, increase the maximum number of S3A connections; the default is 15. Note that Hadoop options such as fs.s3a.connection.maximum must be prefixed with spark.hadoop. when passed through --conf, for example:
--conf "spark.hadoop.fs.s3a.connection.maximum=1000" --conf "spark.speculation=true" --conf "spark.hadoop.fs.s3a.connection.timeout=100000000"
2. Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
The default spark.sql.broadcastTimeout is 300 seconds; it is the timeout for the broadcast wait time in broadcast joins.
To overcome this problem, increase the timeout as required, for example:
--conf "spark.sql.broadcastTimeout=1200"
3. “org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]”
This is mainly caused by network timeouts, and there is a Spark configuration that helps avoid the problem:
--conf "spark.network.timeout=800"
4. “No space left on device”
This is primarily related to executor memory: when executors run low on memory, more shuffle and spill data is written to local disk, which can fill it up. Try increasing the executor memory, for example --executor-memory 20G.
There are a couple of other reasons:
- Heavy shuffle: if this is the scenario, look at the joins or repartition the data to reduce how much is shuffled, as in the sketch below.
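A minimal sketch of trimming a shuffle-heavy join, assuming hypothetical orders and customers tables and a customer_id join key:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shuffle-trim-example").getOrCreate()

// Hypothetical inputs, for illustration only
val orders = spark.read.parquet("s3a://my-bucket/orders")
val customers = spark.read.parquet("s3a://my-bucket/customers")

// Shuffle less data: keep only the columns and rows the join actually needs
val slimOrders = orders.select("customer_id", "amount").where("amount > 0")

// Repartition on the join key to spread the shuffle evenly across executors;
// 200 is an arbitrary partition count, tune it to the cluster
val joined = slimOrders
  .repartition(200, slimOrders("customer_id"))
  .join(customers, Seq("customer_id"))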
5. Spark SQL fails because “Constant pool has grown past JVM limit of 0xFFFF”
org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
Solution:
This is due to a known JVM limitation: the constant pool of a generated class cannot grow past 0xFFFF (65,535) entries.
This limitation was worked around in SPARK-18016, which is fixed in Spark 2.3.
On older Spark versions, try removing unnecessary columns from the DataFrame/Dataset, as in the sketch below.
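A minimal sketch, assuming a hypothetical very wide table and placeholder column names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("trim-columns-example").getOrCreate()

// Hypothetical wide input, for illustration only
val wideDf = spark.read.parquet("s3a://my-bucket/wide-table")

// Keep only the columns the job actually uses; fewer columns means less
// generated code per projection and a smaller constant pool
val slimDf = wideDf.select("id", "event_time", "status")   // placeholder columns
slimDf.write.parquet("s3a://my-bucket/slim-table")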
6. No FileSystem for scheme: s3n/s3a
java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
Solution: Include AWS dependencies
- com.amazonaws:aws-java-sdk-pom:1.10.34
- org.apache.hadoop:hadoop-aws:2.6.0
If you are accessing S3 through the spark-shell, specify the dependencies with --packages:
spark-shell --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
Then set up the Spark properties accordingly:
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key=ACCESSKEY
spark.hadoop.fs.s3a.secret.key=SECRETKEY
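Putting it together, a minimal sketch of a session that can read s3a:// paths; the credentials are read from environment variables and the bucket path is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-read-example")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  // Placeholder credentials: taken from the environment rather than hard-coded
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// With the filesystem implementation and credentials in place, s3a:// paths resolve
val df = spark.read.text("s3a://my-bucket/some/file.txt")   // placeholder bucket/path
df.show(5)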