Category: Scala
JDBC in Spark SQL
Apache Spark has a very powerful built-in API for gathering data from a relational database. Effectiveness and efficiency, following the usual Spark approach, are managed in a transparent way....
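As a taste of what the post covers, here is a minimal sketch of reading a table over JDBC with Spark's DataFrameReader. The connection URL, table name, and credentials are placeholders, and the matching JDBC driver (PostgreSQL here, as an assumption) must be on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object JdbcReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-read-example")
      .master("local[*]")
      .getOrCreate()

    // Read a table from a relational database over JDBC.
    // URL, table name, and credentials below are placeholders.
    val employeesDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/company")
      .option("dbtable", "public.employees")
      .option("user", "spark_user")
      .option("password", "secret")
      .load()

    employeesDf.show(5)
    spark.stop()
  }
}
```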
Machine Learning: Logistic Regression using Apache Spark
In this blog post, I’ll help you get started with Apache Spark’s spark.ml Logistic Regression for predicting whether someone makes more or less than $50,000. Classification...
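The sketch below shows the shape of a spark.ml LogisticRegression workflow on toy data; the feature values and the mapping of label 1.0 to ">50K" are assumptions for illustration, not the post's actual dataset or pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors

object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("logistic-regression-sketch")
      .master("local[*]")
      .getOrCreate()

    // Toy training data: label 1.0 stands for ">50K", 0.0 for "<=50K".
    // The two feature values per row are made up and already scaled.
    val training = spark.createDataFrame(Seq(
      (1.0, Vectors.dense(0.8, 0.9)),
      (0.0, Vectors.dense(0.2, 0.3)),
      (1.0, Vectors.dense(0.7, 0.8)),
      (0.0, Vectors.dense(0.1, 0.4))
    )).toDF("label", "features")

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    val model = lr.fit(training)
    model.transform(training)
      .select("label", "prediction", "probability")
      .show()

    spark.stop()
  }
}
```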
User-defined functions (UDF) in Spark
UDFs, or user-defined functions, are a simple way of adding a function to the Spark SQL language. Such a function operates on distributed DataFrames and works row by row....
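A minimal sketch of what that looks like: wrapping a plain Scala function as a UDF for the DataFrame API and registering the same function for use in SQL. The column and view names are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("alice", "bob").toDF("name")

    // A plain Scala function wrapped as a UDF; it runs once per row.
    val capitalize = udf((s: String) => s.capitalize)
    df.select(capitalize($"name").as("capitalized")).show()

    // The same function registered for use inside Spark SQL.
    spark.udf.register("capitalize", (s: String) => s.capitalize)
    df.createOrReplaceTempView("people")
    spark.sql("SELECT capitalize(name) AS capitalized FROM people").show()

    spark.stop()
  }
}
```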
Redshift Database connection in spark
This blog primarily focuses on how to connect to Redshift from Spark. Redshift: Amazon Redshift is a fully managed, petabyte-scale data warehouse service. Redshift is designed for analytic...
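One straightforward way to connect, sketched below, is plain JDBC against the cluster endpoint. The endpoint, database, table, credentials, and driver class name are assumptions; the Redshift JDBC driver jar must be on the classpath, and the post itself may use a dedicated Spark-Redshift connector instead.

```scala
import org.apache.spark.sql.SparkSession

object RedshiftJdbcExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("redshift-jdbc-example")
      .getOrCreate()

    // Plain JDBC read from Redshift; the endpoint, database, table and
    // credentials are placeholders, and the driver class name depends on
    // the Redshift JDBC driver version on the classpath.
    val salesDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
      .option("dbtable", "public.sales")
      .option("user", "awsuser")
      .option("password", "secret")
      .option("driver", "com.amazon.redshift.jdbc42.Driver")
      .load()

    salesDf.printSchema()
    spark.stop()
  }
}
```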
Most common issues faced by Spark developers and their solutions
Timeout waiting for connection from pool Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool To resolve this...
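The excerpt cuts off before the fix, so the sketch below shows one commonly suggested remedy rather than the post's own: enlarging the EMRFS S3 connection pool. The property name and value are assumptions and should be checked against your EMR release.

```scala
import org.apache.spark.sql.SparkSession

object ConnectionPoolTuningSketch {
  def main(args: Array[String]): Unit = {
    // A commonly suggested remedy for the EMRFS "Timeout waiting for
    // connection from pool" error is to enlarge the S3 connection pool.
    // The property below (fs.s3.maxConnections) is the emrfs-site setting
    // and is an assumption here; verify it for your EMR release.
    val spark = SparkSession.builder()
      .appName("connection-pool-tuning")
      .config("spark.hadoop.fs.s3.maxConnections", "200")
      .getOrCreate()

    // ... run the job that previously exhausted the pool ...
    spark.stop()
  }
}
```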
File Operations in Scala
File operations are important in any application. We might have to provide some configuration information or some input to an application; in such scenarios we have to perform...
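A small sketch of the basic pattern: writing a configuration file with java.io.PrintWriter and reading it back with scala.io.Source. The file name and keys are placeholders.

```scala
import scala.io.Source
import java.io.{File, PrintWriter}

object FileOperationExample {
  def main(args: Array[String]): Unit = {
    // Write a small configuration file (the path is a placeholder).
    val writer = new PrintWriter(new File("app.conf"))
    try {
      writer.println("host=localhost")
      writer.println("port=8080")
    } finally {
      writer.close()
    }

    // Read it back line by line.
    val source = Source.fromFile("app.conf")
    try {
      source.getLines().foreach(println)
    } finally {
      source.close()
    }
  }
}
```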
Spark and ElasticSearch integration
In this blog, as the title suggests, I’m going to explain the end-to-end process of writing and reading data...
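A minimal sketch of that round trip, assuming the elasticsearch-hadoop (elasticsearch-spark) connector is on the classpath: write a DataFrame to an index and read it back. The node address, port, and index name are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ElasticsearchIntegrationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("es-integration-sketch")
      .master("local[*]")
      .config("es.nodes", "localhost")
      .config("es.port", "9200")
      .getOrCreate()
    import spark.implicits._

    // Write a DataFrame to an Elasticsearch index via the connector's
    // data source; the index name "blog_posts" is a placeholder.
    val df = Seq((1, "spark"), (2, "elasticsearch")).toDF("id", "keyword")
    df.write
      .format("org.elasticsearch.spark.sql")
      .mode("append")
      .save("blog_posts")

    // Read the same index back into a DataFrame.
    val readBack = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("blog_posts")
    readBack.show()

    spark.stop()
  }
}
```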
Missing Imputation in Scala
Imputation: In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as “unit imputation”; when substituting...
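To make the idea concrete, here is a sketch of two common approaches in Spark: constant-value (unit) imputation with DataFrame.na.fill and mean imputation with spark.ml's Imputer. The toy columns and values are assumptions, not the post's data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.Imputer

object MissingImputationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("missing-imputation-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A toy DataFrame with missing (null) values.
    val df = Seq(
      (Some(25.0), Some(50000.0)),
      (None, Some(48000.0)),
      (Some(31.0), None)
    ).toDF("age", "salary")

    // Unit imputation: replace missing values with a constant.
    df.na.fill(Map("age" -> 0.0, "salary" -> 0.0)).show()

    // Mean imputation with spark.ml's Imputer (Spark 2.2+).
    val imputer = new Imputer()
      .setInputCols(Array("age", "salary"))
      .setOutputCols(Array("age_imputed", "salary_imputed"))
      .setStrategy("mean")
    imputer.fit(df).transform(df).show()

    spark.stop()
  }
}
```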
Spark SQL Using Parquet
Today, I’m focusing on how to use the Parquet format in Spark. Please get more insight into the Parquet format if you are new to it. Parquet: Apache Parquet is a...
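A quick sketch of the basics: writing a DataFrame as Parquet and reading it back, with Spark recovering the schema from the Parquet metadata. The output path and sample data are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Write a small DataFrame as Parquet (the output path is a placeholder).
    val df = Seq((1, "scala"), (2, "spark")).toDF("id", "topic")
    df.write.mode("overwrite").parquet("/tmp/topics.parquet")

    // Read it back; Spark recovers the schema from the Parquet footer.
    val fromParquet = spark.read.parquet("/tmp/topics.parquet")
    fromParquet.show()

    spark.stop()
  }
}
```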