Missing Imputation in Python
Imputation:
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as “unit imputation”; when substituting for a component of a data point, it is known as “item imputation”.
There are many ways to approach missing data. The most common, I believe, is to ignore it. But making no choice means that your statistical software is choosing for you. Most of the time, your software is choosing listwise deletion: dropping every record that contains any missing value. Listwise deletion may or may not be a bad choice, depending on why, and how much, data are missing.
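As a quick illustration of what listwise deletion does, here is a sketch on a hypothetical two-column dataset, with None standing in for a missing value:

```python
# Listwise deletion: drop every record that has any missing value.
rows = [(1, 2), (3, None), (5, 6)]  # hypothetical data; None = missing
complete_cases = [r for r in rows if None not in r]
# The record (3, None) is discarded entirely, even though its first
# value was observed -- this is the information loss imputation avoids.
```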
Another common approach among those who are paying attention is imputation. Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values.
How do you choose that estimate? The following are common methods:
Mean imputation
Simply calculate the mean of the observed values for that variable across all individuals for whom it is non-missing, and substitute that mean for every missing value.
It has the advantage of keeping the same mean and the same sample size, but many, many disadvantages. Pretty much every method listed below is better than mean imputation.
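A minimal sketch of mean imputation in plain Python (the function name and the list representation are illustrative, assuming None marks a missing value):

```python
def mean_impute(values):
    # Mean of the observed (non-missing) values only
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # Substitute that mean for every missing value
    return [mean if v is None else v for v in values]

# mean_impute([1, None, 3]) fills the gap with 2.0,
# leaving the sample mean and the sample size unchanged.
```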
Missing imputation algorithm
- Read the data
- Get all column names and the type of each column
- Replace every missing-value token (NA, N.A., N.A.//, " ") with null
- Set a Boolean flag for each column indicating whether it contains null values: true for columns that contain nulls, otherwise false
- If the column type is string, find the most frequent word in that column; otherwise, calculate the average of that column
- Impute the most frequent word for string columns; impute the average for numeric columns
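The steps above can be sketched without Spark as follows (the function name, the rows-as-dicts representation, and the schema mapping are all illustrative assumptions, not part of the implementation below):

```python
from collections import Counter

MISSING_TOKENS = {"NA", "N.A.", "N.A.//", " "}  # tokens from step 3

def impute_columns(rows, schema):
    """rows: list of dicts; schema: {column: 'string' | 'number'}."""
    # Step 3: normalize missing-value tokens to None
    for row in rows:
        for c in row:
            if row[c] in MISSING_TOKENS:
                row[c] = None
    # Step 4: flag, per column, whether it contains any null
    has_null = {c: any(r[c] is None for r in rows) for c in schema}
    # Step 5: most frequent word for strings, average for numbers
    fill = {}
    for c, t in schema.items():
        observed = [r[c] for r in rows if r[c] is not None]
        if t == "string":
            fill[c] = Counter(observed).most_common(1)[0][0]
        else:
            fill[c] = sum(observed) / len(observed)
    # Step 6: impute the chosen value into every null
    for row in rows:
        for c in schema:
            if row[c] is None:
                row[c] = fill[c]
    return rows, has_null
```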
Implementation:
    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    from pyspark.sql.functions import col, when, avg, desc

    sc = SparkContext(appName="Missing Imputation in python")
    sqlContext = SQLContext(sc)

    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true') \
        .load("/user/manoj/imputation/sample.csv")

    # Column names and types
    column_list = df.dtypes
    print(column_list)
    string_column_list = [c for c in column_list if c[1] == "string"]
    print(string_column_list)
    number_column_list = [c for c in column_list if c[1] != "string"]
    print(number_column_list)

    missing_tokens = ["NA", "N.A.", "N.A.//", " "]

    def blank_as_null(x):
        # Replace every missing-value token with null
        return when(~col(x).isin(missing_tokens), col(x)).otherwise(None)

    exprs = [blank_as_null(x).alias(x) for x in df.columns]
    df = df.select(*exprs)

    # Flag, per column, whether each value is null (1) or observed (0)
    for x in df.columns:
        df = df.withColumn('missing_' + x, when(col(x).isNull(), 1).otherwise(0))

    # Numeric columns: impute the average of the observed values
    for column_name in number_column_list:
        avg_value = df.na.drop(subset=[column_name[0]]).agg(avg(column_name[0])).first()[0]
        df = df.na.fill(avg_value, [column_name[0]])

    # String columns: impute the most frequent observed value
    for string_column in string_column_list:
        most_frequent = df.na.drop(subset=[string_column[0]]) \
            .groupBy(string_column[0]).count() \
            .sort(desc("count")).first()[0]
        df = df.na.fill(most_frequent, [string_column[0]])