Missing Imputation in Python
Imputation:
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as “unit imputation”; when substituting for a component of a data point, it is known as “item imputation”.
There are many ways to approach missing data. The most common, I believe, is to ignore it. But making no choice means that your statistical software is choosing for you. Most of the time, your software is choosing listwise deletion: dropping every record that contains any missing value. Listwise deletion may or may not be a bad choice, depending on why, and how much, data are missing.
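As a quick illustration of what listwise deletion does, here is a sketch on a hypothetical two-column dataset, with None standing in for a missing value:

```python
# Listwise deletion: drop every record that has any missing value.
rows = [(1, 2), (3, None), (5, 6)]  # hypothetical data; None = missing
complete_cases = [r for r in rows if None not in r]
# The record (3, None) is discarded entirely, even though its first
# value was observed -- this is the information loss imputation avoids.
```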
Another common approach among those who are paying attention is imputation. Imputation simply means replacing the missing values with an estimate, then analyzing the full data set as if the imputed values were actual observed values.
How do you choose that estimate? The following are common methods:
Mean imputation
Simply calculate the mean of the observed values for that variable across all individuals for whom it is non-missing, and substitute that mean for every missing value.
It has the advantage of keeping the same mean and the same sample size, but many, many disadvantages. Pretty much every method listed below is better than mean imputation.
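A minimal sketch of mean imputation in plain Python (the function name and the list representation are illustrative, assuming None marks a missing value):

```python
def mean_impute(values):
    # Mean of the observed (non-missing) values only
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    # Substitute that mean for every missing value
    return [mean if v is None else v for v in values]

# mean_impute([1, None, 3]) fills the gap with 2.0,
# leaving the sample mean and the sample size unchanged.
```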
Missing imputation algorithm
- Read the data
- Get all column names and the type of each column
- Replace every missing-value token (NA, N.A., N.A.//, " ") with null
- Set a Boolean flag for each column indicating whether it contains null values: true for columns that contain nulls, otherwise false
- If the column type is string, find the most frequent word in that column; otherwise, calculate the average of that column
- Impute the most frequent word for string columns; impute the average for numeric columns
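The steps above can be sketched without Spark as follows (the function name, the rows-as-dicts representation, and the schema mapping are all illustrative assumptions, not part of the implementation below):

```python
from collections import Counter

MISSING_TOKENS = {"NA", "N.A.", "N.A.//", " "}  # tokens from step 3

def impute_columns(rows, schema):
    """rows: list of dicts; schema: {column: 'string' | 'number'}."""
    # Step 3: normalize missing-value tokens to None
    for row in rows:
        for c in row:
            if row[c] in MISSING_TOKENS:
                row[c] = None
    # Step 4: flag, per column, whether it contains any null
    has_null = {c: any(r[c] is None for r in rows) for c in schema}
    # Step 5: most frequent word for strings, average for numbers
    fill = {}
    for c, t in schema.items():
        observed = [r[c] for r in rows if r[c] is not None]
        if t == "string":
            fill[c] = Counter(observed).most_common(1)[0][0]
        else:
            fill[c] = sum(observed) / len(observed)
    # Step 6: impute the chosen value into every null
    for row in rows:
        for c in schema:
            if row[c] is None:
                row[c] = fill[c]
    return rows, has_null
```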
Implementation:
    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    from pyspark.sql.functions import col, when, avg, desc

    sc = SparkContext(appName="Missing Imputation in python")
    sqlContext = SQLContext(sc)

    df = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true') \
        .load("/user/manoj/imputation/sample.csv")

    # Column names and types
    column_list = df.dtypes
    print(column_list)
    string_column_list = [c for c in column_list if c[1] == "string"]
    print(string_column_list)
    number_column_list = [c for c in column_list if c[1] != "string"]
    print(number_column_list)

    missing_tokens = ["NA", "N.A.", "N.A.//", " "]

    def blank_as_null(x):
        # Replace every missing-value token with null
        return when(~col(x).isin(missing_tokens), col(x)).otherwise(None)

    exprs = [blank_as_null(x).alias(x) for x in df.columns]
    df = df.select(*exprs)

    # Flag, per column, whether each value is null (1) or observed (0)
    for x in df.columns:
        df = df.withColumn('missing_' + x, when(col(x).isNull(), 1).otherwise(0))

    # Numeric columns: impute the average of the observed values
    for column_name in number_column_list:
        avg_value = df.na.drop(subset=[column_name[0]]).agg(avg(column_name[0])).first()[0]
        df = df.na.fill(avg_value, [column_name[0]])

    # String columns: impute the most frequent observed value
    for string_column in string_column_list:
        most_frequent = df.na.drop(subset=[string_column[0]]) \
            .groupBy(string_column[0]).count() \
            .sort(desc("count")).first()[0]
        df = df.na.fill(most_frequent, [string_column[0]])