Handling Missing Values in Your Data

How well a Machine Learning algorithm performs depends on a number of factors, one of them being the quality of the data you present it. In fact, preparing/preprocessing your data into a magical input sits somewhere in the critical zone on a Wizard's (here, Data Scientist's) orb, and should, like the other factors, be taken seriously before you push it into an ML vanishing cabinet.

We will look at how to achieve quality data using some preprocessing techniques in order to build a robust machine learning model, and we might be lucky enough to test it on real data.

Workflow:

  • Identifying missing values or unknown values.
  • Removing/imputing missing values from the dataset
  • Getting categorical data to play nice with our ML model.
  • Drop this. Drop that. Selecting relevant features.

Having one or more missing values (NaN - Not a Number) lurking around in a dataset is not something that should worry you; if anything, not seeing a single one in a big dataset should scare you into going hunting for the powerhouse behind such beauty disguised as data. Why? As with most scientific data collection, errors are bound to creep in: a value read wrongly, sheer oversight, or a field left blank for any number of reasons. These empty fields are registered as either NULL or NaN.

Let's delve into some practical techniques for dealing with missing values, either by removing these NaNs or by imputing them.

Identifying missing values or unknown values.

For the sake of this post I have prepared some sample data with a few intentionally missing values, to give us a somewhat personal, hands-on look at this.
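If you'd like to follow along without the CSV file (the path in the next cell is just where I keep my own dummy data), here is a minimal sketch that builds the same DataFrame by hand:

import pandas as pd
import numpy as np

#same columns and values as the dummy CSV read in below
df = pd.DataFrame({
    'Age': [22, 19, 32, 25],
    'Salary (NGN)': [89500.0, 95000.0, 40000.0, np.nan],
    'Level': [2, 4, 0, 1],
    'Desk Number': [12.0, 6.0, np.nan, 3.0],
})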

In [11]:
import pandas as pd #for reading data, making dataframes, csv file I/O

df = pd.read_csv('~/datasets/dumdum/dummy_data.csv', sep=',') #import dummy data

df #call data
Out[11]:
   Age  Salary (NGN)  Level  Desk Number
0   22       89500.0      2         12.0
1   19       95000.0      4          6.0
2   32       40000.0      0          NaN
3   25           NaN      1          3.0
In [12]:
df.isna().sum() #checking the number of NaN in each column
Out[12]:
Age             0
Salary (NGN)    1
Level           0
Desk Number     1
dtype: int64

As we can see, Age and Level have no NaN, while Salary (NGN) and Desk Number have one each. By looking at df you can point them out.
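A couple of other quick checks worth having up your sleeve (nothing here beyond plain pandas, run on the same df):

df.isna().any()            #True/False per column: does it contain any NaN?
df.isna().sum().sum()      #total count of NaN across the whole dataframe
df[df.isna().any(axis=1)]  #pull out the rows that contain at least one NaN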

Easy peasy lemon squeezy. Let's move on, shall we?

Removing/imputing missing values from the dataset.

Simply removing missing values can be a fun task, albeit a problematic one. In order to do this you would have to decide what to remove: columns (features) or rows (samples). And even then you'd still have to live with the judgemental stare of the gods of data, wondering why you would snap your fingers on some (potentially) good data.

Again, it's a fun task. See how:

In [13]:
df.dropna(axis=0) #drop rows (axis=0) that contain NaN
Out[13]:
   Age  Salary (NGN)  Level  Desk Number
0   22       89500.0      2         12.0
1   19       95000.0      4          6.0
In [14]:
df.dropna(axis=1) #drop columns (axis=1) that contain NaN
Out[14]:
   Age  Level
0   22      2
1   19      4
2   32      0
3   25      1

Easy peasy, yeah?

NB:

  1. This method of dropping rows and columns can prove detrimental in the long run, as (valuable) samples/features could be eliminated, putting a reliable analysis at risk.
  2. The dropna method comes with plenty of options. Do check around for the other parameters, including how, subset and thresh, which let you drop only specific rows that contain NaNs; a few are sketched below.
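A few of those variants, sketched on our df (these are standard pandas dropna parameters):

df.dropna(how='all')               #drop a row only if all of its values are NaN
df.dropna(thresh=3)                #keep only rows with at least 3 non-NaN values
df.dropna(subset=['Salary (NGN)']) #drop rows where NaN appears in a specific column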

Imputing missing values

This is the recommended approach when dealing with missing values. It entails replacing a NaN with the mean, median or mode of that particular row/column. A convenient way to achieve this is by using the Imputer class from scikit-learn, as shown below:

In [15]:
from sklearn.preprocessing import Imputer
imput = Imputer(missing_values='NaN', strategy='mean', axis=0) #strategies = mean, median, most_frequent.
imput = imput.fit(df.values) #fit: learn each column's mean from the data
imputed_data = imput.transform(df.values) #transform: replace each NaN with its column's mean
imputed_data #show
Out[15]:
array([[2.20000000e+01, 8.95000000e+04, 2.00000000e+00, 1.20000000e+01],
       [1.90000000e+01, 9.50000000e+04, 4.00000000e+00, 6.00000000e+00],
       [3.20000000e+01, 4.00000000e+04, 0.00000000e+00, 7.00000000e+00],
       [2.50000000e+01, 7.48333333e+04, 1.00000000e+00, 3.00000000e+00]])
In [16]:
imputed_data.astype(int)  #so we can get a better visualization of the above.
Out[16]:
array([[   22, 89500,     2,    12],
       [   19, 95000,     4,     6],
       [   32, 40000,     0,     7],
       [   25, 74833,     1,     3]])

Not much of a task, yeah? No? Okay. We imported the Imputer class from scikit-learn, gave it a 'strategy' for handling the data, fitted it to the data and then transformed it. Easy peasy--no? Just go over it line by line.
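The reason fit and transform are two separate steps (rather than one call) is that the parameters learned during fit can be reused on data the imputer has never seen. Here is a tiny sketch, using made-up arrays that stand in for a train/test split:

import numpy as np
from sklearn.preprocessing import Imputer

#two small made-up arrays standing in for a train/test split
train_values = np.array([[1.0, np.nan], [3.0, 4.0]])
test_values = np.array([[np.nan, 5.0]])

imput = Imputer(missing_values='NaN', strategy='mean', axis=0)
imput = imput.fit(train_values)               #fit: learn each column's mean from the training data only
train_imputed = imput.transform(train_values) #transform: fill the training NaNs with those means
test_imputed = imput.transform(test_values)   #the test NaN gets the *training* column mean (2.0), not its own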

NB:

  1. The strategy most_frequent is suitable for imputing categorical feature values, for example, a feature column that contains an encoding of color names, such as green, orange, and blue.
  2. Understanding the scikit-learn estimator API: the fit/transform workflow (image source: imgur).
  3. The fit method learns the parameters from the training data, and the transform method uses these parameters to transform the data.
  4. There are other ways of handling NaN. Go find out! (A couple are sketched right below.)
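To point you in a direction or two: pandas has a fillna method of its own, and newer scikit-learn releases (0.20+) ship SimpleImputer in sklearn.impute as the successor to the Imputer class used above. A rough sketch of both, assuming the same df as before:

import numpy as np
from sklearn.impute import SimpleImputer #scikit-learn 0.20+

#pandas route: fill each column's NaN with that column's mean
df_filled = df.fillna(df.mean())

#scikit-learn route: SimpleImputer replaces the now-deprecated Imputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imp.fit_transform(df.values)
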
Resources:
  1. Python Machine Learning by Sebastian Raschka.
  2. scikit-learn documentation on data transformation.