Data cleaning with Pandas
Data cleaning
In this entry, you will learn how to clean your data and prepare it for analysis. This step can often take up 80% of the time of the whole project and is essential for robust results. For example, missing values or wrong data types can distort the results or even hinder you from performing certain operations.
As an example, we will use a modified version of the Airbnb dataset from Kaggle. [INSERT LINK TO OUR MODIFIED DATASET] You can find the documentation with information about the columns here. First, we need to import the Pandas library and our dataset.
import pandas as pd

df = pd.read_csv('airbnb_modified.csv')
df.head()
      realSum     room_type room_shared room_private person_capacity  \
0  194.033698  Private room       False         True           "2.0"
1  344.245776  Private room       False         True           "4.0"
2  264.101422  Private room       False         True           "2.0"
3  433.529398  Private room       False         True           "4.0"
4  485.552926  Private room       False         True           "2.0"

  host_is_superhost  multi  biz  cleanliness_rating  \
0             False      1    0                10.0
1             False      0    0                 8.0
2             False      0    1                 9.0
3             False      0    1                 9.0
4              True      0    0                10.0

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
0                        93.0  ...  5.022964    2.539380   78.690379
1                        85.0  ...  0.488389    0.239404  631.176378
2                        87.0  ...  5.748312    3.651621   75.275877
3                        90.0  ...  0.384862    0.439876  493.272534
4                        98.0  ...  0.544738    0.318693  552.830324

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
0         4.166708   98.253896         6.846473  4.90569  52.41772  amsterdam
1        33.421209  837.280757        58.342928  4.90005  52.37432  amsterdam
2         3.985908   95.386955         6.646700  4.97512  52.36103  amsterdam
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam

   weekend
0    False
1    False
2    False
3    False
4    False

[5 rows x 21 columns]
Remove and rename columns
First, remove any columns that do not contain useful information or that you will definitely not need in your analysis. Rename any columns with long or confusing names to something short and self-explanatory. This will make the dataframe less overwhelming and the following analysis much easier.
Have a look at the columns and check which information they contain in the documentation.
# look at all column names
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index',
       'rest_index_norm', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
In our case the columns 'attr_index', 'attr_index_norm', 'rest_index' and 'rest_index_norm' do not seem to contain any useful information, so we will remove them.
# remove columns
df.drop(['attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm'], axis=1, inplace=True)

# check if it worked
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
The names seem okay, but we could shorten 'guest_satisfaction_overall'.
# rename columns
df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1, inplace=True)

# check if it worked
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction', 'bedrooms', 'dist',
       'metro_dist', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
Change data types
Some columns may be stored with an incorrect data type and must be adjusted before you can analyze the data. Changing the data type of a column works similarly to the type casting you already know. However, this time you cast entire columns of the dataframe instead of individual variables.
We can look at the existing data types with dtypes.
# check the data types of the columns
df.dtypes
realSum               float64
room_type              object
room_shared              bool
room_private             bool
person_capacity        object
host_is_superhost        bool
multi                   int64
biz                     int64
cleanliness_rating    float64
guest_satisfaction    float64
bedrooms                int64
dist                  float64
metro_dist            float64
lng                   float64
lat                   float64
city                   object
weekend                  bool
dtype: object
The data type of the columns 'multi' and 'biz' is integer. However, looking at the documentation and the first few rows of the data, it seems like they contain only two values (0 or 1). We can check this using unique(), which shows us all the different values in a column.
# show all different values in these columns
print(df['multi'].unique())
print(df['biz'].unique())
[1 0]
[0 1]
We were correct, there are only 0 and 1. Therefore, the data type boolean would be more fitting. We can select the columns and change the data type with the pandas method astype().
# change the data type of both columns to boolean
df[['multi', 'biz']] = df[['multi', 'biz']].astype('bool')

# check the data types again to see if it worked
df.dtypes
realSum               float64
room_type              object
room_shared              bool
room_private             bool
person_capacity        object
host_is_superhost        bool
multi                    bool
biz                      bool
cleanliness_rating    float64
guest_satisfaction    float64
bedrooms                int64
dist                  float64
metro_dist            float64
lng                   float64
lat                   float64
city                   object
weekend                  bool
dtype: object
Furthermore, the values in the column 'person_capacity' have the data type object (pandas' version of a string). In order to be able to calculate with the values, we have to convert them to the data type integer.
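A minimal sketch of this conversion: judging from the preview above, the values look like quoted decimal strings (e.g. "2.0"), so we assume we have to strip the quotes and cast via float before converting to integer.

# remove the surrounding quotes, then cast via float
# (a direct cast to integer cannot parse strings like "2.0")
df['person_capacity'] = df['person_capacity'].str.strip('"').astype('float').astype('int')

# check if it worked
print(df['person_capacity'].dtype)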
Identify missing values
Missing values are typically represented as NaN (an abbreviation for Not a Number). However, it can also happen that missing values are represented differently (e.g. as -1). With isna() you can filter for missing values represented by NaN.
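If your data uses such a sentinel, you can convert it to NaN first so that the methods below recognize it. A minimal sketch, assuming a hypothetical column 'some_rating' that encodes missing entries as -1:

# import NumPy to get the NaN value
import numpy as np

# replace the sentinel value -1 with NaN so that isna() can detect it
df['some_rating'] = df['some_rating'].replace(-1, np.nan)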
To get an overview of missing values per column, you can use .isna().sum().
# show missing values per column
df.isna().sum()
realSum               100
room_type               0
room_shared             0
room_private            0
person_capacity         0
host_is_superhost       0
multi                   0
biz                     0
cleanliness_rating      0
guest_satisfaction      0
bedrooms                0
dist                    0
metro_dist              0
lng                     0
lat                     0
city                    0
weekend                 0
dtype: int64
We have 100 missing values in the column 'realSum'.
You can use the following code to determine the total number of missing values in the data set.
# show total number of missing values
df.isna().sum().sum()
100
You can also use isna() to select rows with missing values.
# select rows with missing values
df[df['realSum'].isna()]
       realSum        room_type room_shared room_private person_capacity  \
425        NaN  Entire home/apt       False        False               4
1639       NaN     Private room       False         True               2
1739       NaN     Private room       False         True               4
1952       NaN     Private room       False         True               3
2082       NaN  Entire home/apt       False        False               3
...        ...              ...         ...          ...             ...
48540      NaN  Entire home/apt       False        False               4
48598      NaN  Entire home/apt       False        False               2
49729      NaN  Entire home/apt       False        False               3
50036      NaN  Entire home/apt       False        False               6
50663      NaN  Entire home/apt       False        False               2

      host_is_superhost  multi    biz  cleanliness_rating  \
425               False  False  False                 8.0
1639               True   True  False                10.0
1739              False   True  False                 9.0
1952              False  False  False                10.0
2082               True  False   True                10.0
...                 ...    ...    ...                 ...
48540             False  False   True                 8.0
48598              True  False   True                10.0
49729              True  False  False                10.0
50036              True  False   True                 9.0
50663              True  False   True                10.0

      guest_satisfaction  bedrooms      dist  metro_dist       lng       lat  \
425                 88.0         1  0.504975    0.338422   4.89518  52.37759
1639                91.0         1  0.480411    0.336982   4.89533  52.37733
1739                90.0         2  1.651972    0.886285   4.87815  52.36154
1952                92.0         1  1.972445    0.292684   4.88810  52.35577
2082                98.0         1  1.237211    0.288488  23.72200  37.97900
...                  ...       ...       ...         ...       ...       ...
48540               80.0         1  2.674923    0.294553  16.34000  48.20000
48598               98.0         1  6.626611    1.293201  16.30963  48.16705
49729               96.0         1  0.625130    0.322278  16.36977  48.20354
50036               86.0         2  2.159373    1.057287  16.38876  48.19193
50663               99.0         0  1.391834    0.228494  16.36689  48.22020

            city weekend
425    amsterdam   False
1639   amsterdam    True
1739   amsterdam    True
1952   amsterdam    True
2082      athens   False
...          ...     ...
48540     vienna   False
48598     vienna   False
49729     vienna   False
50036     vienna    True
50663     vienna    True

[100 rows x 17 columns]
Handle missing values
Ideally, the missing data could be collected after the fact, but this is usually not feasible due to the effort and cost involved. Therefore, there are two approaches to dealing with missing data:
- Removing observations with missing values: this approach is simple but reduces the sample size.
- Replacing values: missing values can be replaced with, for example, the mean, median, or mode of the corresponding column. This preserves the sample size but may change the distribution and disturb useful patterns in the data set.
In addition, the importance of the column to the analysis plays a large role in deciding how to handle missing values. In our case, 'realSum' is an essential column for investigating price differences, so replacing missing values in this column could cause problems. Fortunately, 100 values are not very many in a dataset with over 50000 observations (use shape if you want to check this number). Therefore, we can simply remove the observations with missing values.
Remove observations with missing values
Pandas provides the dropna() method to easily remove missing values. The axis parameter specifies whether you want to remove rows (axis=0, the default) or columns (axis=1). In addition, the how parameter determines whether a row/column is removed if at least one value is missing (how='any') or only if all values are missing (how='all'). By default, dropna() removes rows that contain at least one missing value.
So, we can remove all rows that contain at least one missing value like this:
# Print number of rows and columns before operation
print(df.shape)

# Remove all rows with missing values
df.dropna(inplace=True)

# Print number of rows and columns after operation
print(df.shape)

# Print number of missing values to see if it worked
print(df.isna().sum().sum())
(51727, 17)
(51627, 17)
0
We can also remove only those rows which have missing values in a specific column, e.g. 'realSum', using the parameter subset=['realSum'].
# Remove all rows with missing values in the column 'realSum'
df.dropna(subset=['realSum'], inplace=True)
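The other parameters work analogously. For example, the following sketch (written without inplace=True, so it returns a new dataframe) would remove only columns in which every single value is missing:

# remove columns that consist entirely of missing values
df_without_empty_columns = df.dropna(axis=1, how='all')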
Replace missing values
Instead of removing missing values, we can replace them. For example, we can calculate the mean of the non-missing values of the corresponding column and replace the missing values with this mean. We could also use other numbers (e.g. the median or mode) for imputation. When working with real data, you also need to consider the reasons for the missing data before deciding which approach to use for treatment.
The choice of imputation value depends on several factors:
- Type of data:
  - For continuous data, it is common to use the mean or median.
  - For categorical data, the mode is often used (see the sketch after this list).
- Distribution of the data:
  - For normally distributed data, the mean may be an appropriate choice.
  - For skewed data, the median might be more appropriate because it is more robust to outliers.
- Imputation methods:
  - If the data are missing at random, simple imputation methods such as mean, median, or mode may be sufficient.
  - If the data are not missing at random, a more advanced imputation method might be required.
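For a categorical column, a mode imputation could look like the following sketch. We use the column 'room_type' purely for illustration; as seen above, it has no missing values in our dataset.

# determine the most frequent value (the mode) of 'room_type'
mode_value = df['room_type'].mode()[0]

# replace missing values in 'room_type' with the mode
df['room_type'] = df['room_type'].fillna(mode_value)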
To deal with the missing values in the column 'realSum', we first consider the type and distribution of the data. The data is continuous. We can look at the distribution with the following code.
# Import the library needed for creating plots
import matplotlib.pyplot as plt

# Calculate median and mean
print('Median:', df['realSum'].median(), ', Mean:', df['realSum'].mean())

# Create a histogram
plt.hist(df['realSum'], bins=30, edgecolor='black')  # You can adjust the number of bins
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
Median: 211.3430888246808 , Mean: 279.9866756625171
The distribution looks quite skewed, so in this case the median might be a better choice for imputing missing values. It works as follows:
# calculate the median of 'realSum'
median_value = df['realSum'].median()

# print the median
print(median_value)

# replace all missing values in the column 'realSum' with the median
df['realSum'] = df['realSum'].fillna(median_value)
The output will show:
211.3430888246808
Handle duplicates
Sometimes, values or entire rows are duplicated. To identify duplicate rows, you can use the duplicated() method. By default, it identifies a row as duplicated if all values in the row are the same as in another row, and it marks every occurrence as a duplicate except for the first one.
You can use duplicated() to select all duplicate rows like this:
# select duplicates
df.loc[df.duplicated()]
          realSum        room_type room_shared room_private  \
51707  393.235965  Entire home/apt       False        False
51708  212.741169     Private room       False         True
51709  438.065057  Entire home/apt       False        False
51710  151.852459  Entire home/apt       False        False
51711  132.302486  Entire home/apt       False        False
51712  189.493433  Entire home/apt       False        False
51713  157.809833  Entire home/apt       False        False
51714  149.860894  Entire home/apt       False        False
51715  412.643490  Entire home/apt       False        False
51716  139.117357  Entire home/apt       False        False
51717  990.527614  Entire home/apt       False        False
51718  341.958254  Entire home/apt       False        False
51719  207.436675     Private room       False         True
51720  217.166979     Private room       False         True
51721  149.592128  Entire home/apt       False        False
51722  789.999765  Entire home/apt       False        False
51723  454.854358  Entire home/apt       False        False
51724  364.898872  Entire home/apt       False        False
51725  204.053012  Entire home/apt       False        False
51726  486.187074  Entire home/apt       False        False

       person_capacity host_is_superhost  multi    biz  cleanliness_rating  \
51707                2             False  False  False                10.0
51708                2             False  False  False                 9.0
51709                4             False  False  False                10.0
51710                4              True  False   True                10.0
51711                4             False  False   True                 9.0
51712                4             False  False   True                 8.0
51713                4             False   True  False                 9.0
51714                2              True   True  False                10.0
51715                4             False  False   True                10.0
51716                4             False  False  False                10.0
51717                6              True  False   True                 8.0
51718                4             False  False   True                 8.0
51719                2             False  False  False                10.0
51720                2             False  False  False                 6.0
51721                4              True  False   True                10.0
51722                6             False  False   True                 8.0
51723                2             False   True  False                10.0
51724                4              True  False  False                10.0
51725                4             False   True  False                10.0
51726                2              True   True  False                10.0

       guest_satisfaction  bedrooms      dist  metro_dist       lng       lat  \
51707                98.0         1  4.833158    0.367477  -0.12004  51.55151
51708                94.0         1  3.359232    0.304258   2.39401  48.84347
51709                94.0         2  4.137770    0.062634   2.40718  48.84682
51710               100.0         2  2.508966    0.563467  23.73200  37.99800
51711                89.0         0  0.462231    0.175751  19.05885  47.49582
51712                80.0         1  2.510083    1.903344  -9.16650  38.70410
51713                91.0         1  4.594507    0.492478  12.48000  41.86300
51714               100.0         1  3.709593    1.360519  12.47000  41.92400
51715                95.0         1  1.879220    0.159314  12.48111  41.90694
51716                93.0         1  1.855371    0.467679  19.07420  47.48879
51717               100.0         3  1.060404    0.192599   2.16000  41.39300
51718                80.0         1  1.291520    0.027628  16.38056  48.21919
51719                98.0         1  3.762488    0.785826  -0.07399  51.50305
51720                60.0         1  3.045480    0.341114  -9.13141  38.73905
51721                95.0         1  2.581053    0.115898  16.33900  48.20700
51722                80.0         2  4.318515    0.295255  -0.18656  51.49615
51723                96.0         0  1.065912    1.134896   4.87807  52.37591
51724               100.0         1  1.901161    0.247033   2.37701  48.86258
51725                96.0         1  2.646609    1.045189  16.39382  48.18890
51726               100.0         1  2.437745    0.171287   2.14456  41.39785

            city weekend
51707     london    True
51708      paris    True
51709      paris   False
51710     athens   False
51711   budapest    True
51712     lisbon   False
51713       rome   False
51714       rome   False
51715       rome   False
51716   budapest   False
51717  barcelona   False
51718     vienna    True
51719     london   False
51720     lisbon   False
51721     vienna   False
51722     london    True
51723  amsterdam    True
51724      paris   False
51725     vienna   False
51726  barcelona   False
Furthermore, you can look at the total number of duplicated rows:
# show total number of duplicates
df.duplicated().sum()
20
We have 20 duplicates in our data.
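If you want to inspect every occurrence of a duplicated row, not just the repeats, you can use the keep parameter, as in this small sketch (keep='last' would instead mark all occurrences except the last one):

# select all occurrences of duplicated rows, including the first one
df.loc[df.duplicated(keep=False)]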
These duplicates can be correct, i.e. multiple observations may genuinely have the same values, or they can be errors in the data. In the latter case, they should be removed. To remove them, you can use the drop_duplicates() method.
# print the number of rows and columns before the operation
print(df.shape)

# remove duplicates
df.drop_duplicates(inplace=True)

# print the number of rows and columns after the operation
print(df.shape)

# print the number of duplicates to see if it worked
print(df.duplicated().sum())
(51627, 17)
(51607, 17)
0
Generate new features
In this section we will look at creating new features (= columns). For example, we might be interested in offers that could be used by families (i.e. a person_capacity of 3 or more). We can create a new column 'family' with boolean values that indicate whether this room/apartment can be used by a family. We create this column based on the column 'person_capacity'.
# generate new feature 'family'
df['family'] = df['person_capacity'] >= 3

df.head()
      realSum     room_type room_shared room_private person_capacity  \
0  194.033698  Private room       False         True               2
1  344.245776  Private room       False         True               4
2  264.101422  Private room       False         True               2
3  433.529398  Private room       False         True               4
4  485.552926  Private room       False         True               2

  host_is_superhost  multi    biz  cleanliness_rating  guest_satisfaction  \
0             False   True  False                10.0                93.0
1             False  False  False                 8.0                85.0
2             False  False   True                 9.0                87.0
3             False  False   True                 9.0                90.0
4              True  False  False                10.0                98.0

   bedrooms      dist  metro_dist      lng       lat       city  weekend  \
0         1  5.022964    2.539380  4.90569  52.41772  amsterdam    False
1         1  0.488389    0.239404  4.90005  52.37432  amsterdam    False
2         1  5.748312    3.651621  4.97512  52.36103  amsterdam    False
3         2  0.384862    0.439876  4.89417  52.37663  amsterdam    False
4         1  0.544738    0.318693  4.90051  52.37508  amsterdam    False

   family
0   False
1    True
2   False
3    True
4   False
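As another illustration (a hypothetical feature that is not used in the rest of this entry), you could combine two existing columns, for example to compute a price per person:

# generate a hypothetical feature 'price_per_person' from two existing columns
df['price_per_person'] = df['realSum'] / df['person_capacity']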
Summary
The data cleaning process heavily depends on your data and can look different every time. However, if you go through the steps in this entry, you are off to a good start:
- Remove and rename columns
- Change data types
- Handle missing values
- Handle duplicates
- Generate new features (if needed)
In addition, remember to check if everything worked as expected after each step.
References
To be added soon
The author of this entry is Wanja Tolksdorf.