Data cleaning with Pandas
Data cleaning
In this entry, you will learn how to clean your data and prepare it for analysis. This step can often take up 80% of the time of the whole project and is essential for robust results. For example, missing values or wrong data types can distort the results or even hinder you from performing certain operations.
As an example, we will use a modified version of the Airbnb dataset from Kaggle. [INSERT LINK TO OUR MODIFIED DATASET] You can find the documentation with information about the columns here. First, we need to import the Pandas library and our dataset.
import pandas as pd

df = pd.read_csv('airbnb_modified.csv')
df.head()
      realSum     room_type room_shared room_private person_capacity  \
0  194.033698  Private room       False         True           "2.0"
1  344.245776  Private room       False         True           "4.0"
2  264.101422  Private room       False         True           "2.0"
3  433.529398  Private room       False         True           "4.0"
4  485.552926  Private room       False         True           "2.0"

  host_is_superhost  multi  biz  cleanliness_rating  \
0             False      1    0                10.0
1             False      0    0                 8.0
2             False      0    1                 9.0
3             False      0    1                 9.0
4              True      0    0                10.0

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
0                        93.0  ...  5.022964    2.539380   78.690379
1                        85.0  ...  0.488389    0.239404  631.176378
2                        87.0  ...  5.748312    3.651621   75.275877
3                        90.0  ...  0.384862    0.439876  493.272534
4                        98.0  ...  0.544738    0.318693  552.830324

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
0         4.166708   98.253896         6.846473  4.90569  52.41772  amsterdam
1        33.421209  837.280757        58.342928  4.90005  52.37432  amsterdam
2         3.985908   95.386955         6.646700  4.97512  52.36103  amsterdam
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam

   weekend
0    False
1    False
2    False
3    False
4    False

[5 rows x 21 columns]
Remove and rename columns
First, remove any columns that do not contain useful information or that you will definitely not need in your analysis. Rename any columns with long or confusing names to something short and self-explanatory. This will make the dataframe less overwhelming and the following analysis much easier.
Have a look at the columns and check which information they contain in the documentation.
# look at all column names
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index',
       'rest_index_norm', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
In our case the columns 'attr_index', 'attr_index_norm', 'rest_index' and 'rest_index_norm' do not seem to contain any useful information, so we will remove them.
# remove columns
df.drop(['attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm'], axis=1, inplace=True)

# check if it worked
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
The names seem okay, but we could shorten 'guest_satisfaction_overall'.
# rename columns
df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1, inplace=True)

# check if it worked
print(df.columns)
Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction', 'bedrooms', 'dist',
       'metro_dist', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
Change data types
Some columns may be stored with an incorrect data type and must be adjusted before you can analyze the data. Changing the data type of a column works similarly to the type casting you already know. However, this time you cast entire columns of the dataframe instead of individual variables.
We can look at the existing data types with dtypes.
# check the data types of the columns
df.dtypes
realSum               float64
room_type              object
room_shared              bool
room_private             bool
person_capacity        object
host_is_superhost        bool
multi                   int64
biz                     int64
cleanliness_rating    float64
guest_satisfaction    float64
bedrooms                int64
dist                  float64
metro_dist            float64
lng                   float64
lat                   float64
city                   object
weekend                  bool
dtype: object
The data type of the columns 'multi' and 'biz' is integer. However, looking at the documentation and the first few rows of the data, it seems like they contain only two values (0 or 1). We can check this using unique(), which shows us all the different values in a column.
# show all different values in these columns
print(df['multi'].unique())
print(df['biz'].unique())
[1 0]
[0 1]
We were correct, there are only 0 and 1. Therefore, the data type boolean would be more fitting. We can select the columns and change the data type with the pandas method astype().
# change the data type of both columns to boolean
df[['multi', 'biz']] = df[['multi', 'biz']].astype('bool')

# check the data types again to see if it worked
df.dtypes
realSum               float64
room_type              object
room_shared              bool
room_private             bool
person_capacity        object
host_is_superhost        bool
multi                    bool
biz                      bool
cleanliness_rating    float64
guest_satisfaction    float64
bedrooms                int64
dist                  float64
metro_dist            float64
lng                   float64
lat                   float64
city                   object
weekend                  bool
dtype: object
Furthermore, the values in the column 'person_capacity' have the data type object (pandas' version of a string). In order to be able to calculate with the values, we have to convert them to the data type integer.
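A minimal sketch of this conversion: judging from the preview above, the values look like quoted decimal strings (e.g. "2.0"), so we assume we have to strip the quotes and cast via float before converting to integer.

# remove the surrounding quotes, then cast via float
# (a direct cast to integer cannot parse strings like "2.0")
df['person_capacity'] = df['person_capacity'].str.strip('"').astype('float').astype('int')

# check if it worked
print(df['person_capacity'].dtype)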
Identify missing values
Missing values are typically represented as NaN (an abbreviation for Not a Number). However, it can also happen that missing values are represented differently (e.g. as -1). With isna() you can filter for missing values represented by NaN.
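If your data uses such a sentinel, you can convert it to NaN first so that the methods below recognize it. A minimal sketch, assuming a hypothetical column 'some_rating' that encodes missing entries as -1:

# import NumPy to get the NaN value
import numpy as np

# replace the sentinel value -1 with NaN so that isna() can detect it
df['some_rating'] = df['some_rating'].replace(-1, np.nan)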
To get an overview of missing values per column, you can use .isna().sum().
# show missing values per column
df.isna().sum()
realSum               100
room_type               0
room_shared             0
room_private            0
person_capacity         0
host_is_superhost       0
multi                   0
biz                     0
cleanliness_rating      0
guest_satisfaction      0
bedrooms                0
dist                    0
metro_dist              0
lng                     0
lat                     0
city                    0
weekend                 0
dtype: int64
We have 100 missing values in the column 'realSum'.
You can use the following code to determine the total number of missing values in the data set.
# show total number of missing values
df.isna().sum().sum()
100
You can also use isna() to select rows with missing values.
# select rows with missing values
df[df['realSum'].isna()]
       realSum        room_type room_shared room_private person_capacity  \
425        NaN  Entire home/apt       False        False               4
1639       NaN     Private room       False         True               2
1739       NaN     Private room       False         True               4
1952       NaN     Private room       False         True               3
2082       NaN  Entire home/apt       False        False               3
...        ...              ...         ...          ...             ...
48540      NaN  Entire home/apt       False        False               4
48598      NaN  Entire home/apt       False        False               2
49729      NaN  Entire home/apt       False        False               3
50036      NaN  Entire home/apt       False        False               6
50663      NaN  Entire home/apt       False        False               2

      host_is_superhost  multi    biz  cleanliness_rating  \
425               False  False  False                 8.0
1639               True   True  False                10.0
1739              False   True  False                 9.0
1952              False  False  False                10.0
2082               True  False   True                10.0
...                 ...    ...    ...                 ...
48540             False  False   True                 8.0
48598              True  False   True                10.0
49729              True  False  False                10.0
50036              True  False   True                 9.0
50663              True  False   True                10.0

      guest_satisfaction  bedrooms      dist  metro_dist       lng       lat  \
425                 88.0         1  0.504975    0.338422   4.89518  52.37759
1639                91.0         1  0.480411    0.336982   4.89533  52.37733
1739                90.0         2  1.651972    0.886285   4.87815  52.36154
1952                92.0         1  1.972445    0.292684   4.88810  52.35577
2082                98.0         1  1.237211    0.288488  23.72200  37.97900
...                  ...       ...       ...         ...       ...       ...
48540               80.0         1  2.674923    0.294553  16.34000  48.20000
48598               98.0         1  6.626611    1.293201  16.30963  48.16705
49729               96.0         1  0.625130    0.322278  16.36977  48.20354
50036               86.0         2  2.159373    1.057287  16.38876  48.19193
50663               99.0         0  1.391834    0.228494  16.36689  48.22020

            city weekend
425    amsterdam   False
1639   amsterdam    True
1739   amsterdam    True
1952   amsterdam    True
2082      athens   False
...          ...     ...
48540     vienna   False
48598     vienna   False
49729     vienna   False
50036     vienna    True
50663     vienna    True

[100 rows x 17 columns]
Handle missing values
Ideally, the missing data could be collected after the fact, but this is usually not feasible due to the effort and cost involved. Therefore, there are two approaches to dealing with missing data:
- Removing observations with missing values: this approach is simple but reduces the sample size.
- Replacing values: missing values can be replaced with, for example, the mean, median, or mode of the corresponding column. This preserves the sample size but may change the distribution and disturb useful patterns in the data set.
In addition, the importance of the column to the analysis plays a large role in deciding how to handle missing values. In our case, 'realSum' is an essential column for investigating price differences, so replacing missing values in this column could cause problems. Fortunately, 100 values are not very many in a dataset with over 50000 observations (use shape if you want to check this number). Therefore, we can simply remove the observations with missing values.
Remove observations with missing values
Pandas provides the dropna() method to easily remove missing values. The axis parameter specifies whether you want to remove rows (axis=0, the default) or columns (axis=1). In addition, the how parameter determines whether a row/column is removed if at least one value is missing (how='any') or only if all values are missing (how='all'). By default, dropna() removes rows that contain at least one missing value.
So, we can remove all rows that contain at least one missing value like this:
# Print number of rows and columns before operation
print(df.shape)

# Remove all rows with missing values
df.dropna(inplace=True)

# Print number of rows and columns after operation
print(df.shape)

# Print number of missing values to see if it worked
print(df.isna().sum().sum())
(51727, 17)
(51627, 17)
0
We can also remove only those rows which have missing values in a specific column, e.g. 'realSum', using the parameter subset=['realSum'].
# Remove all rows with missing values in the column 'realSum'
df.dropna(subset=['realSum'], inplace=True)
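The other parameters work analogously. For example, the following sketch (written without inplace=True, so it returns a new dataframe) would remove only columns in which every single value is missing:

# remove columns that consist entirely of missing values
df_without_empty_columns = df.dropna(axis=1, how='all')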
Replace missing values
Instead of removing missing values, we can replace them. For example, we can calculate the mean of the non-missing values of the corresponding column and replace the missing values with this mean. We could also use other numbers (e.g. the median or mode) for imputation. When working with real data, you also need to consider the reasons for the missing data before deciding which approach to use for treatment.
The choice of imputation value depends on several factors:
- Type of data:
  - For continuous data, it is common to use the mean or median.
  - For categorical data, the mode is often used (see the sketch after this list).
- Distribution of the data:
  - For normally distributed data, the mean may be an appropriate choice.
  - For skewed data, the median might be more appropriate because it is more robust to outliers.
- Imputation methods:
  - If the data are missing at random, simple imputation methods such as mean, median, or mode may be sufficient.
  - If the data are not missing at random, a more advanced imputation method might be required.
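For a categorical column, a mode imputation could look like the following sketch. We use the column 'room_type' purely for illustration; as seen above, it has no missing values in our dataset.

# determine the most frequent value (the mode) of 'room_type'
mode_value = df['room_type'].mode()[0]

# replace missing values in 'room_type' with the mode
df['room_type'] = df['room_type'].fillna(mode_value)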
To deal with the missing values in the column 'realSum', we first consider the type and distribution of the data. The data is continuous. We can look at the distribution with the following code.
# Import the library needed for creating plots
import matplotlib.pyplot as plt

# Calculate median and mean
print('Median:', df['realSum'].median(), ', Mean:', df['realSum'].mean())

# Create a histogram
plt.hist(df['realSum'], bins=30, edgecolor='black')  # You can adjust the number of bins
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()
Median: 211.3430888246808 , Mean: 279.9866756625171
The distribution looks quite skewed, so in this case the median might be a better choice for imputing missing values. It works as follows:
# calculate the median of 'realSum'
median_value = df['realSum'].median()

# print the median
print(median_value)

# replace all missing values in the column 'realSum' with the median
df['realSum'] = df['realSum'].fillna(median_value)
The output will show:
211.3430888246808
Handle duplicates
Sometimes, values or entire rows are duplicated. To identify duplicate rows, you can use the duplicated() method. By default, it identifies a row as duplicated if all values in the row are the same as in another row, and it marks every occurrence as a duplicate except for the first one.
You can use duplicated() to select all duplicate rows like this:
# select duplicates
df.loc[df.duplicated()]
          realSum        room_type room_shared room_private  \
51707  393.235965  Entire home/apt       False        False
51708  212.741169     Private room       False         True
51709  438.065057  Entire home/apt       False        False
51710  151.852459  Entire home/apt       False        False
51711  132.302486  Entire home/apt       False        False
51712  189.493433  Entire home/apt       False        False
51713  157.809833  Entire home/apt       False        False
51714  149.860894  Entire home/apt       False        False
51715  412.643490  Entire home/apt       False        False
51716  139.117357  Entire home/apt       False        False
51717  990.527614  Entire home/apt       False        False
51718  341.958254  Entire home/apt       False        False
51719  207.436675     Private room       False         True
51720  217.166979     Private room       False         True
51721  149.592128  Entire home/apt       False        False
51722  789.999765  Entire home/apt       False        False
51723  454.854358  Entire home/apt       False        False
51724  364.898872  Entire home/apt       False        False
51725  204.053012  Entire home/apt       False        False
51726  486.187074  Entire home/apt       False        False

       person_capacity host_is_superhost  multi    biz  cleanliness_rating  \
51707                2             False  False  False                10.0
51708                2             False  False  False                 9.0
51709                4             False  False  False                10.0
51710                4              True  False   True                10.0
51711                4             False  False   True                 9.0
51712                4             False  False   True                 8.0
51713                4             False   True  False                 9.0
51714                2              True   True  False                10.0
51715                4             False  False   True                10.0
51716                4             False  False  False                10.0
51717                6              True  False   True                 8.0
51718                4             False  False   True                 8.0
51719                2             False  False  False                10.0
51720                2             False  False  False                 6.0
51721                4              True  False   True                10.0
51722                6             False  False   True                 8.0
51723                2             False   True  False                10.0
51724                4              True  False  False                10.0
51725                4             False   True  False                10.0
51726                2              True   True  False                10.0

       guest_satisfaction  bedrooms      dist  metro_dist       lng       lat  \
51707                98.0         1  4.833158    0.367477  -0.12004  51.55151
51708                94.0         1  3.359232    0.304258   2.39401  48.84347
51709                94.0         2  4.137770    0.062634   2.40718  48.84682
51710               100.0         2  2.508966    0.563467  23.73200  37.99800
51711                89.0         0  0.462231    0.175751  19.05885  47.49582
51712                80.0         1  2.510083    1.903344  -9.16650  38.70410
51713                91.0         1  4.594507    0.492478  12.48000  41.86300
51714               100.0         1  3.709593    1.360519  12.47000  41.92400
51715                95.0         1  1.879220    0.159314  12.48111  41.90694
51716                93.0         1  1.855371    0.467679  19.07420  47.48879
51717               100.0         3  1.060404    0.192599   2.16000  41.39300
51718                80.0         1  1.291520    0.027628  16.38056  48.21919
51719                98.0         1  3.762488    0.785826  -0.07399  51.50305
51720                60.0         1  3.045480    0.341114  -9.13141  38.73905
51721                95.0         1  2.581053    0.115898  16.33900  48.20700
51722                80.0         2  4.318515    0.295255  -0.18656  51.49615
51723                96.0         0  1.065912    1.134896   4.87807  52.37591
51724               100.0         1  1.901161    0.247033   2.37701  48.86258
51725                96.0         1  2.646609    1.045189  16.39382  48.18890
51726               100.0         1  2.437745    0.171287   2.14456  41.39785

            city weekend
51707     london    True
51708      paris    True
51709      paris   False
51710     athens   False
51711   budapest    True
51712     lisbon   False
51713       rome   False
51714       rome   False
51715       rome   False
51716   budapest   False
51717  barcelona   False
51718     vienna    True
51719     london   False
51720     lisbon   False
51721     vienna   False
51722     london    True
51723  amsterdam    True
51724      paris   False
51725     vienna   False
51726  barcelona   False
Furthermore, you can look at the total number of duplicated rows:
# show total number of duplicates
df.duplicated().sum()
20
We have 20 duplicates in our data.
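If you want to inspect every occurrence of a duplicated row, not just the repeats, you can use the keep parameter, as in this small sketch (keep='last' would instead mark all occurrences except the last one):

# select all occurrences of duplicated rows, including the first one
df.loc[df.duplicated(keep=False)]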
These duplicates can be correct, i.e. multiple observations may genuinely have the same values, or they can be errors in the data. In the latter case, they should be removed. To remove them, you can use the drop_duplicates() method.
# print the number of rows and columns before the operation
print(df.shape)

# remove duplicates
df.drop_duplicates(inplace=True)

# print the number of rows and columns after the operation
print(df.shape)

# print the number of duplicates to see if it worked
print(df.duplicated().sum())
(51627, 17)
(51607, 17)
0
Generate new features
In this section we will look at creating new features (= columns). For example, we might be interested in offers that could be used by families (i.e. a person_capacity of 3 or more). We can create a new column 'family' with boolean values that indicate whether this room/apartment can be used by a family. We create this column based on the column 'person_capacity'.
# generate new feature 'family'
df['family'] = df['person_capacity'] >= 3

df.head()
      realSum     room_type room_shared room_private person_capacity  \
0  194.033698  Private room       False         True               2
1  344.245776  Private room       False         True               4
2  264.101422  Private room       False         True               2
3  433.529398  Private room       False         True               4
4  485.552926  Private room       False         True               2

  host_is_superhost  multi    biz  cleanliness_rating  guest_satisfaction  \
0             False   True  False                10.0                93.0
1             False  False  False                 8.0                85.0
2             False  False   True                 9.0                87.0
3             False  False   True                 9.0                90.0
4              True  False  False                10.0                98.0

   bedrooms      dist  metro_dist      lng       lat       city  weekend  \
0         1  5.022964    2.539380  4.90569  52.41772  amsterdam    False
1         1  0.488389    0.239404  4.90005  52.37432  amsterdam    False
2         1  5.748312    3.651621  4.97512  52.36103  amsterdam    False
3         2  0.384862    0.439876  4.89417  52.37663  amsterdam    False
4         1  0.544738    0.318693  4.90051  52.37508  amsterdam    False

   family
0   False
1    True
2   False
3    True
4   False
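As another illustration (a hypothetical feature that is not used in the rest of this entry), you could combine two existing columns, for example to compute a price per person:

# generate a hypothetical feature 'price_per_person' from two existing columns
df['price_per_person'] = df['realSum'] / df['person_capacity']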
Summary
The data cleaning process heavily depends on your data and can look different every time. However, if you go through the steps in this entry, you are off to a good start:
- Remove and rename columns
- Change data types
- Handle missing values
- Handle duplicates
- Generate new features (if needed)
In addition, remember to check if everything worked as expected after each step.
References
To be added soon
The author of this entry is Wanja Tolksdorf.