Data import and inspection in Python

Purpose of data analysis

More and more data is collected every day and it can be used to answer questions and make informed decisions. For example, the data on Airbnb Price Determinants in Europe on Kaggle contains interesting information about Airbnb prices and anemities. It might help us to answer questions like "Which factors influence the price?" or "Is user satisfaction related to cleanliness?". However, can you answer these questions by just looking at the raw table of data? [INSERT SCREENSHOT OF EXCEL TABLE]

Here is where data analysis comes in: Data analysis helps us to extract useful information from piles of raw data.

Overview of data analysis steps

A data analysis process is usually initiated and guided by a question. This question can range from a broad open question (e.g. "Are there any interesting patterns in the Airbnb data?") to a specific hypothesis (e.g. user satisfaction is related to cleanliness).

Most of the time, a data analysis process involves the following steps, although not necessarily in this order. Often it is an iterative process. For example, you might realize that you need to do more data cleaning while visualizing the data.

Import and inspect
Explore and clean
Generate features
Visualize results and/or build models

Data sources

First of all, we need some data. You can either collect data yourself with a survey, experiment, etc. (= primary data) or you use datasets that other people collected and made available online (= secondary data). There are many platforms where you can find datasets. These are some common ones:

Kaggle: https://www.kaggle.com/datasets
Google Dataset Search: https://datasetsearch.research.google.com/
US Government's data: https://data.gov/
DataHub: https://datahub.io/collections
UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/datasets

Harvard Database: https://dataverse.harvard.edu/

For the following exemplary analysis we will use a modified version of the Airbnb dataset from Kaggle.

Pandas - a library to analyze tabular datasets

Data exists in many forms, for example tables, text or images. Here, we will analyze tabular data, stored in spreadsheets. To do this, we can use the Pandas library which provides us with the functions we need to work efficiently with tabular datasets. If you are interested, you can find the documentation here.

To handle tabular data Pandas introduces two additional data structures, the Series and the DataFrame. You can think of a DataFrame as a table and of a Series as a list.

In order to use Pandas, we first need to import the library:

import pandas as pd

Import of datasets

Before we can analyze our data, it must be imported into the current Python environment. The code for importing the file depends on the location of your data and the file format. You might want to import files from the following locations.

Note for JupyterHub users: If you use JupyterHub, you only have 2 options:

(5.1) Importing a file from a URL, if it is hosted online.
(5.3) Importing a file from the cloud, i.e. JuypterHub (relative path).

A note on option (5.3): It is not possible to import a file that lies on the hard drive of your computer directly. You always need to make sure that your data is uploaded on JupyterHub and lies in the same folder as your Juypter Notebook or a sub-folder. Only then will you be able to import the data.

Importing a file from a URL

If a file is hosted online, you can directly import it using the URL, for example Google Drive.

# example code for importing a file from a url
data = pd.read_csv('http://example.com/example_file.csv')

Importing a file from the hard drive

If the file is saved somewhere on your computer, you can use the absolute path to import it.

# example code for importing a file from the absolute path
data = pd.read_csv('C:/Users/Username/Documents/example_file.csv')

Importing a file from the hard drive or cloud (relative path)

If the file is in a directory (= folder) relative to your current working directory, you can use the relative path. For example, your file is stored in the folder 'data' and the folder 'data' lies in the same folder as your Jupyter Notebook.

# example code for importing a file from the relative path
data = pd.read_csv('data/example_file.csv')

If your file lies in the same folder as your Jupyter Notebook, you only need the name of the file.

# example code for importing a file from the same folder
data = pd.read_csv('example_file.csv')

With an absolute path the file can be accessed from anywhere in the system without needing to know the current working directory. However, using a relative path is useful if you want to move your project or send it to a colleague. As long as the relative structure stays the same, the file can still be imported from the same relative path, i.e. you do not need to change your code.

Creating a DataFrame manually

Furthermore, DataFrames can also be created manually with the help of dictionaries.

# a dictionary containing our data
dic = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'city': ['Dallas', 'Dallas', 'Los Angles', 'Chicago', 'Chicago']}

# creating a DataFrame from the dictionary
data = pd.DataFrame(dic)

Importing different file formats

Pandas can read a variety of file formats, such as csv, excel, json, html, SQL, etc. Depending on the file format of your data you need to use a different function to import your data. For example, to import a csv file you use pd.read_csv(), but for an excel file you use pd.read_excel().

# example code for importing a csv file
data = pd.read_csv('example_csv_file.csv')

# example code for importing an excel file
data = pd.read_excel('example_excel_file.xlsx')

For more information on how to import different file formats using Pandas, you can check out the pandas documentation.

(5.6) Airbnb example

Our modified Airbnb dataset is stored online under the url ... as a csv file. Therefore, we can use the following code to import the dataset and store it in the variable 'df'.

# Importing the Dataset and storing it in the variable 'df'
df = pd.read_csv('airbnb_modified.csv')

Let's see if the import worked. With the following pandas method head() we can get a preview of the data.

df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  \
0              False      1    0                10.0   
1              False      0    0                 8.0   
2              False      0    1                 9.0   
3              False      0    1                 9.0   
4               True      0    0                10.0   

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
0                        93.0  ...  5.022964    2.539380   78.690379   
1                        85.0  ...  0.488389    0.239404  631.176378   
2                        87.0  ...  5.748312    3.651621   75.275877   
3                        90.0  ...  0.384862    0.439876  493.272534   
4                        98.0  ...  0.544738    0.318693  552.830324   

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
0         4.166708   98.253896         6.846473  4.90569  52.41772  amsterdam   
1        33.421209  837.280757        58.342928  4.90005  52.37432  amsterdam   
2         3.985908   95.386955         6.646700  4.97512  52.36103  amsterdam   
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam   
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam   

  weekend  
0   False  
1   False  
2   False  
3   False  
4   False  

[5 rows x 21 columns]

As you can see, our data is now stored as a pandas DataFrame.

Note: When you start working with your own data, you will find that there are some difficulties to overcome. For example, some datasets are in incompatible file formats, are so large that they exceed the available memory, or contain unusual file encodings and formatting. With a little research, these are all solvable problems. You can do it!

Inspection of data

When we look at an unknown dataset, we first need to get an idea of the quantity and quality of the data. What data do we have? Is it relevant to our questions? Are there missing data? Is there anything outstanding?

We can get a good first impression by looking at some key figures. The following lines of code give an overview of common Pandas methods that offer general information about a dataframe and are a good starting point for data inspection.

Display your data

# displays the first 5 rows of the dataframe
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  \
0              False      1    0                10.0   
1              False      0    0                 8.0   
2              False      0    1                 9.0   
3              False      0    1                 9.0   
4               True      0    0                10.0   

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
0                        93.0  ...  5.022964    2.539380   78.690379   
1                        85.0  ...  0.488389    0.239404  631.176378   
2                        87.0  ...  5.748312    3.651621   75.275877   
3                        90.0  ...  0.384862    0.439876  493.272534   
4                        98.0  ...  0.544738    0.318693  552.830324   

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
0         4.166708   98.253896         6.846473  4.90569  52.41772  amsterdam   
1        33.421209  837.280757        58.342928  4.90005  52.37432  amsterdam   
2         3.985908   95.386955         6.646700  4.97512  52.36103  amsterdam   
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam   
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam   

  weekend  
0   False  
1   False  
2   False  
3   False  
4   False  

[5 rows x 21 columns]

# displays the last 5 rows of the dataframe
df.tail()

          realSum        room_type  room_shared  room_private person_capacity  \
51722  789.999765  Entire home/apt        False         False           "6.0"   
51723  454.854358  Entire home/apt        False         False           "2.0"   
51724  364.898872  Entire home/apt        False         False           "4.0"   
51725  204.053012  Entire home/apt        False         False           "4.0"   
51726  486.187074  Entire home/apt        False         False           "2.0"   

       host_is_superhost  multi  biz  cleanliness_rating  \
51722              False      0    1                 8.0   
51723              False      1    0                10.0   
51724               True      0    0                10.0   
51725              False      1    0                10.0   
51726               True      1    0                10.0   

       guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
51722                        80.0  ...  4.318515    0.295255  344.212057   
51723                        96.0  ...  1.065912    1.134896  419.731443   
51724                       100.0  ...  1.901161    0.247033  288.549988   
51725                        96.0  ...  2.646609    1.045189   99.411062   
51726                       100.0  ...  2.437745    0.171287  287.859569   

       attr_index_norm  rest_index  rest_index_norm       lng       lat  \
51722        23.927522  715.238423        12.801522  -0.18656  51.49615   
51723        22.222858  443.719331        38.681674   4.87807  52.37591   
51724        14.046516  759.885268        44.650506   2.37701  48.86258   
51725         7.119395  129.675670         3.099991  16.39382  48.18890   
51726         9.810718  645.035717        14.199221   2.14456  41.39785   

            city weekend  
51722     london    True  
51723  amsterdam    True  
51724      paris   False  
51725     vienna   False  
51726  barcelona   False  

[5 rows x 21 columns]

The head() method displays the first few rows of the data, while the tail() method displays the last few rows. This is a common way to quickly get an idea of the data and make sure that it has been loaded correctly. When working with large datasets, it is often not practical to print the entire dataset. Printing the first and last few rows can give you a general sense of the data without having to view all of it.

Note: Check out the documentation of the data source to find some explanations about the columns. For example, you can read more about the Airbnb data <a href="https://www.kaggle.com/datasets/thedevastator/airbnb-price-determinants-in-europe" target="_blank">here</a>.

The head() and tail() methods per default display the first/last 5 rows of a dataframe. If you want to see more or less rows, just put the desired amount into the brackets.

Size

# shape of the dataframe (rows, columns)
df.shape

The output shows:

(51727, 21)

The shape command shows the dimensions of the dataframe, i.e. how many rows (= observations) (1205) and columns (= features) (19) it has. This command can help you make sure that the data has been loaded or processed correctly and that you have the expected number of rows and columns.

Basic info

# basic info about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51727 entries, 0 to 51726
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   realSum                     51627 non-null  float64
 1   room_type                   51727 non-null  object 
 2   room_shared                 51727 non-null  bool   
 3   room_private                51727 non-null  bool   
 4   person_capacity             51727 non-null  object 
 5   host_is_superhost           51727 non-null  bool   
 6   multi                       51727 non-null  int64  
 7   biz                         51727 non-null  int64  
 8   cleanliness_rating          51727 non-null  float64
 9   guest_satisfaction_overall  51727 non-null  float64
 10  bedrooms                    51727 non-null  int64  
 11  dist                        51727 non-null  float64
 12  metro_dist                  51727 non-null  float64
 13  attr_index                  51727 non-null  float64
 14  attr_index_norm             51727 non-null  float64
 15  rest_index                  51727 non-null  float64
 16  rest_index_norm             51727 non-null  float64
 17  lng                         51727 non-null  float64
 18  lat                         51727 non-null  float64
 19  city                        51727 non-null  object 
 20  weekend                     51727 non-null  bool   
dtypes: bool(4), float64(11), int64(3), object(3)
memory usage: 6.9+ MB

The info() method gives you an overview of the features (= columns) contained in the dataframe, the number of existing observations per feature, as well as the data type of the values contained in each case. This information is useful for understanding the general structure of the data and for identifying potential issues that need to be addressed, such as missing values or wrong data types.

You can interpret the information as follows: class: DataFrame
RangeIndex: number of rows (observations)
Data columns: number of columns (features)

Table with
#: Column Index
Column: Column Name
Non-Null Count: number of existing (non-null) observations per column
Dtype: data type (int64 and float64 are numerical, object is a string)

dtypes: data types (+ number of columns with that data type)
memory usage: memory used by the DataFrame

Summary statistics

# summary statistics
df.describe()

            realSum         multi           biz  cleanliness_rating  \
count  51627.000000  51727.000000  51727.000000        51727.000000   
mean     279.986676      0.291337      0.350223            9.390570   
std      328.182873      0.454383      0.477044            0.954928   
min       34.779339      0.000000      0.000000            2.000000   
25%      148.752174      0.000000      0.000000            9.000000   
50%      211.343089      0.000000      0.000000           10.000000   
75%      319.753173      1.000000      1.000000           10.000000   
max    18545.450285      1.000000      1.000000           10.000000   

       guest_satisfaction_overall      bedrooms          dist    metro_dist  \
count                51727.000000  51727.000000  51727.000000  51727.000000   
mean                    92.627970      1.158757      3.191094      0.681474   
std                      8.945883      0.627421      2.393481      0.857916   
min                     20.000000      0.000000      0.015045      0.002301   
25%                     90.000000      1.000000      1.453185      0.248420   
50%                     95.000000      1.000000      2.613219      0.413263   
75%                     99.000000      1.000000      4.262975      0.737840   
max                    100.000000     10.000000     25.284557     14.273577   

         attr_index  attr_index_norm    rest_index  rest_index_norm  \
count  51727.000000     51727.000000  51727.000000     51727.000000   
mean     294.195063        13.423440    626.833079        22.785197   
std      224.740898         9.807584    497.882751        17.803206   
min       15.152201         0.926301     19.576924         0.592757   
25%      136.761801         6.380926    250.854114         8.752401   
50%      234.325006        11.467407    522.040377        17.540503   
75%      385.742931        17.415082    832.555492        32.963549   
max     4513.563486       100.000000   6696.156772       100.000000   

                lng           lat  
count  51727.000000  51727.000000  
mean       7.426021     45.671197  
std        9.799519      5.249056  
min       -9.226340     37.953000  
25%       -0.072500     41.399510  
50%        4.873000     47.506690  
75%       13.519480     51.471830  
max       23.786020     52.641410

The describe() method calculates the most important summary statistics, i.e. location and dispersion measures, of all numerical features contained in the dataframe. This can be a useful way to get a general idea of the data and identify potential trends or issues. For example, a minimum value of -1 for bedrooms would hint at a mistake in the data.

You can interpret the output of the describe() method as follows:

count: the number of non-null entries
mean: the mean or 'average' value
std: the standard deviation (i.e. how far the observations spread from the mean value)
min: the minimum value
25%: lower quartile (i.e. 25% of values are below this value)
50%: median (half of the values lie above, the other half below this value)
75%: upper quartile (i.e. 75% of values are below this value)
max: the maximum value

Summary

The Pandas library provides functions to analyze tabular datasets.
The first step of data analysis is to import the dataset. The exact code depends on the location and format of the file.
To get an overview of the data, we can use the following methods: head(), tail(), shape, info(), describe().

References

To be added soon

This author of this entry is Wanja Tolksdorf