Data import and inspection in Python
Contents
Purpose of data analysis
More and more data is collected every day and it can be used to answer questions and make informed decisions. For example, the data on Airbnb Price Determinants in Europe on Kaggle contains interesting information about Airbnb prices and anemities. It might help us to answer questions like "Which factors influence the price?" or "Is user satisfaction related to cleanliness?". However, can you answer these questions by just looking at the raw table of data? [INSERT SCREENSHOT OF EXCEL TABLE]
Here is where data analysis comes in: Data analysis helps us to extract useful information from piles of raw data.
Overview of data analysis steps
A data analysis process is usually initiated and guided by a question. This question can range from a broad open question (e.g. "Are there any interesting patterns in the Airbnb data?") to a specific hypothesis (e.g. user satisfaction is related to cleanliness).
Most of the time, a data analysis process involves the following steps, although not necessarily in this order. Often it is an iterative process. For example, you might realize that you need to do more data cleaning while visualizing the data.
- Import and inspect
- Explore and clean
- Generate features
- Visualize results and/or build models
Data sources
First of all, we need some data. You can either collect data yourself with a survey, experiment, etc. (= primary data) or you use datasets that other people collected and made available online (= secondary data). There are many platforms where you can find datasets. These are some common ones:
- Kaggle: https://www.kaggle.com/datasets
- Google Dataset Search: https://datasetsearch.research.google.com/
- US Government's data: https://data.gov/
- DataHub: https://datahub.io/collections
- UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/datasets
- Harvard Database: https://dataverse.harvard.edu/
For the following exemplary analysis we will use a modified version of the Airbnb dataset from Kaggle.
Pandas - a library to analyze tabular datasets
Data exists in many forms, for example tables, text or images. Here, we will analyze tabular data, stored in spreadsheets. To do this, we can use the Pandas library which provides us with the functions we need to work efficiently with tabular datasets. If you are interested, you can find the documentation here.
To handle tabular data Pandas introduces two additional data structures, the Series and the DataFrame. You can think of a DataFrame as a table and of a Series as a list.
In order to use Pandas, we first need to import the library:
import pandas as pd
Import of datasets
Before we can analyze our data, it must be imported into the current Python environment. The code for importing the file depends on the location of your data and the file format. You might want to import files from the following locations.
Note for JupyterHub users: If you use JupyterHub, you only have 2 options:
- (5.1) Importing a file from a URL, if it is hosted online.
- (5.3) Importing a file from the cloud, i.e. JuypterHub (relative path).
A note on option (5.3): It is not possible to import a file that lies on the hard drive of your computer directly. You always need to make sure that your data is uploaded on JupyterHub and lies in the same folder as your Juypter Notebook or a sub-folder. Only then will you be able to import the data.
Importing a file from a URL
If a file is hosted online, you can directly import it using the URL, for example Google Drive.
# example code for importing a file from a url data = pd.read_csv('http://example.com/example_file.csv')
Importing a file from the hard drive
If the file is saved somewhere on your computer, you can use the absolute path to import it.
# example code for importing a file from the absolute path data = pd.read_csv('C:/Users/Username/Documents/example_file.csv')
Importing a file from the hard drive or cloud (relative path)
If the file is in a directory (= folder) relative to your current working directory, you can use the relative path. For example, your file is stored in the folder 'data' and the folder 'data' lies in the same folder as your Jupyter Notebook.
# example code for importing a file from the relative path data = pd.read_csv('data/example_file.csv')
If your file lies in the same folder as your Jupyter Notebook, you only need the name of the file.
# example code for importing a file from the same folder data = pd.read_csv('example_file.csv')
With an absolute path the file can be accessed from anywhere in the system without needing to know the current working directory. However, using a relative path is useful if you want to move your project or send it to a colleague. As long as the relative structure stays the same, the file can still be imported from the same relative path, i.e. you do not need to change your code.
Creating a DataFrame manually
Furthermore, DataFrames can also be created manually with the help of dictionaries.
# a dictionary containing our data dic = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 'age': [42, 52, 36, 24, 73], 'city': ['Dallas', 'Dallas', 'Los Angles', 'Chicago', 'Chicago']} # creating a DataFrame from the dictionary data = pd.DataFrame(dic)
Importing different file formats
Pandas can read a variety of file formats, such as csv, excel, json, html, SQL, etc. Depending on the file format of your data you need to use a different function to import your data. For example, to import a csv file you use pd.read_csv()
, but for an excel file you use pd.read_excel()
.
# example code for importing a csv file data = pd.read_csv('example_csv_file.csv')
# example code for importing an excel file data = pd.read_excel('example_excel_file.xlsx')
For more information on how to import different file formats using Pandas, you can check out the pandas documentation.
(5.6) Airbnb example
Our modified Airbnb dataset is stored online under the url ... as a csv file. Therefore, we can use the following code to import the dataset and store it in the variable 'df'.
# Importing the Dataset and storing it in the variable 'df' df = pd.read_csv('airbnb_modified.csv')
Let's see if the import worked. With the following pandas method head()
we can get a preview of the data.
df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating \ 0 False 1 0 10.0 1 False 0 0 8.0 2 False 0 1 9.0 3 False 0 1 9.0 4 True 0 0 10.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 0 93.0 ... 5.022964 2.539380 78.690379 1 85.0 ... 0.488389 0.239404 631.176378 2 87.0 ... 5.748312 3.651621 75.275877 3 90.0 ... 0.384862 0.439876 493.272534 4 98.0 ... 0.544738 0.318693 552.830324 attr_index_norm rest_index rest_index_norm lng lat city \ 0 4.166708 98.253896 6.846473 4.90569 52.41772 amsterdam 1 33.421209 837.280757 58.342928 4.90005 52.37432 amsterdam 2 3.985908 95.386955 6.646700 4.97512 52.36103 amsterdam 3 26.119108 875.033098 60.973565 4.89417 52.37663 amsterdam 4 29.272733 815.305740 56.811677 4.90051 52.37508 amsterdam weekend 0 False 1 False 2 False 3 False 4 False [5 rows x 21 columns]
As you can see, our data is now stored as a pandas DataFrame.
Note: When you start working with your own data, you will find that there are some difficulties to overcome. For example, some datasets are in incompatible file formats, are so large that they exceed the available memory, or contain unusual file encodings and formatting. With a little research, these are all solvable problems. You can do it!
Inspection of data
When we look at an unknown dataset, we first need to get an idea of the quantity and quality of the data. What data do we have? Is it relevant to our questions? Are there missing data? Is there anything outstanding?
We can get a good first impression by looking at some key figures. The following lines of code give an overview of common Pandas methods that offer general information about a dataframe and are a good starting point for data inspection.
Display your data
# displays the first 5 rows of the dataframe df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating \ 0 False 1 0 10.0 1 False 0 0 8.0 2 False 0 1 9.0 3 False 0 1 9.0 4 True 0 0 10.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 0 93.0 ... 5.022964 2.539380 78.690379 1 85.0 ... 0.488389 0.239404 631.176378 2 87.0 ... 5.748312 3.651621 75.275877 3 90.0 ... 0.384862 0.439876 493.272534 4 98.0 ... 0.544738 0.318693 552.830324 attr_index_norm rest_index rest_index_norm lng lat city \ 0 4.166708 98.253896 6.846473 4.90569 52.41772 amsterdam 1 33.421209 837.280757 58.342928 4.90005 52.37432 amsterdam 2 3.985908 95.386955 6.646700 4.97512 52.36103 amsterdam 3 26.119108 875.033098 60.973565 4.89417 52.37663 amsterdam 4 29.272733 815.305740 56.811677 4.90051 52.37508 amsterdam weekend 0 False 1 False 2 False 3 False 4 False [5 rows x 21 columns]
# displays the last 5 rows of the dataframe df.tail()
realSum room_type room_shared room_private person_capacity \ 51722 789.999765 Entire home/apt False False "6.0" 51723 454.854358 Entire home/apt False False "2.0" 51724 364.898872 Entire home/apt False False "4.0" 51725 204.053012 Entire home/apt False False "4.0" 51726 486.187074 Entire home/apt False False "2.0" host_is_superhost multi biz cleanliness_rating \ 51722 False 0 1 8.0 51723 False 1 0 10.0 51724 True 0 0 10.0 51725 False 1 0 10.0 51726 True 1 0 10.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 51722 80.0 ... 4.318515 0.295255 344.212057 51723 96.0 ... 1.065912 1.134896 419.731443 51724 100.0 ... 1.901161 0.247033 288.549988 51725 96.0 ... 2.646609 1.045189 99.411062 51726 100.0 ... 2.437745 0.171287 287.859569 attr_index_norm rest_index rest_index_norm lng lat \ 51722 23.927522 715.238423 12.801522 -0.18656 51.49615 51723 22.222858 443.719331 38.681674 4.87807 52.37591 51724 14.046516 759.885268 44.650506 2.37701 48.86258 51725 7.119395 129.675670 3.099991 16.39382 48.18890 51726 9.810718 645.035717 14.199221 2.14456 41.39785 city weekend 51722 london True 51723 amsterdam True 51724 paris False 51725 vienna False 51726 barcelona False [5 rows x 21 columns]
The head()
method displays the first few rows of the data, while the tail()
method displays the last few rows. This is a common way to quickly get an idea of the data and make sure that it has been loaded correctly. When working with large datasets, it is often not practical to print the entire dataset. Printing the first and last few rows can give you a general sense of the data without having to view all of it.
Note: Check out the documentation of the data source to find some explanations about the columns. For example, you can read more about the Airbnb data <a href="https://www.kaggle.com/datasets/thedevastator/airbnb-price-determinants-in-europe" target="_blank">here</a>.
The head()
and tail()
methods per default display the first/last 5 rows of a dataframe. If you want to see more or less rows, just put the desired amount into the brackets.
Size
# shape of the dataframe (rows, columns) df.shape
The output shows:
(51727, 21)
The shape
command shows the dimensions of the dataframe, i.e. how many rows (= observations) (1205) and columns (= features) (19) it has. This command can help you make sure that the data has been loaded or processed correctly and that you have the expected number of rows and columns.
Basic info
# basic info about the dataframe df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 51727 entries, 0 to 51726 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 realSum 51627 non-null float64 1 room_type 51727 non-null object 2 room_shared 51727 non-null bool 3 room_private 51727 non-null bool 4 person_capacity 51727 non-null object 5 host_is_superhost 51727 non-null bool 6 multi 51727 non-null int64 7 biz 51727 non-null int64 8 cleanliness_rating 51727 non-null float64 9 guest_satisfaction_overall 51727 non-null float64 10 bedrooms 51727 non-null int64 11 dist 51727 non-null float64 12 metro_dist 51727 non-null float64 13 attr_index 51727 non-null float64 14 attr_index_norm 51727 non-null float64 15 rest_index 51727 non-null float64 16 rest_index_norm 51727 non-null float64 17 lng 51727 non-null float64 18 lat 51727 non-null float64 19 city 51727 non-null object 20 weekend 51727 non-null bool dtypes: bool(4), float64(11), int64(3), object(3) memory usage: 6.9+ MB
The info()
method gives you an overview of the features (= columns) contained in the dataframe, the number of existing observations per feature, as well as the data type of the values contained in each case. This information is useful for understanding the general structure of the data and for identifying potential issues that need to be addressed, such as missing values or wrong data types.
You can interpret the information as follows:
class
: DataFrame
RangeIndex
: number of rows (observations)
Data columns
: number of columns (features)
Table with
#
: Column Index
Column
: Column Name
Non-Null Count
: number of existing (non-null) observations per column
Dtype
: data type (int64 and float64 are numerical, object is a string)
dtypes
: data types (+ number of columns with that data type)
memory usage
: memory used by the DataFrame
Summary statistics
# summary statistics df.describe()
realSum multi biz cleanliness_rating \ count 51627.000000 51727.000000 51727.000000 51727.000000 mean 279.986676 0.291337 0.350223 9.390570 std 328.182873 0.454383 0.477044 0.954928 min 34.779339 0.000000 0.000000 2.000000 25% 148.752174 0.000000 0.000000 9.000000 50% 211.343089 0.000000 0.000000 10.000000 75% 319.753173 1.000000 1.000000 10.000000 max 18545.450285 1.000000 1.000000 10.000000 guest_satisfaction_overall bedrooms dist metro_dist \ count 51727.000000 51727.000000 51727.000000 51727.000000 mean 92.627970 1.158757 3.191094 0.681474 std 8.945883 0.627421 2.393481 0.857916 min 20.000000 0.000000 0.015045 0.002301 25% 90.000000 1.000000 1.453185 0.248420 50% 95.000000 1.000000 2.613219 0.413263 75% 99.000000 1.000000 4.262975 0.737840 max 100.000000 10.000000 25.284557 14.273577 attr_index attr_index_norm rest_index rest_index_norm \ count 51727.000000 51727.000000 51727.000000 51727.000000 mean 294.195063 13.423440 626.833079 22.785197 std 224.740898 9.807584 497.882751 17.803206 min 15.152201 0.926301 19.576924 0.592757 25% 136.761801 6.380926 250.854114 8.752401 50% 234.325006 11.467407 522.040377 17.540503 75% 385.742931 17.415082 832.555492 32.963549 max 4513.563486 100.000000 6696.156772 100.000000 lng lat count 51727.000000 51727.000000 mean 7.426021 45.671197 std 9.799519 5.249056 min -9.226340 37.953000 25% -0.072500 41.399510 50% 4.873000 47.506690 75% 13.519480 51.471830 max 23.786020 52.641410
The describe()
method calculates the most important summary statistics, i.e. location and dispersion measures, of all numerical features contained in the dataframe. This can be a useful way to get a general idea of the data and identify potential trends or issues. For example, a minimum value of -1 for bedrooms would hint at a mistake in the data.
You can interpret the output of the describe()
method as follows:
count
: the number of non-null entries
mean
: the mean or 'average' value
std
: the standard deviation (i.e. how far the observations spread from the mean value)
min
: the minimum value
25%
: lower quartile (i.e. 25% of values are below this value)
50%
: median (half of the values lie above, the other half below this value)
75%
: upper quartile (i.e. 75% of values are below this value)
max
: the maximum value
Summary
- The Pandas library provides functions to analyze tabular datasets.
- The first step of data analysis is to import the dataset. The exact code depends on the location and format of the file.
- To get an overview of the data, we can use the following methods:
head()
,tail()
,shape
,info()
,describe()
.
References
To be added soon
This author of this entry is Wanja Tolksdorf