Dataframe manipulation with Pandas
Dataframe manipulation
In this entry, you will learn how to access and manipulate data in Pandas dataframes. This skill is essential for many data analysis steps, such as data cleaning and visualization.
As an example, we will use a modified version of the Airbnb dataset from Kaggle. [INSERT LINK TO OUR MODIFIED DATASET] You can find the documentation with information about the columns here.
First, we need to import the Pandas library and our dataset.
import pandas as pd df = pd.read_csv('airbnb_modified.csv') df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating \ 0 False 1 0 10.0 1 False 0 0 8.0 2 False 0 1 9.0 3 False 0 1 9.0 4 True 0 0 10.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 0 93.0 ... 5.022964 2.539380 78.690379 1 85.0 ... 0.488389 0.239404 631.176378 2 87.0 ... 5.748312 3.651621 75.275877 3 90.0 ... 0.384862 0.439876 493.272534 4 98.0 ... 0.544738 0.318693 552.830324 attr_index_norm rest_index rest_index_norm lng lat city \ 0 4.166708 98.253896 6.846473 4.90569 52.41772 amsterdam 1 33.421209 837.280757 58.342928 4.90005 52.37432 amsterdam 2 3.985908 95.386955 6.646700 4.97512 52.36103 amsterdam 3 26.119108 875.033098 60.973565 4.89417 52.37663 amsterdam 4 29.272733 815.305740 56.811677 4.90051 52.37508 amsterdam weekend 0 False 1 False 2 False 3 False 4 False [5 rows x 21 columns]
Accessing Data
In order to manipulate the data inside a dataframe it is necessary to access the data. We can access the data via column and row labels. To get an overview of the column and row labels in our dataframe we can use the columns
and index
attributes.
print(df.columns) print(df.index)
Index(['realSum', 'room_type', 'room_shared', 'room_private', 'person_capacity', 'host_is_superhost', 'multi', 'biz', 'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist', 'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index', 'rest_index_norm', 'lng', 'lat', 'city', 'weekend'], dtype='object') RangeIndex(start=0, stop=51727, step=1)
If your dataframe has no index yet, you can improve its accessibility by adding an index yourself. It can be a simple range index from 0 to n or an own customized index for more specific needs.
# creating a simple range index data.index = pd.RangeIndex(start=0, stop=len(data), step=1)
# creating a customized index data.index = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']
There are three different options to access the data: square brackets []
, loc
and iloc
.
Selecting rows and columns with square brackets []
To select a subset of your data, you can use square brackets []
. Inside the brackets you can specify either the columns OR rows you want.
You have several options for specifying the rows or columns:
- a single column label
- a list of column labels
- a slice of rows
- a conditional expression
Note: For building conditions, you can use all of the common operators, such as ==, !=, <, >, <=, etc. If you combine multiple conditions, each condition must be surrounded by parentheses (). In pandas the or operator is represented by the symbol |, the and operator is represented by the symbol &.
Here are some examples:
# selecting a single column: realSum df['realSum']
0 194.033698 1 344.245776 2 264.101422 3 433.529398 4 485.552926 ... 51722 789.999765 51723 454.854358 51724 364.898872 51725 204.053012 51726 486.187074 Name: realSum, Length: 51727, dtype: float64
# selecting multiple columns: realSum & room_type df[['realSum', 'room_type']]
realSum room_type 0 194.033698 Private room 1 344.245776 Private room 2 264.101422 Private room 3 433.529398 Private room 4 485.552926 Private room ... ... ... 51722 789.999765 Entire home/apt 51723 454.854358 Entire home/apt 51724 364.898872 Entire home/apt 51725 204.053012 Entire home/apt 51726 486.187074 Entire home/apt [51727 rows x 2 columns]
# selecting a slice of rows (the 5th row is not included) df[3:5]
realSum room_type room_shared room_private person_capacity \ 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating \ 3 False 0 1 9.0 4 True 0 0 10.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 3 90.0 ... 0.384862 0.439876 493.272534 4 98.0 ... 0.544738 0.318693 552.830324 attr_index_norm rest_index rest_index_norm lng lat city \ 3 26.119108 875.033098 60.973565 4.89417 52.37663 amsterdam 4 29.272733 815.305740 56.811677 4.90051 52.37508 amsterdam weekend 3 False 4 False [2 rows x 21 columns]
# selecting by condition: all rows, for which realSum is between 500-1000€ df[(df['realSum'] > 500) & (df['realSum'] < 1000)]
realSum room_type room_shared room_private person_capacity \ 5 552.808567 Private room False True "3.0" 10 909.474375 Entire home/apt False False "2.0" 12 675.602840 Entire home/apt False False "4.0" 13 552.808567 Entire home/apt False False "2.0" 21 933.845757 Entire home/apt False False "4.0" ... ... ... ... ... ... 51701 750.765491 Entire home/apt False False "6.0" 51702 715.938574 Entire home/apt False False "6.0" 51704 637.168969 Entire home/apt False False "2.0" 51717 990.527614 Entire home/apt False False "6.0" 51722 789.999765 Entire home/apt False False "6.0" host_is_superhost multi biz cleanliness_rating \ 5 False 0 0 8.0 10 False 0 0 10.0 12 False 0 0 8.0 13 True 0 0 10.0 21 False 0 0 10.0 ... ... ... ... ... 51701 False 0 1 10.0 51702 False 0 1 10.0 51704 False 0 0 10.0 51717 True 0 1 8.0 51722 False 0 1 8.0 guest_satisfaction_overall ... dist metro_dist attr_index \ 5 100.0 ... 2.131420 1.904668 174.788957 10 96.0 ... 1.009922 0.917115 409.858124 12 87.0 ... 2.933046 0.628073 214.923342 13 100.0 ... 1.305494 1.342162 325.255952 21 96.0 ... 1.014066 0.377104 477.794070 ... ... ... ... ... ... 51701 96.0 ... 0.378804 0.203138 257.494817 51702 100.0 ... 0.530181 0.135447 219.402478 51704 93.0 ... 0.994051 0.202539 169.073402 51717 100.0 ... 1.060404 0.192599 583.188707 51722 80.0 ... 4.318515 0.295255 344.212057 attr_index_norm rest_index rest_index_norm lng lat \ 5 9.255191 225.201662 15.692376 4.87699 52.38966 10 21.702260 555.114276 38.681161 4.87956 52.36953 12 11.380334 269.624904 18.787851 4.88934 52.34697 13 17.222519 390.912052 27.239314 4.87417 52.37509 21 25.299513 664.053251 46.272186 4.89088 52.36422 ... ... ... ... ... ... 51701 18.440080 548.973296 13.268473 16.37825 48.21001 51702 15.712158 438.756874 10.604584 16.37940 48.21136 51704 12.107921 282.296424 6.822996 16.38568 48.20460 51717 19.876012 1336.989084 29.431245 2.16000 41.39300 51722 23.927522 715.238423 12.801522 -0.18656 51.49615 city weekend 5 amsterdam False 10 amsterdam False 12 amsterdam False 13 amsterdam False 21 amsterdam False ... ... ... 51701 vienna True 51702 vienna True 51704 vienna True 51717 barcelona False 51722 london True [4413 rows x 21 columns]
Selecting rows and columns with loc
If you want more advanced options, for example selecting specific rows and columns at the same time, you need the loc
operator. In the brackets, you specify the rows before the comma and the columns after the comma.
You have multiple options how to specify the rows and columns:
- a single label
- a list of labels
- a slice of labels
- a conditional expression
- a colon (meaning you want to select all rows or columns)
Note: For building conditions, you can use all of the common operators, such as ==, !=, <, >, <=, etc. If you combine multiple conditions, each condition must be surrounded by parentheses (). In pandas the or operator is represented by the symbol |, the and operator is represented by the symbol &.
Here are some examples:
# selecting a value by row and column: row 5 and column room_type df.loc[5, 'room_type']
'Private room'
# selecting multiple rows and columns: row 3 & 5 and columns realSum & room_type df.loc[[3,5], ['realSum', 'room_type']]
name nwspol 3 ESS10SCe03_1 120 5 ESS10SCe03_1 60
# selecting a slice of rows and columns: rows 3-5 (incl. row 5) and columns room_type - person_capacity df.loc[3:5, 'room_type' : 'person_capacity']
room_type room_shared room_private person_capacity 3 Private room False True "4.0" 4 Private room False True "2.0" 5 Private room False True "3.0"
# selecting rows by conditions and specific columns: # all rows, for which realSum is below 1000€ and room_type is private room # columns realSum, room_type and city df.loc[(df['realSum'] > 1000) & (df['room_type'] == 'Private room'), ['realSum', 'room_type', 'city']]
realSum room_type city 137 1474.937314 Private room amsterdam 190 1413.305837 Private room amsterdam 352 1572.891524 Private room amsterdam 552 1072.809505 Private room amsterdam 607 1007.662925 Private room amsterdam ... ... ... ... 42518 2305.192528 Private room rome 45892 1281.883431 Private room rome 47009 2311.738714 Private room rome 48380 13664.305916 Private room vienna 50787 13656.358834 Private room vienna [74 rows x 3 columns]
# selecting all rows and the realSum column df.loc[:, 'realSum']
0 194.033698 1 344.245776 2 264.101422 3 433.529398 4 485.552926 ... 51722 789.999765 51723 454.854358 51724 364.898872 51725 204.053012 51726 486.187074 Name: realSum, Length: 51727, dtype: float64
Selecting rows and columns with iloc
If you want to select rows and columns via their position in the dataframe, you can use iloc
. Again, you specify the rows before the comma and the columns after the comma.
You have multiple options how to specify the rows and columns:
- a single position
- a list of positions
- a slice of positions
- a colon (meaning you want to select all rows or columns)
Here are some examples:
# selecting the value in row 0, column 0 df.iloc[0,0]
194.03369812293488
# selecting a slice of rows: rows 0-2 (excl. row 3) # and selecting columns at position 0 and 2 df.iloc[0:3, [0,2]]
realSum room_shared 0 194.033698 False 1 344.245776 False 2 264.101422 False
When to use []
, loc
or iloc
- You can use square brackets
[]
if you want to make a simple selection of rows OR columns using row/column labels. - You need to use
loc
if you want to select specific rows AND columns using row/column labels. - You need to use
iloc
if you want to select specific rows and/or columns using their positions in the dataframe.
You can also assign new values to a selection based on loc
/iloc
. You will learn how to do this in the next section.
Changing data
Since we know how to access data, we can also modify the values.
Saving your changes
If you make changes to your data, chances are that you want to save your modifications and continue working with the modified dataframe.
There are several options for saving your changes:
- Many - but not all - Pandas methods support the
inplace
parameter. If you setinplace = True
your changes will be applied to the original dataframe. - You can overwrite (a part of) your original dataframe using the assignment operator
=
. It works the same way as overwriting other variables. - You can save your modifications in a new variable using the assignment operator
=
, i.e. create a copy of your original dataframe. If you still need the original dataframe alongside the modified one make a copy.
Note: With 'original dataframe' we mean the dataframe stored in your variable, and not the csv/excel file on your computer. The file is not changed at any point. So don't worry, if you accidentally overwrite your original dataframe you can always import it again from your csv file.
In section (2.2) Renaming columns, all three options will be showcased. In the following sections, we will always overwrite the original dataframe using =
.
Renaming columns
If your dataframe has long unreadable column names, the first thing you should do is changing the column names to make your life easier. Column names should ideally be short, self-explanatory and precisely summarize what the column is about. Do not use spaces because they can cause problems later, but use underscores instead.
You can use the Pandas method rename()
. The axis
parameter specifies if you want to rename rows (axis=0
, the default) or columns (axis=1
).
# rename guest_satisfaction_overall into guest_satisfaction # make this change in the original dataframe with inplace=True df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1, inplace=True) df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 amsterdam False 1 58.342928 4.90005 52.37432 amsterdam False 2 6.646700 4.97512 52.36103 amsterdam False 3 60.973565 4.89417 52.37663 amsterdam False 4 56.811677 4.90051 52.37508 amsterdam False [5 rows x 21 columns]
# rename guest_satisfaction_overall into guest_satisfaction # overwrite the variable df with the changed dataframe df = df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1) df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 amsterdam False 1 58.342928 4.90005 52.37432 amsterdam False 2 6.646700 4.97512 52.36103 amsterdam False 3 60.973565 4.89417 52.37663 amsterdam False 4 56.811677 4.90051 52.37508 amsterdam False [5 rows x 21 columns]
# rename guest_satisfaction_overall into guest_satisfaction # save the changed dataframe in a new variable df_new = df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1) df_new.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 amsterdam False 1 58.342928 4.90005 52.37432 amsterdam False 2 6.646700 4.97512 52.36103 amsterdam False 3 60.973565 4.89417 52.37663 amsterdam False 4 56.811677 4.90051 52.37508 amsterdam False [5 rows x 21 columns]
Changing values
You can assign new values to a selection based on loc
/iloc
.
# change the value in the room_shared column to True if the room_type is a 'Shared room' df.loc[df['room_type']=='Shared room', 'room_shared'] = True df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 amsterdam False 1 58.342928 4.90005 52.37432 amsterdam False 2 6.646700 4.97512 52.36103 amsterdam False 3 60.973565 4.89417 52.37663 amsterdam False 4 56.811677 4.90051 52.37508 amsterdam False [5 rows x 21 columns]
An important method to alter all values in a column at once is apply()
. It is commonly used to apply changes along a whole axis (i.e. a whole column or row) of a DataFrame. This built-in method should be always preferred over a loop in Python, because it is much faster and more efficient.
apply()
requires a function that specifies the changes you want to apply to each value in the column. So, before using apply you need to write a function.
# a function that capitalizes a string def change_spelling(x): return x.capitalize() # apply the function to the 'city' column to capitalize every city in the column df['city'] = df['city'].apply(change_spelling) df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 Amsterdam False 1 58.342928 4.90005 52.37432 Amsterdam False 2 6.646700 4.97512 52.36103 Amsterdam False 3 60.973565 4.89417 52.37663 Amsterdam False 4 56.811677 4.90051 52.37508 Amsterdam False [5 rows x 21 columns]
Alternatively, you can use a so-called lambda function. This is optional and already more advanced. Lambda is a short, anonymous function that is only used once and is not saved in the memory. It is introduced with the lambda keyword.
# use a lambda function to capitalize every city in the 'city' column df['city'] = df['city'].apply(lambda x: x.capitalize()) df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 Amsterdam False 1 58.342928 4.90005 52.37432 Amsterdam False 2 6.646700 4.97512 52.36103 Amsterdam False 3 60.973565 4.89417 52.37663 Amsterdam False 4 56.811677 4.90051 52.37508 Amsterdam False [5 rows x 21 columns]
For common changes you do not need to write your own functions. Pandas already has built-in methods that allow you to make quick changes to a whole column. For example, str
makes it possible to use methods that modify strings. To make all values in the 'name' column uppercase we can use str
followed by upper()
. It is worth it to always do a quick search on the internet for an existing pandas method before writing your own.
df['city'] = df['city'].str.capitalize() df.head()
realSum room_type room_shared room_private person_capacity \ 0 194.033698 Private room False True "2.0" 1 344.245776 Private room False True "4.0" 2 264.101422 Private room False True "2.0" 3 433.529398 Private room False True "4.0" 4 485.552926 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction ... \ 0 False 1 0 10.0 93.0 ... 1 False 0 0 8.0 85.0 ... 2 False 0 1 9.0 87.0 ... 3 False 0 1 9.0 90.0 ... 4 True 0 0 10.0 98.0 ... dist metro_dist attr_index attr_index_norm rest_index \ 0 5.022964 2.539380 78.690379 4.166708 98.253896 1 0.488389 0.239404 631.176378 33.421209 837.280757 2 5.748312 3.651621 75.275877 3.985908 95.386955 3 0.384862 0.439876 493.272534 26.119108 875.033098 4 0.544738 0.318693 552.830324 29.272733 815.305740 rest_index_norm lng lat city weekend 0 6.846473 4.90569 52.41772 Amsterdam False 1 58.342928 4.90005 52.37432 Amsterdam False 2 6.646700 4.97512 52.36103 Amsterdam False 3 60.973565 4.89417 52.37663 Amsterdam False 4 56.811677 4.90051 52.37508 Amsterdam False [5 rows x 21 columns]
Rearranging and Grouping Data
Often the data is not provided in a structured way, which makes it necessary to rearrange it in order to get a better overview. The following code shows cases of sorting and grouping the data.
You can use sort_values
to sort the data by a single or multiple values. Use the ascending
parameter to specify if you want to sort them from low to high (this is the default) or high to low.
# sort by 'realSum', from high to low df = df.sort_values(by=['realSum'], ascending=False) df.head()
realSum room_type room_shared room_private \ 3590 18545.450285 Entire home/apt False False 34803 16445.614689 Entire home/apt False False 24348 15499.894165 Entire home/apt False False 48380 13664.305916 Private room False True 50787 13656.358834 Private room False True person_capacity host_is_superhost multi biz cleanliness_rating \ 3590 "2.0" True 0 1 10.0 34803 "2.0" False 0 0 9.0 24348 "3.0" True 0 1 10.0 48380 "2.0" False 0 0 9.0 50787 "2.0" False 0 0 9.0 guest_satisfaction ... dist metro_dist attr_index \ 3590 100.0 ... 1.196536 0.381128 134.904353 34803 100.0 ... 4.602378 0.118665 260.896109 24348 95.0 ... 0.269101 0.227193 1100.391676 48380 87.0 ... 2.239501 0.414395 128.349070 50787 87.0 ... 2.239486 0.414409 128.349821 attr_index_norm rest_index rest_index_norm lng lat \ 3590 5.086455 275.573716 20.691752 23.73200 37.98600 34803 12.700335 545.826245 32.072497 2.29772 48.83669 24348 76.489446 2358.358493 42.228377 -0.13038 51.50995 48380 9.191812 201.545043 4.818080 16.34356 48.20751 50787 9.191567 201.546533 4.871302 16.34356 48.20751 city weekend 3590 Athens False 34803 Paris False 24348 London False 48380 Vienna False 50787 Vienna True [5 rows x 21 columns]
You can reorder the columns by selecting them in the preferred order like this:
# reorder the columns df_reordered = df[['city','weekend', 'realSum']] df_reordered.head()
city weekend realSum 3590 Athens False 18545.450285 34803 Paris False 16445.614689 24348 London False 15499.894165 48380 Vienna False 13664.305916 50787 Vienna True 13656.358834
The groupby()
method splits the data into groups based on defined criteria.
# group the data according to the city df.groupby('city')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002D978AE0280>
This is helpful for group-dependent analysis, for example if we want to know the mean price in each city.
# group the data according to the city and calculate the mean price for each city price_per_city = df.groupby('city')['realSum'].mean() print(price_per_city)
city Amsterdam 573.181612 Athens 151.724876 Barcelona 294.093539 Berlin 244.744835 Budapest 176.507530 Lisbon 238.215403 London 362.675702 Paris 392.534581 Rome 205.396554 Vienna 241.577172 Name: realSum, dtype: float64
Deleting Data
The deletion of data plays an important role in data analysis for getting rid of unnecessary or inaccurate data. drop()
is a common method for removing columns or rows. The axis
parameter specifies if you want to delete rows (axis=0
, the default) or columns (axis=1
).
# delete rows 3590 and 34803 df = df.drop([3590, 34803]) df.head()
# delete columns rest_index and rest_index_norm df = df.drop(['rest_index', 'rest_index_norm'], axis=1) df.head()
realSum room_type room_shared room_private \ 24348 15499.894165 Entire home/apt False False 48380 13664.305916 Private room False True 50787 13656.358834 Private room False True 48215 12942.991375 Entire home/apt False False 30626 12937.275101 Entire home/apt False False person_capacity host_is_superhost multi biz cleanliness_rating \ 24348 "3.0" True 0 1 10.0 48380 "2.0" False 0 0 9.0 50787 "2.0" False 0 0 9.0 48215 "4.0" False 0 1 7.0 30626 "6.0" False 0 0 10.0 guest_satisfaction bedrooms dist metro_dist attr_index \ 24348 95.0 3 0.269101 0.227193 1100.391676 48380 87.0 1 2.239501 0.414395 128.349070 50787 87.0 1 2.239486 0.414409 128.349821 48215 93.0 1 1.497979 0.396893 123.776241 30626 100.0 2 2.938307 0.619716 386.733299 attr_index_norm lng lat city weekend 24348 76.489446 -0.13038 51.50995 London False 48380 9.191812 16.34356 48.20751 Vienna False 50787 9.191567 16.34356 48.20751 Vienna True 48215 8.864325 16.38614 48.21911 Vienna False 30626 26.883339 -0.14658 51.53190 London True
Remember that you can also select the rows and columns you want to keep and overwrite the old dataframe with your selection. In some cases this might be faster than deleting all the rows/columns you want to get rid off. For example, you can select rows by a condition.
# select/delete rows by condition: we keep all rows with a 'realSum' below 100 df = df[df['realSum'] < 100] df.head()
realSum room_type room_shared room_private person_capacity \ 18041 99.906191 Private room False True "2.0" 20787 99.906191 Private room False True "2.0" 14759 99.873102 Entire home/apt False False "5.0" 13623 99.873102 Entire home/apt False False "3.0" 2136 99.828932 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction \ 18041 True 1 0 9.0 99.0 20787 True 1 0 10.0 95.0 14759 False 0 1 9.0 91.0 13623 False 0 0 8.0 89.0 2136 True 1 0 10.0 98.0 bedrooms dist metro_dist attr_index attr_index_norm lng \ 18041 1 2.357336 0.630488 110.844980 3.659468 -9.12773 20787 1 2.150651 0.887178 120.517161 3.975050 -9.12500 14759 1 2.125625 0.307137 227.040210 13.550651 19.02769 13623 1 2.408637 0.069400 115.642961 6.896333 19.04842 2136 1 1.604749 0.471949 146.849762 5.536846 23.73082 lat city weekend 18041 38.73146 Lisbon False 20787 38.72800 Lisbon True 14759 47.50505 Budapest True 13623 47.47725 Budapest False 2136 37.96157 Athens False
After deletion of data, you might want to reset the index because it is not adjusted automatically during the deletion process.
# reset the index of the dataframe # drop=True: the old index is replaced by the new one df = df.reset_index(drop=True) df.head()
realSum room_type room_shared room_private person_capacity \ 0 99.906191 Private room False True "2.0" 1 99.906191 Private room False True "2.0" 2 99.873102 Entire home/apt False False "5.0" 3 99.873102 Entire home/apt False False "3.0" 4 99.828932 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction \ 0 True 1 0 9.0 99.0 1 True 1 0 10.0 95.0 2 False 0 1 9.0 91.0 3 False 0 0 8.0 89.0 4 True 1 0 10.0 98.0 bedrooms dist metro_dist attr_index attr_index_norm lng \ 0 1 2.357336 0.630488 110.844980 3.659468 -9.12773 1 1 2.150651 0.887178 120.517161 3.975050 -9.12500 2 1 2.125625 0.307137 227.040210 13.550651 19.02769 3 1 2.408637 0.069400 115.642961 6.896333 19.04842 4 1 1.604749 0.471949 146.849762 5.536846 23.73082 lat city weekend 0 38.73146 Lisbon False 1 38.72800 Lisbon True 2 47.50505 Budapest True 3 47.47725 Budapest False 4 37.96157 Athens False
Adding Data
To add a new column to your dataframe you can simply invent a column name and assign it some values.
# create a new column and fill it with the value 'new' df['new_column'] = 'new' df.head()
realSum room_type room_shared room_private person_capacity \ 0 99.906191 Private room False True "2.0" 1 99.906191 Private room False True "2.0" 2 99.873102 Entire home/apt False False "5.0" 3 99.873102 Entire home/apt False False "3.0" 4 99.828932 Private room False True "2.0" host_is_superhost multi biz cleanliness_rating guest_satisfaction \ 0 True 1 0 9.0 99.0 1 True 1 0 10.0 95.0 2 False 0 1 9.0 91.0 3 False 0 0 8.0 89.0 4 True 1 0 10.0 98.0 bedrooms dist metro_dist attr_index attr_index_norm lng \ 0 1 2.357336 0.630488 110.844980 3.659468 -9.12773 1 1 2.150651 0.887178 120.517161 3.975050 -9.12500 2 1 2.125625 0.307137 227.040210 13.550651 19.02769 3 1 2.408637 0.069400 115.642961 6.896333 19.04842 4 1 1.604749 0.471949 146.849762 5.536846 23.73082 lat city weekend new_column 0 38.73146 Lisbon False new 1 38.72800 Lisbon True new 2 47.50505 Budapest True new 3 47.47725 Budapest False new 4 37.96157 Athens False new
Summary
- Access data with
[]
,loc
oriloc
. - Change data with
= new value
,apply()
or built-in Pandas methods. - Sort and group data with
sort_values()
andgroupby()
. - Delete data with
drop()
. - Don't forget to save your changes.
References
To be added soon
The author of this entry is Wanja Tolksdorf