Dataframe manipulation with Pandas

Dataframe manipulation

In this entry, you will learn how to access and manipulate data in Pandas dataframes. This skill is essential for many data analysis steps, such as data cleaning and visualization.

As an example, we will use a modified version of the Airbnb dataset from Kaggle. [INSERT LINK TO OUR MODIFIED DATASET] You can find the documentation with information about the columns here.

First, we need to import the Pandas library and our dataset.

import pandas as pd
df = pd.read_csv('airbnb_modified.csv')
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  \
0              False      1    0                10.0   
1              False      0    0                 8.0   
2              False      0    1                 9.0   
3              False      0    1                 9.0   
4               True      0    0                10.0   

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
0                        93.0  ...  5.022964    2.539380   78.690379   
1                        85.0  ...  0.488389    0.239404  631.176378   
2                        87.0  ...  5.748312    3.651621   75.275877   
3                        90.0  ...  0.384862    0.439876  493.272534   
4                        98.0  ...  0.544738    0.318693  552.830324   

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
0         4.166708   98.253896         6.846473  4.90569  52.41772  amsterdam   
1        33.421209  837.280757        58.342928  4.90005  52.37432  amsterdam   
2         3.985908   95.386955         6.646700  4.97512  52.36103  amsterdam   
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam   
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam   

  weekend  
0   False  
1   False  
2   False  
3   False  
4   False  

[5 rows x 21 columns]

Accessing Data

In order to manipulate the data inside a dataframe it is necessary to access the data. We can access the data via column and row labels. To get an overview of the column and row labels in our dataframe we can use the columns and index attributes.

print(df.columns)
print(df.index)

Index(['realSum', 'room_type', 'room_shared', 'room_private',
       'person_capacity', 'host_is_superhost', 'multi', 'biz',
       'cleanliness_rating', 'guest_satisfaction_overall', 'bedrooms', 'dist',
       'metro_dist', 'attr_index', 'attr_index_norm', 'rest_index',
       'rest_index_norm', 'lng', 'lat', 'city', 'weekend'],
      dtype='object')
RangeIndex(start=0, stop=51727, step=1)

If your dataframe has no index yet, you can improve its accessibility by adding an index yourself. It can be a simple range index from 0 to n or an own customized index for more specific needs.

# creating a simple range index
data.index = pd.RangeIndex(start=0, stop=len(data), step=1)

# creating a customized index
data.index = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o']

There are three different options to access the data: square brackets [], loc and iloc.

Selecting rows and columns with square brackets `[]`

To select a subset of your data, you can use square brackets []. Inside the brackets you can specify either the columns OR rows you want.

You have several options for specifying the rows or columns:

a single column label
a list of column labels
a slice of rows
a conditional expression

Note: For building conditions, you can use all of the common operators, such as ==, !=, <, >, <=, etc. If you combine multiple conditions, each condition must be surrounded by parentheses (). In pandas the or operator is represented by the symbol |, the and operator is represented by the symbol &.

Here are some examples:

# selecting a single column: realSum
df['realSum']

0        194.033698
1        344.245776
2        264.101422
3        433.529398
4        485.552926
            ...    
51722    789.999765
51723    454.854358
51724    364.898872
51725    204.053012
51726    486.187074
Name: realSum, Length: 51727, dtype: float64

# selecting multiple columns: realSum & room_type
df[['realSum', 'room_type']]

          realSum        room_type
0      194.033698     Private room
1      344.245776     Private room
2      264.101422     Private room
3      433.529398     Private room
4      485.552926     Private room
...           ...              ...
51722  789.999765  Entire home/apt
51723  454.854358  Entire home/apt
51724  364.898872  Entire home/apt
51725  204.053012  Entire home/apt
51726  486.187074  Entire home/apt

[51727 rows x 2 columns]

# selecting a slice of rows (the 5th row is not included)
df[3:5]

      realSum     room_type  room_shared  room_private person_capacity  \
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  \
3              False      0    1                 9.0   
4               True      0    0                10.0   

   guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
3                        90.0  ...  0.384862    0.439876  493.272534   
4                        98.0  ...  0.544738    0.318693  552.830324   

   attr_index_norm  rest_index  rest_index_norm      lng       lat       city  \
3        26.119108  875.033098        60.973565  4.89417  52.37663  amsterdam   
4        29.272733  815.305740        56.811677  4.90051  52.37508  amsterdam   

  weekend  
3   False  
4   False  

[2 rows x 21 columns]

# selecting by condition: all rows, for which realSum is between 500-1000€
df[(df['realSum'] > 500) & (df['realSum'] < 1000)]

          realSum        room_type  room_shared  room_private person_capacity  \
5      552.808567     Private room        False          True           "3.0"   
10     909.474375  Entire home/apt        False         False           "2.0"   
12     675.602840  Entire home/apt        False         False           "4.0"   
13     552.808567  Entire home/apt        False         False           "2.0"   
21     933.845757  Entire home/apt        False         False           "4.0"   
...           ...              ...          ...           ...             ...   
51701  750.765491  Entire home/apt        False         False           "6.0"   
51702  715.938574  Entire home/apt        False         False           "6.0"   
51704  637.168969  Entire home/apt        False         False           "2.0"   
51717  990.527614  Entire home/apt        False         False           "6.0"   
51722  789.999765  Entire home/apt        False         False           "6.0"   

       host_is_superhost  multi  biz  cleanliness_rating  \
5                  False      0    0                 8.0   
10                 False      0    0                10.0   
12                 False      0    0                 8.0   
13                  True      0    0                10.0   
21                 False      0    0                10.0   
...                  ...    ...  ...                 ...   
51701              False      0    1                10.0   
51702              False      0    1                10.0   
51704              False      0    0                10.0   
51717               True      0    1                 8.0   
51722              False      0    1                 8.0   

       guest_satisfaction_overall  ...      dist  metro_dist  attr_index  \
5                           100.0  ...  2.131420    1.904668  174.788957   
10                           96.0  ...  1.009922    0.917115  409.858124   
12                           87.0  ...  2.933046    0.628073  214.923342   
13                          100.0  ...  1.305494    1.342162  325.255952   
21                           96.0  ...  1.014066    0.377104  477.794070   
...                           ...  ...       ...         ...         ...   
51701                        96.0  ...  0.378804    0.203138  257.494817   
51702                       100.0  ...  0.530181    0.135447  219.402478   
51704                        93.0  ...  0.994051    0.202539  169.073402   
51717                       100.0  ...  1.060404    0.192599  583.188707   
51722                        80.0  ...  4.318515    0.295255  344.212057   

       attr_index_norm   rest_index  rest_index_norm       lng       lat  \
5             9.255191   225.201662        15.692376   4.87699  52.38966   
10           21.702260   555.114276        38.681161   4.87956  52.36953   
12           11.380334   269.624904        18.787851   4.88934  52.34697   
13           17.222519   390.912052        27.239314   4.87417  52.37509   
21           25.299513   664.053251        46.272186   4.89088  52.36422   
...                ...          ...              ...       ...       ...   
51701        18.440080   548.973296        13.268473  16.37825  48.21001   
51702        15.712158   438.756874        10.604584  16.37940  48.21136   
51704        12.107921   282.296424         6.822996  16.38568  48.20460   
51717        19.876012  1336.989084        29.431245   2.16000  41.39300   
51722        23.927522   715.238423        12.801522  -0.18656  51.49615   

            city weekend  
5      amsterdam   False  
10     amsterdam   False  
12     amsterdam   False  
13     amsterdam   False  
21     amsterdam   False  
...          ...     ...  
51701     vienna    True  
51702     vienna    True  
51704     vienna    True  
51717  barcelona   False  
51722     london    True  

[4413 rows x 21 columns]

Selecting rows and columns with `loc`

If you want more advanced options, for example selecting specific rows and columns at the same time, you need the loc operator. In the brackets, you specify the rows before the comma and the columns after the comma.

You have multiple options how to specify the rows and columns:

a single label
a list of labels
a slice of labels
a conditional expression
a colon (meaning you want to select all rows or columns)

Note: For building conditions, you can use all of the common operators, such as ==, !=, <, >, <=, etc. If you combine multiple conditions, each condition must be surrounded by parentheses (). In pandas the or operator is represented by the symbol |, the and operator is represented by the symbol &.

Here are some examples:

# selecting a value by row and column: row 5 and column room_type
df.loc[5, 'room_type']

'Private room'

# selecting multiple rows and columns: row 3 & 5 and columns realSum & room_type
df.loc[[3,5], ['realSum', 'room_type']]

           name  nwspol
3  ESS10SCe03_1     120
5  ESS10SCe03_1      60

# selecting a slice of rows and columns: rows 3-5 (incl. row 5) and columns room_type - person_capacity
df.loc[3:5, 'room_type' : 'person_capacity']

      room_type  room_shared  room_private person_capacity
3  Private room        False          True           "4.0"
4  Private room        False          True           "2.0"
5  Private room        False          True           "3.0"

# selecting rows by conditions and specific columns: 
# all rows, for which realSum is below 1000€ and room_type is private room
# columns realSum, room_type and city
df.loc[(df['realSum'] > 1000) & (df['room_type'] == 'Private room'), ['realSum', 'room_type', 'city']]

            realSum     room_type       city
137     1474.937314  Private room  amsterdam
190     1413.305837  Private room  amsterdam
352     1572.891524  Private room  amsterdam
552     1072.809505  Private room  amsterdam
607     1007.662925  Private room  amsterdam
...             ...           ...        ...
42518   2305.192528  Private room       rome
45892   1281.883431  Private room       rome
47009   2311.738714  Private room       rome
48380  13664.305916  Private room     vienna
50787  13656.358834  Private room     vienna

[74 rows x 3 columns]

# selecting all rows and the realSum column
df.loc[:, 'realSum']

0        194.033698
1        344.245776
2        264.101422
3        433.529398
4        485.552926
            ...    
51722    789.999765
51723    454.854358
51724    364.898872
51725    204.053012
51726    486.187074
Name: realSum, Length: 51727, dtype: float64

Selecting rows and columns with `iloc`

If you want to select rows and columns via their position in the dataframe, you can use iloc. Again, you specify the rows before the comma and the columns after the comma.

You have multiple options how to specify the rows and columns:

a single position
a list of positions
a slice of positions
a colon (meaning you want to select all rows or columns)

Here are some examples:

# selecting the value in row 0, column 0
df.iloc[0,0]

194.03369812293488

# selecting a slice of rows: rows 0-2 (excl. row 3)
# and selecting columns at position 0 and 2
df.iloc[0:3, [0,2]]

      realSum  room_shared
0  194.033698        False
1  344.245776        False
2  264.101422        False

When to use `[]`, `loc` or `iloc`

You can use square brackets [] if you want to make a simple selection of rows OR columns using row/column labels.
You need to use loc if you want to select specific rows AND columns using row/column labels.
You need to use iloc if you want to select specific rows and/or columns using their positions in the dataframe.

You can also assign new values to a selection based on loc/iloc. You will learn how to do this in the next section.

Source: https://pandas.pydata.org/pandas-docs/version/1.0.2/getting_started/intro_tutorials/03_subset_data.html

Changing data

Since we know how to access data, we can also modify the values.

Saving your changes

If you make changes to your data, chances are that you want to save your modifications and continue working with the modified dataframe.

There are several options for saving your changes:

Many - but not all - Pandas methods support the inplace parameter. If you set inplace = True your changes will be applied to the original dataframe.
You can overwrite (a part of) your original dataframe using the assignment operator =. It works the same way as overwriting other variables.
You can save your modifications in a new variable using the assignment operator =, i.e. create a copy of your original dataframe. If you still need the original dataframe alongside the modified one make a copy.

Note: With 'original dataframe' we mean the dataframe stored in your variable, and not the csv/excel file on your computer. The file is not changed at any point. So don't worry, if you accidentally overwrite your original dataframe you can always import it again from your csv file.

In section (2.2) Renaming columns, all three options will be showcased. In the following sections, we will always overwrite the original dataframe using =.

Renaming columns

If your dataframe has long unreadable column names, the first thing you should do is changing the column names to make your life easier. Column names should ideally be short, self-explanatory and precisely summarize what the column is about. Do not use spaces because they can cause problems later, but use underscores instead.

You can use the Pandas method rename(). The axis parameter specifies if you want to rename rows (axis=0, the default) or columns (axis=1).

# rename guest_satisfaction_overall into guest_satisfaction
# make this change in the original dataframe with inplace=True
df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1, inplace=True)
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  amsterdam   False  
1        58.342928  4.90005  52.37432  amsterdam   False  
2         6.646700  4.97512  52.36103  amsterdam   False  
3        60.973565  4.89417  52.37663  amsterdam   False  
4        56.811677  4.90051  52.37508  amsterdam   False  

[5 rows x 21 columns]

# rename guest_satisfaction_overall into guest_satisfaction
# overwrite the variable df with the changed dataframe
df = df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1)
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  amsterdam   False  
1        58.342928  4.90005  52.37432  amsterdam   False  
2         6.646700  4.97512  52.36103  amsterdam   False  
3        60.973565  4.89417  52.37663  amsterdam   False  
4        56.811677  4.90051  52.37508  amsterdam   False  

[5 rows x 21 columns]

# rename guest_satisfaction_overall into guest_satisfaction
# save the changed dataframe in a new variable
df_new = df.rename({'guest_satisfaction_overall': 'guest_satisfaction'}, axis=1)
df_new.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  amsterdam   False  
1        58.342928  4.90005  52.37432  amsterdam   False  
2         6.646700  4.97512  52.36103  amsterdam   False  
3        60.973565  4.89417  52.37663  amsterdam   False  
4        56.811677  4.90051  52.37508  amsterdam   False  

[5 rows x 21 columns]

Changing values

You can assign new values to a selection based on loc/iloc.

# change the value in the room_shared column to True if the room_type is a 'Shared room'
df.loc[df['room_type']=='Shared room', 'room_shared'] = True
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  amsterdam   False  
1        58.342928  4.90005  52.37432  amsterdam   False  
2         6.646700  4.97512  52.36103  amsterdam   False  
3        60.973565  4.89417  52.37663  amsterdam   False  
4        56.811677  4.90051  52.37508  amsterdam   False  

[5 rows x 21 columns]

An important method to alter all values in a column at once is apply(). It is commonly used to apply changes along a whole axis (i.e. a whole column or row) of a DataFrame. This built-in method should be always preferred over a loop in Python, because it is much faster and more efficient.

apply() requires a function that specifies the changes you want to apply to each value in the column. So, before using apply you need to write a function.

# a function that capitalizes a string
def change_spelling(x):
    return x.capitalize()

# apply the function to the 'city' column to capitalize every city in the column
df['city'] = df['city'].apply(change_spelling)
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  Amsterdam   False  
1        58.342928  4.90005  52.37432  Amsterdam   False  
2         6.646700  4.97512  52.36103  Amsterdam   False  
3        60.973565  4.89417  52.37663  Amsterdam   False  
4        56.811677  4.90051  52.37508  Amsterdam   False  

[5 rows x 21 columns]

Alternatively, you can use a so-called lambda function. This is optional and already more advanced. Lambda is a short, anonymous function that is only used once and is not saved in the memory. It is introduced with the lambda keyword.

# use a lambda function to capitalize every city in the 'city' column
df['city'] = df['city'].apply(lambda x: x.capitalize())
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  Amsterdam   False  
1        58.342928  4.90005  52.37432  Amsterdam   False  
2         6.646700  4.97512  52.36103  Amsterdam   False  
3        60.973565  4.89417  52.37663  Amsterdam   False  
4        56.811677  4.90051  52.37508  Amsterdam   False  

[5 rows x 21 columns]

For common changes you do not need to write your own functions. Pandas already has built-in methods that allow you to make quick changes to a whole column. For example, str makes it possible to use methods that modify strings. To make all values in the 'name' column uppercase we can use str followed by upper(). It is worth it to always do a quick search on the internet for an existing pandas method before writing your own.

df['city'] = df['city'].str.capitalize()
df.head()

      realSum     room_type  room_shared  room_private person_capacity  \
0  194.033698  Private room        False          True           "2.0"   
1  344.245776  Private room        False          True           "4.0"   
2  264.101422  Private room        False          True           "2.0"   
3  433.529398  Private room        False          True           "4.0"   
4  485.552926  Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  ...  \
0              False      1    0                10.0                93.0  ...   
1              False      0    0                 8.0                85.0  ...   
2              False      0    1                 9.0                87.0  ...   
3              False      0    1                 9.0                90.0  ...   
4               True      0    0                10.0                98.0  ...   

       dist  metro_dist  attr_index  attr_index_norm  rest_index  \
0  5.022964    2.539380   78.690379         4.166708   98.253896   
1  0.488389    0.239404  631.176378        33.421209  837.280757   
2  5.748312    3.651621   75.275877         3.985908   95.386955   
3  0.384862    0.439876  493.272534        26.119108  875.033098   
4  0.544738    0.318693  552.830324        29.272733  815.305740   

   rest_index_norm      lng       lat       city weekend  
0         6.846473  4.90569  52.41772  Amsterdam   False  
1        58.342928  4.90005  52.37432  Amsterdam   False  
2         6.646700  4.97512  52.36103  Amsterdam   False  
3        60.973565  4.89417  52.37663  Amsterdam   False  
4        56.811677  4.90051  52.37508  Amsterdam   False  

[5 rows x 21 columns]

Rearranging and Grouping Data

Often the data is not provided in a structured way, which makes it necessary to rearrange it in order to get a better overview. The following code shows cases of sorting and grouping the data.

You can use sort_values to sort the data by a single or multiple values. Use the ascending parameter to specify if you want to sort them from low to high (this is the default) or high to low.

# sort by 'realSum', from high to low
df = df.sort_values(by=['realSum'], ascending=False)
df.head()

            realSum        room_type  room_shared  room_private  \
3590   18545.450285  Entire home/apt        False         False   
34803  16445.614689  Entire home/apt        False         False   
24348  15499.894165  Entire home/apt        False         False   
48380  13664.305916     Private room        False          True   
50787  13656.358834     Private room        False          True   

      person_capacity  host_is_superhost  multi  biz  cleanliness_rating  \
3590            "2.0"               True      0    1                10.0   
34803           "2.0"              False      0    0                 9.0   
24348           "3.0"               True      0    1                10.0   
48380           "2.0"              False      0    0                 9.0   
50787           "2.0"              False      0    0                 9.0   

       guest_satisfaction  ...      dist  metro_dist   attr_index  \
3590                100.0  ...  1.196536    0.381128   134.904353   
34803               100.0  ...  4.602378    0.118665   260.896109   
24348                95.0  ...  0.269101    0.227193  1100.391676   
48380                87.0  ...  2.239501    0.414395   128.349070   
50787                87.0  ...  2.239486    0.414409   128.349821   

       attr_index_norm   rest_index  rest_index_norm       lng       lat  \
3590          5.086455   275.573716        20.691752  23.73200  37.98600   
34803        12.700335   545.826245        32.072497   2.29772  48.83669   
24348        76.489446  2358.358493        42.228377  -0.13038  51.50995   
48380         9.191812   201.545043         4.818080  16.34356  48.20751   
50787         9.191567   201.546533         4.871302  16.34356  48.20751   

         city weekend  
3590   Athens   False  
34803   Paris   False  
24348  London   False  
48380  Vienna   False  
50787  Vienna    True  

[5 rows x 21 columns]

You can reorder the columns by selecting them in the preferred order like this:

# reorder the columns
df_reordered = df[['city','weekend', 'realSum']]
df_reordered.head()

         city  weekend       realSum
3590   Athens    False  18545.450285
34803   Paris    False  16445.614689
24348  London    False  15499.894165
48380  Vienna    False  13664.305916
50787  Vienna     True  13656.358834

The groupby() method splits the data into groups based on defined criteria.

# group the data according to the city 
df.groupby('city')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002D978AE0280>

This is helpful for group-dependent analysis, for example if we want to know the mean price in each city.

# group the data according to the city and calculate the mean price for each city
price_per_city = df.groupby('city')['realSum'].mean()
print(price_per_city)

city
Amsterdam    573.181612
Athens       151.724876
Barcelona    294.093539
Berlin       244.744835
Budapest     176.507530
Lisbon       238.215403
London       362.675702
Paris        392.534581
Rome         205.396554
Vienna       241.577172
Name: realSum, dtype: float64

Deleting Data

The deletion of data plays an important role in data analysis for getting rid of unnecessary or inaccurate data. drop() is a common method for removing columns or rows. The axis parameter specifies if you want to delete rows (axis=0, the default) or columns (axis=1).

# delete rows 3590 and 34803
df = df.drop([3590, 34803])
df.head()

# delete columns rest_index and rest_index_norm
df = df.drop(['rest_index', 'rest_index_norm'], axis=1)
df.head()

            realSum        room_type  room_shared  room_private  \
24348  15499.894165  Entire home/apt        False         False   
48380  13664.305916     Private room        False          True   
50787  13656.358834     Private room        False          True   
48215  12942.991375  Entire home/apt        False         False   
30626  12937.275101  Entire home/apt        False         False   

      person_capacity  host_is_superhost  multi  biz  cleanliness_rating  \
24348           "3.0"               True      0    1                10.0   
48380           "2.0"              False      0    0                 9.0   
50787           "2.0"              False      0    0                 9.0   
48215           "4.0"              False      0    1                 7.0   
30626           "6.0"              False      0    0                10.0   

       guest_satisfaction  bedrooms      dist  metro_dist   attr_index  \
24348                95.0         3  0.269101    0.227193  1100.391676   
48380                87.0         1  2.239501    0.414395   128.349070   
50787                87.0         1  2.239486    0.414409   128.349821   
48215                93.0         1  1.497979    0.396893   123.776241   
30626               100.0         2  2.938307    0.619716   386.733299   

       attr_index_norm       lng       lat    city  weekend  
24348        76.489446  -0.13038  51.50995  London    False  
48380         9.191812  16.34356  48.20751  Vienna    False  
50787         9.191567  16.34356  48.20751  Vienna     True  
48215         8.864325  16.38614  48.21911  Vienna    False  
30626        26.883339  -0.14658  51.53190  London     True

Remember that you can also select the rows and columns you want to keep and overwrite the old dataframe with your selection. In some cases this might be faster than deleting all the rows/columns you want to get rid off. For example, you can select rows by a condition.

# select/delete rows by condition: we keep all rows with a 'realSum' below 100
df = df[df['realSum'] < 100]
df.head()

         realSum        room_type  room_shared  room_private person_capacity  \
18041  99.906191     Private room        False          True           "2.0"   
20787  99.906191     Private room        False          True           "2.0"   
14759  99.873102  Entire home/apt        False         False           "5.0"   
13623  99.873102  Entire home/apt        False         False           "3.0"   
2136   99.828932     Private room        False          True           "2.0"   

       host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  \
18041               True      1    0                 9.0                99.0   
20787               True      1    0                10.0                95.0   
14759              False      0    1                 9.0                91.0   
13623              False      0    0                 8.0                89.0   
2136                True      1    0                10.0                98.0   

       bedrooms      dist  metro_dist  attr_index  attr_index_norm       lng  \
18041         1  2.357336    0.630488  110.844980         3.659468  -9.12773   
20787         1  2.150651    0.887178  120.517161         3.975050  -9.12500   
14759         1  2.125625    0.307137  227.040210        13.550651  19.02769   
13623         1  2.408637    0.069400  115.642961         6.896333  19.04842   
2136          1  1.604749    0.471949  146.849762         5.536846  23.73082   

            lat      city  weekend  
18041  38.73146    Lisbon    False  
20787  38.72800    Lisbon     True  
14759  47.50505  Budapest     True  
13623  47.47725  Budapest    False  
2136   37.96157    Athens    False

After deletion of data, you might want to reset the index because it is not adjusted automatically during the deletion process.

# reset the index of the dataframe
# drop=True: the old index is replaced by the new one
df = df.reset_index(drop=True)
df.head()

     realSum        room_type  room_shared  room_private person_capacity  \
0  99.906191     Private room        False          True           "2.0"   
1  99.906191     Private room        False          True           "2.0"   
2  99.873102  Entire home/apt        False         False           "5.0"   
3  99.873102  Entire home/apt        False         False           "3.0"   
4  99.828932     Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  \
0               True      1    0                 9.0                99.0   
1               True      1    0                10.0                95.0   
2              False      0    1                 9.0                91.0   
3              False      0    0                 8.0                89.0   
4               True      1    0                10.0                98.0   

   bedrooms      dist  metro_dist  attr_index  attr_index_norm       lng  \
0         1  2.357336    0.630488  110.844980         3.659468  -9.12773   
1         1  2.150651    0.887178  120.517161         3.975050  -9.12500   
2         1  2.125625    0.307137  227.040210        13.550651  19.02769   
3         1  2.408637    0.069400  115.642961         6.896333  19.04842   
4         1  1.604749    0.471949  146.849762         5.536846  23.73082   

        lat      city  weekend  
0  38.73146    Lisbon    False  
1  38.72800    Lisbon     True  
2  47.50505  Budapest     True  
3  47.47725  Budapest    False  
4  37.96157    Athens    False

Adding Data

To add a new column to your dataframe you can simply invent a column name and assign it some values.

# create a new column and fill it with the value 'new'
df['new_column'] = 'new'
df.head()

     realSum        room_type  room_shared  room_private person_capacity  \
0  99.906191     Private room        False          True           "2.0"   
1  99.906191     Private room        False          True           "2.0"   
2  99.873102  Entire home/apt        False         False           "5.0"   
3  99.873102  Entire home/apt        False         False           "3.0"   
4  99.828932     Private room        False          True           "2.0"   

   host_is_superhost  multi  biz  cleanliness_rating  guest_satisfaction  \
0               True      1    0                 9.0                99.0   
1               True      1    0                10.0                95.0   
2              False      0    1                 9.0                91.0   
3              False      0    0                 8.0                89.0   
4               True      1    0                10.0                98.0   

   bedrooms      dist  metro_dist  attr_index  attr_index_norm       lng  \
0         1  2.357336    0.630488  110.844980         3.659468  -9.12773   
1         1  2.150651    0.887178  120.517161         3.975050  -9.12500   
2         1  2.125625    0.307137  227.040210        13.550651  19.02769   
3         1  2.408637    0.069400  115.642961         6.896333  19.04842   
4         1  1.604749    0.471949  146.849762         5.536846  23.73082   

        lat      city  weekend new_column  
0  38.73146    Lisbon    False        new  
1  38.72800    Lisbon     True        new  
2  47.50505  Budapest     True        new  
3  47.47725  Budapest    False        new  
4  37.96157    Athens    False        new

Summary

Access data with [], loc or iloc.
Change data with = new value, apply() or built-in Pandas methods.
Sort and group data with sort_values() and groupby().
Delete data with drop().
Don't forget to save your changes.

References

To be added soon

The author of this entry is Wanja Tolksdorf

Dataframe manipulation with Pandas

Contents