Scatterplots in Python

Latest revision as of 12:24, 3 September 2024

THIS ARTICLE IS STILL IN EDITING MODE

Introduction

A scatterplot is a helpful data visualization tool for observing the relationship between two variables, which is often summarized as their correlation. In the following, we will go through the steps to create scatterplots in Python with the Matplotlib and Seaborn libraries and explain how to interpret them.

By convention, the variable on the horizontal (or x) axis of a scatterplot is the independent one, also called the predictor, and the one on the vertical (or y) axis is the dependent variable (or response). The scatterplot visually displays how the dependent variable changes with the independent one (Jerimi, 2017).

Basic Scatterplot of Two Variables

As an example, we will use the Iris dataset. We can load it using the Scikit-learn library.

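import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris data and give the columns short names without units
iris_data = load_iris()
df_iris = pd.DataFrame(iris_data.data,
                       columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])
# Add the species names as a column; the grouped pairplot further below uses it
df_iris['species'] = pd.Categorical.from_codes(iris_data.target, iris_data.target_names)
df_iris.head()

# Create a scatterplot of two variables (sepal length and sepal width)
df_iris.plot.scatter(x='sepal_length', y='sepal_width', color='DarkBlue')
plt.show() # Figure 1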

Figure 1: Scatterplot of sepal length and width
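
For more control over labels and titles, a comparable figure can be drawn with Matplotlib directly. The following is only a minimal sketch and not part of the original example: the names fig, ax, sepal_length and sepal_width as well as the label texts are illustrative choices, and the measurements are read straight from the array returned by load_iris().

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Columns 0 and 1 of the Iris data array hold sepal length and sepal width (cm)
iris_data = load_iris()
sepal_length = iris_data.data[:, 0]
sepal_width = iris_data.data[:, 1]

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(sepal_length, sepal_width, color='darkblue')

# Label the axes explicitly, including the unit of measurement
ax.set_xlabel('Sepal length (cm)')
ax.set_ylabel('Sepal width (cm)')
ax.set_title('Sepal length vs. sepal width')

# In an interactive session, display the figure with plt.show()
```

Working on an explicit Axes object (ax) makes it easy to adjust single elements of the plot later, such as axis limits or tick labels.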

Adding the Distribution of the two Variables to the Scatterplot

To understand two variables better, it can be helpful to add their respective distributions. This makes the central tendency, shape and width of each variable visible, which is important for perceiving the relationship between the two variables in further detail (Hehman & Xie, 2021). For more information on the distribution of data, see the Data distribution wiki entry (https://sustainabilitymethods.org/index.php/Data_distribution).

For the visualization we need to import the Seaborn library, as shown in the following code. For more information on the jointplot function, see the documentation (https://seaborn.pydata.org/generated/seaborn.jointplot.html#seaborn.jointplot).

import seaborn as sns 
sns.jointplot(data=df_iris, x='sepal_length', y='sepal_width', kind='scatter')
plt.show() # Figure 2

Figure 2: Jointplot of sepal length and width
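
The properties that the marginal distributions show visually can also be summarized numerically. The sketch below is an illustrative addition: it assumes Scikit-learn 0.23 or newer for as_frame=True, and the selection of statistics and the name summary are assumptions rather than part of the original example. It computes the mean and median (central tendency), the standard deviation (width) and the skewness (shape) of both variables.

```python
from sklearn.datasets import load_iris

# as_frame=True returns the measurements as a pandas DataFrame
iris = load_iris(as_frame=True)
sepal = iris.data[['sepal length (cm)', 'sepal width (cm)']]

# One row per variable: central tendency, width and shape of its distribution
summary = sepal.agg(['mean', 'median', 'std', 'skew']).T
print(summary.round(2))
```

Such a table complements the jointplot but does not replace it, since features like multiple peaks are only visible in the plotted distribution.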

Displaying a Whole Dataset in One Plot

If you want to have it all in one plot, there is the Seaborn pairplot, which shows scatterplots and distributions of all numeric variables. There are some options to consider when creating a pairplot:

Color the data in different groups with the hue = 'variable' keyword.
Specify how the distribution of each variable is displayed on the diagonal with diag_kind, for example diag_kind = 'kde' for a density curve or diag_kind = 'hist' for a histogram.
With the keyword markers, change the marker shape of each group to distinguish the groups better.
The keyword plot_kws takes a dictionary of keyword arguments that is passed to the underlying plotting function (scatterplot). Here, alpha controls the transparency of the markers, s sets the size of the markers, and edgecolor sets the color of the markers' edges.
The keyword height (called size in Seaborn versions before 0.9) sets the height of each facet in inches (Koehrsen, 2018); the width of a facet is its height multiplied by the keyword aspect.

Even though this visual is very powerful, be careful when using it on datasets with many numeric variables, as it gets crowded and eventually hard to interpret. For further details on the code, have a look at the documentation.

sns.pairplot(df_iris, hue='species', diag_kind='hist', markers=["o", "s", "D"],
             plot_kws={'alpha': 0.9, 's': 40, 'edgecolor': 'k'}, height=3)
plt.suptitle('Pairplot of Iris Dataset grouped by Species', y=1.01) # Sets a Title above the plot
plt.show() # Figure 3

Figure 3: Pairplot of Iris Dataset grouped by Species

Interpreting Scatterplots

When analysing scatterplots, we can consider several aspects:

Direction: A trend that appears to rise suggests a positive relationship, while a falling trend suggests a negative relationship.
Pattern: The shape of the pattern suggests the type of relationship. For example, a line hints at a linear relationship, while a more complex pattern might suggest a non-linear relationship. If no pattern is visible, this often implies that there is no linear correlation.
Strength: How closely the points fit the shape (for example a line) indicates the strength of the association.
Outliers: Points that fall far from the main cloud of points may indicate anomalies in the data (Wilke, 2019, p. 118; Jerimi, 2017).
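
Direction and strength can also be expressed as a number, the Pearson correlation coefficient r. The following sketch uses synthetic data that is not part of the article; the helper pearson_r and all variable names are hypothetical. It compares a rising, a falling and a curved pattern.

```python
import numpy as np

# Synthetic example data: 201 evenly spaced x values plus random noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 201)
noise = rng.normal(0, 0.3, size=x.size)

y_rising = 2 * x + noise     # rising linear trend
y_falling = -2 * x + noise   # falling linear trend
y_curved = x ** 2 + noise    # clear pattern, but not a linear one


def pearson_r(a, b):
    """Return the Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


for name, y in [('rising', y_rising), ('falling', y_falling), ('curved', y_curved)]:
    print(f'{name:8s} r = {pearson_r(x, y):+.2f}')
```

The coefficient for the curved data is close to zero although x and y are clearly related. This is why r only describes linear relationships and why looking at the scatterplot itself remains important.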

By grouping the data, as done in the pairplot, we can go one step further and add another layer to our understanding of the relationship. In the Iris dataset, we see that the species are easily separable in some scatterplots, which we can use, for example, for classification.
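
The effect of grouping can be checked numerically as well. The sketch below is an illustrative addition with assumed names (df, r_overall, r_by_species): it compares the correlation between sepal length and sepal width over all flowers with the correlation within each species.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Keep only the two sepal measurements and add the species names
iris_data = load_iris()
df = pd.DataFrame(iris_data.data[:, :2], columns=['sepal_length', 'sepal_width'])
df['species'] = pd.Categorical.from_codes(iris_data.target, iris_data.target_names)

# Correlation over all 150 flowers, ignoring the species
r_overall = df['sepal_length'].corr(df['sepal_width'])

# Correlation within each species separately
r_by_species = {
    name: group['sepal_length'].corr(group['sepal_width'])
    for name, group in df.groupby('species', observed=True)
}

print(f'overall: {r_overall:+.2f}')
for name, r in r_by_species.items():
    print(f'{name}: {r:+.2f}')
```

For these two variables, the overall coefficient is slightly negative, while each species on its own shows a positive one. The grouping therefore changes how the relationship should be interpreted.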

Sources (recommended to read for a deep dive)

1. Hehman, E., & Xie, S. Y. (2021). Doing better data visualization. Advances in Methods and Practices in Psychological Science, 4(4), 251524592110453. https://doi.org/10.1177/25152459211045334

2. Jerimi. (2017, June 3). Reading scatterplots - MathBootCamps. Retrieved from https://www.mathbootcamps.com/reading-scatterplots/

3. Koehrsen, W. (2018, July 6). Visualizing Data with Pairs Plots in Python - Towards Data Science. Medium. Retrieved from https://towardsdatascience.com

4. Wilke, C. O. (2019). Fundamentals of data visualization: A primer on making informative and compelling figures. (Can be downloaded here: https://data.vk.edu.ee/powerbi/opikud/Fundamentals_of_Data_Visualization.pdf) (Further reading on data visualization in general and on scatterplots in particular from page 117 onwards)

 The author of this entry is Hedda Fiedler.