Correlation Plots

From Sustainability Methods

In short: Correlations describe the mutual relationship between two variables. They provide the possibility to measure the relation between any kind of data - continuous and continuous, categorical and categorical, and also continuous and categorical at the same time. In this entry you will learn about how to build correlation plots in R. To learn more about correlations in theoretical terms, please refer to the entry on Correlations.

What are Correlation Plots?

If you want to know more about the relationship between two or more variables, correlation plots are the right tool in your toolbox. Always keep in mind, however, that correlations can tell you whether two variables are related, but they cannot tell you anything about causality between the variables!

With a bit of experience, you can quickly recognize whether there is a relationship between the variables. You can also see whether the relationship is weak or strong, and whether it is positive, negative, or absent. Correlations can be visualized in many different ways. Here we will have a look at the following visualizations:

  • Scatter Plot
  • Scatter Plot Matrix
  • Correlogram

A note on calculating the correlation coefficient: Generally, there are three main methods to calculate the correlation coefficient: Pearson's correlation coefficient, Spearman's rank correlation coefficient and Kendall's rank correlation coefficient. Pearson's correlation coefficient is the most popular among them. It requires continuous data and measures the strength of the linear relationship between two variables. While Pearson's correlation coefficient is a parametric measure, the other two are non-parametric methods based on ranks. They therefore measure monotonic association, either positive or negative, and can also capture relationships that are monotonic but not linear. Spearman's rank correlation coefficient is Pearson's coefficient computed on the ranks of the values, whereas Kendall's rank correlation coefficient measures the similarity between two rankings by counting concordant and discordant pairs of observations. Since Pearson's correlation coefficient is the most frequently used of the three, the examples shown later are based on this method.
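
As a small illustration of the three methods, the following sketch computes each coefficient for the same pair of variables with the method argument of base R's cor() function. The choice of mpg and hp from the built-in mtcars dataset and the variable names are only assumptions made for this illustration.

```r
data("mtcars")

# Pearson: linear association of the raw values (the default method)
pearson <- cor(mtcars$mpg, mtcars$hp, method = "pearson")

# Spearman: Pearson's coefficient applied to the ranks of the values
spearman <- cor(mtcars$mpg, mtcars$hp, method = "spearman")

# Kendall: based on concordant and discordant pairs of observations
kendall <- cor(mtcars$mpg, mtcars$hp, method = "kendall")

# Spearman can be reproduced by ranking first and then using Pearson
spearman_by_hand <- cor(rank(mtcars$mpg), rank(mtcars$hp))

round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3)
```

All three coefficients lie between -1 and 1. Comparing them can give a hint whether a relationship is monotonic but not linear.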

Scatter Plot

Scatter plots are easy to build and the right way to go if you have two numeric variables. They show every observation as a dot in the graph. The position of each dot on the x- and y-axis represents the values of the two numeric variables, and the more widely the dots scatter around a common trend, the weaker the relationship.

data("mtcars")
# Plotting the scatter plot
plot(x = mtcars$mpg,
     y = mtcars$hp,
     main = "Correlation between Miles per Gallon and Horsepower",
     xlab = "Miles per Gallon",
     ylab = "Horsepower",
     pch = 16,
     col = "red",
     las = 1,
     xlim = c(min(mtcars$mpg), max(mtcars$mpg)),
     ylim = c(min(mtcars$hp), max(mtcars$hp)))
# Adding the regression line in a separate call, after the plot has been drawn
abline(lm(hp ~ mpg, data = mtcars), col = "blue")

Figure: MilesHorsepower.png

In this scatter plot you can easily recognize a strong negative relationship between the variables “mpg” and “hp” from the “mtcars” dataset. The Pearson's correlation coefficient is -0.7761684.

#Calculating the coefficient
cor(mtcars$hp,mtcars$mpg)

## Output: [1] -0.7761684

To create such a scatter plot, you call the plot() function and set several graphical parameters. In this example, the following parameters were defined:

  • x: variable that will be displayed on the x-axis.
  • y: variable that will be displayed on the y-axis.
  • main: title of the plot.
  • xlab: title for the x-axis.
  • ylab: title for the y-axis.
  • pch: plotting symbol of the observations, in this case 16, a filled circle. Here you can find an overview of the different possibilities.
  • col: plotting color. You can either write the name of the color or use the color number.
  • las: style of axis labels. 0 (the default) is always parallel to the axis, 1 is always horizontal, 2 is always perpendicular to the axis and 3 is always vertical.
  • xlim: set the limits of the x-axis.
  • ylim: set the limits of the y-axis.
  • abline(): this function adds a straight line to the plot. Here it draws the regression line fitted with lm() for the two variables.
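
The plot and the coefficient can also be combined in one figure. The following is a minimal sketch, not the only way to do it: it fits the regression once, draws the plot and the line in separate calls, and writes the rounded Pearson coefficient into a legend. The variable names fit and r, the formula interface and the legend placement are assumptions made for this illustration.

```r
data("mtcars")

# Fit the linear model once so it can be reused for the line
fit <- lm(hp ~ mpg, data = mtcars)

# Pearson's correlation coefficient, rounded for display
r <- round(cor(mtcars$mpg, mtcars$hp), 2)

# Draw the scatter plot using the formula interface
plot(hp ~ mpg, data = mtcars,
     main = "Correlation between Miles per Gallon and Horsepower",
     xlab = "Miles per Gallon",
     ylab = "Horsepower",
     pch = 16, col = "red", las = 1)

# Add the regression line and annotate the plot with the coefficient
abline(fit, col = "blue")
legend("topright", legend = paste("r =", r), bty = "n")
```

Keeping abline() and legend() as separate calls after plot() makes it clear that they add elements to an existing plot.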

Scatter Plot Matrix

The normal scatter plot is only useful if you want to know the relationship between two variables, but often you are interested in more than two variables. The PerformanceAnalytics package offers a convenient way to visualize multiple variables in a scatter plot matrix. To use it, you have to install the package and load the library. After doing that, you can select the variables that will be displayed in the plot.


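# install.packages("PerformanceAnalytics")  # run once if the package is not installed yet
library(PerformanceAnalytics)

# Selecting the variables: mpg, disp, hp, wt and qsec (columns 1, 3, 4, 6 and 7)
data <- mtcars[, c(1,3,4,6,7)]

# Now calling the chart.Correlation() function and defining a few parameters.
chart.Correlation(data, histogram = TRUE)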

Figure: Scatterplotmatrix.png

The scatter plot matrix from this package is already very nice by default. It splits the plot into an upper, a lower and a diagonal part. The upper part shows the correlation coefficients for the different pairs of variables. The red stars show the results of the built-in correlation test: they range from zero to three stars, and the more stars, the higher the significance of the result. The diagonal contains a histogram for every variable, which shows you its distribution. The bivariate scatter plots are in the lower part of the plot and contain a fitted line by default.
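
To get a feeling for what the stars stand for, the following sketch runs a correlation test for one pair of variables with base R's cor.test() and translates the p-value into stars. The helper p_to_stars() is hypothetical, and the thresholds 0.001, 0.01 and 0.05 are an assumption based on a common convention. This is not the package's own code.

```r
data("mtcars")

# Pearson correlation test for two of the plotted variables
test <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

# Hypothetical helper: map a p-value to stars using assumed thresholds
p_to_stars <- function(p) {
  if (p < 0.001) {
    "***"
  } else if (p < 0.01) {
    "**"
  } else if (p < 0.05) {
    "*"
  } else {
    ""
  }
}

test$estimate                    # the correlation coefficient
test$p.value                     # the p-value of the test
p_to_stars(test$p.value)         # the corresponding number of stars
```

More stars only mean that the result is less likely to be due to chance. They say nothing about the strength of the relationship, which is given by the coefficient itself.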

Correlogram

The correlogram visualizes the calculated correlation coefficients for more than two variables. You can quickly determine whether there is a relationship between the variables or not. The colors also show you the strength and the direction of the relationship. To create such a correlogram, you need to install the R package corrplot and load the library. Before we start to create and customize the correlogram, we calculate the correlation coefficients of the variables and store them in a variable. It is also possible to calculate them while creating the plot, but calculating them beforehand makes your code clearer.

# install.packages("corrplot")  # run once if the package is not installed yet
library(corrplot)
correlations <- cor(mtcars)
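
Before plotting, it can help to look at the object that cor() returns. The short sketch below only uses base R and checks a few properties that any correlation matrix should have. The rounding to two digits is just a choice for readability.

```r
data("mtcars")
correlations <- cor(mtcars)

# One row and one column per variable of the dataset
dim(correlations)

# The matrix is symmetric and every variable correlates perfectly with itself
isSymmetric(correlations)
all.equal(unname(diag(correlations)), rep(1, ncol(mtcars)))

# Correlations of miles per gallon with all other variables, sorted
round(sort(correlations["mpg", -1]), 2)
```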

Clear and meaningful code and plots are important. To achieve this, we change the names of the variables from the “mtcars” dataset into something meaningful. One way to do this is to change the column and row names of the correlation matrix.

# Readable labels, in the same order as the columns of mtcars
variable_labels <- c("Miles per Gallon", "Cylinders",
                     "Displacement", "Horsepower", "Rear Axle Ratio",
                     "Weight", "1/4 Mile Time", "Engine", "Transmission",
                     "Gears", "Carburetors")

correlations <- cor(mtcars)
colnames(correlations) <- variable_labels
rownames(correlations) <- variable_labels

Now, we are ready to customize and plot the correlogram.

corrplot(correlations,
         method = "circle",
         type = "upper",
         order = "hclust",
         tl.col = "black",
         tl.srt = 45,
         tl.cex = 0.6)

Figure: Correlogram.png

The parameters differ from those of the previous scatter plots. Here you need the corrplot() function and set its parameters according to your preferences. Some of the parameters are explained briefly below.

  • method: the visualization method for the correlation matrix. There are seven different methods (“circle”, “square”, “ellipse”, “number”, “shade”, “color”, “pie”). “circle” is the default and shows the correlation between the variables through the color and size of the circles.
  • type: which part of the correlation matrix is displayed. It can be “upper”, “lower” or “full”. “full” is the default.
  • order: ordering method for the variables in the correlation matrix. The “hclust” method orders them by hierarchical clustering, but it is also possible to order them alphabetically (“alphabet”) or by the first principal component (“FPC”). The default, “original”, keeps the order of the input matrix.
  • tl.col: color of the labels.
  • tl.srt: rotation of the labels in degrees.
  • tl.cex: size of the labels.
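
To illustrate the idea behind the “hclust” ordering, the following sketch clusters the variables with base R and reorders the correlation matrix accordingly. It is only an approximation of the idea: corrplot may use different clustering settings internally, so the resulting order need not match the plot exactly. The names distances, clustering, variable_order and reordered are assumptions made for this illustration.

```r
data("mtcars")
correlations <- cor(mtcars)

# Turn correlations into distances: highly correlated variables are close
distances <- as.dist(1 - correlations)

# Cluster the variables and read off the order of the leaves
clustering <- hclust(distances)
variable_order <- clustering$order

# Reorder rows and columns so that similar variables sit next to each other
reordered <- correlations[variable_order, variable_order]
rownames(reordered)
```

Ordering the variables this way tends to group strongly correlated variables into visible blocks, which makes patterns in the correlogram easier to spot.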

As you can see, there are many different ways to visualize correlations between variables. The right correlation plot depends on your data and on the number of variables you want to analyze. But never forget, correlation plots show you only the relationship between the variables and nothing about the causality.
