Correlations

Method categorization
Quantitative	Qualitative
Inductive	Deductive
Individual	System	Global
Past	Present	Future

In short: A correlation analyses the statistical relation between two continuous variables.

Background

Karl Pearson is considered to be the founding father of mathematical statistics; hence it is no surprise that one of the central methods in statistics - to test the relation between two continuous variables - was invented by him at the brink of the 20th century (see Karl Pearson's "Note on regression and inheritance in the case of two parents" from 1895). His contribution was based on work from Francis Galton and Auguste Bravais. With more data becoming available and the need for an “exact science” as part of the industrialization and the rise of modern science, the Pearson correlation paved the road to modern statistics at the beginning of the 20th century. While other approaches such as the t-test or the Analysis of Variance (ANOVA) by Pearson's arch-enemy Fisher demanded an experimental approach, the correlation simply needed continuous data. Hence it appealed to the demand for an analysis that only requires measuring, as in engineering, or counting, as in economics, as the basis for simple correlative tests, without being preoccupied too deeply with the reasoning on why variables correlate. Pearson recognized the predictive power of his discovery, and the correlation remains one of the most abundantly used statistical approaches, for instance in economics, ecology, psychology and social sciences.

What the method does

This graph shows the positive correlation between global nutrition and the life expectancy. Source: Gapminder

By comparison, life expectancy and agricultural land have no correlation - which obviously makes sense. Source: Gapminder

Income does have an impact on how much CO2 a country emits. Source: Gapminder

Correlations analyse the relation between two continuous variables, and test whether the relation is statistically significant, typically using statistical software. The test takes the sample size as well as the strength of the relation between the two variables into account. The so-called correlation coefficient indicates the strength of the relation, and ranges from -1 to 1. While values close to 0 indicate a weak correlation, values close to 1 indicate a strong positive correlation, and values close to -1 indicate a strong negative correlation.
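To make the interplay of strength and sample size concrete, here is a minimal sketch in plain Python. The height and weight numbers are invented for illustration, and the function names are hypothetical. The test statistic t = r * sqrt((n - 2) / (1 - r^2)) is the standard textbook form and is an assumption here, as the text above does not name a specific test.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for 'no correlation'; it has n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Invented illustration data: body height in cm, body weight in kg.
height = [158, 163, 168, 172, 175, 180, 184, 190]
weight = [55, 61, 60, 70, 68, 77, 80, 88]

r = pearson_r(height, weight)
t = t_statistic(r, len(height))
print(f"r = {r:.3f}, t = {t:.2f} with {len(height) - 2} degrees of freedom")

# The same coefficient is stronger evidence when the sample is larger.
print(t_statistic(0.5, 10), t_statistic(0.5, 100))
```

The last line shows why a moderate coefficient can be significant in a large sample but not in a small one: the statistic grows with n while r stays the same.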

Correlations can be applied to continuous variables of the most diverse origins, and hence to data across all spatial and temporal scales. The data can originate from surveys, economics, measurements, industry and many other sources, which is why the method is considered one of the most relevant statistical analysis tools. Since correlations are also used in both inductive and deductive approaches, they are among the most abundantly used quantitative methods to date.

The Pearson correlation is the most abundantly used method of correlation. It is designed for normally distributed data; more precisely, its significance test relies on a test statistic that follows a Student's t-distribution. Kendall's tau and Spearman's rho are other forms of correlation, but I recommend you just look them up, and keep as a rule of thumb that Spearman is more robust when it comes to non-normally distributed data.
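The rule of thumb about Spearman can be illustrated with a small sketch. It assumes the usual definition of Spearman's rho as the Pearson correlation of the ranks; the helper names and the exponential example data are chosen here purely for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]  # monotonic, but strongly skewed

print(f"Pearson r:    {pearson_r(x, y):.3f}")
print(f"Spearman rho: {spearman_rho(x, y):.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman's rho reports a perfect relation, while Pearson's r is pulled down by the skewed, non-linear shape.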

Calculating Pearson's correlation coefficient r

For people with an affinity to math, the formula for calculating a Pearson correlation is still tangible. You just need to be aware that you have two variables or samples, called x and y, and their respective means (m).

This is the formula for calculating the Pearson correlation coefficient r.
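Written out with the symbols named above, and assuming the standard textbook form of the coefficient, the formula reads as follows. The index i, the sample size n and the subscripts on m are notation added here for clarity.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - m_x)(y_i - m_y)}
         {\sqrt{\sum_{i=1}^{n} (x_i - m_x)^2}\,\sqrt{\sum_{i=1}^{n} (y_i - m_y)^2}}
```

The numerator measures how x and y vary together around their means, and the denominator rescales this by the spread of each variable, which is why r always lies between -1 and 1.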

There are some core questions related to the application of correlations:
1) Are relations between two variables positive or negative, and how strong is the estimate of the relation? Being taller leads to a significant increase in body weight. Being smaller leads to an overall lower gross calorie demand. The strength of this relation - what statisticians call the estimate - is an important measure when evaluating correlations and regressions. (A regression implies a causal link between two continuous variables, which makes it different from a correlation, where two variables are related, but not necessarily causally linked. For more on regressions, please refer to the entry on Regression Analysis.)

2) Does the relation show a significantly strong effect, or is it rather weak? In other words, can the relation explain a lot of the variance in your data, or are the results rather weak regarding their explanatory power? The correlation coefficient indicates how strong or weak the correlation is and whether it is positive or negative. It can be between -1 and +1. The relationship of temperature in Celsius and Fahrenheit, for example, is perfectly linear, which should not be surprising as we know that Fahrenheit is defined as 32 + 1.8 * Celsius. Furthermore we can say that 100% of the variation in temperatures in Fahrenheit is explained by the temperature in Celsius: the correlation coefficient is 1.

3) What does the relation between two variables explain? A correlation can explain a lot of variance for some parts of the data, and less variance for other parts. Take the percentage of people working in agriculture within individual countries. At a low income (<5000 Dollar/year) there is a high variance between countries: half of the population of Chad works in agriculture, while in Zimbabwe, with an even slightly lower income, it is only 10 %. At an income above 15000 Dollar/year, however, there is hardly any variance in the share of people who work in agriculture: the proportion is always very low. There are reasons for this: there are probably one or several variables that explain at least part of the high variance within different income segments. Finding such variables that explain previously unexplained variance is a key effort in doing correlation analysis.
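The Celsius and Fahrenheit example from the second question can be checked with a few lines of plain Python. This is only a sketch: the temperature values are arbitrary, and reading the squared coefficient as the share of explained variation is the usual interpretation for a linear relation rather than something derived in this entry.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

celsius = [-10, 0, 10, 20, 30, 40]            # arbitrary example temperatures
fahrenheit = [32 + 1.8 * c for c in celsius]  # exact linear conversion

r = pearson_r(celsius, fahrenheit)
print(f"r = {r:.6f}, share of variation explained (r squared) = {r ** 2:.6f}")

# Flipping the sign of one variable reverses the direction, not the strength.
r_negative = pearson_r(celsius, [-f for f in fahrenheit])
print(f"r with one variable negated = {r_negative:.6f}")
```

Up to floating-point rounding, the first coefficient is 1 and the second is -1, which covers both the strength and the direction asked about in the questions above.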

Examples for the correlation coefficient. Source: Wikipedia, Kiatdd, CC BY-SA 3.0

Reading correlation plots

Seeing a correlation plot and being able to read this plot quickly is the daily job of any data analyst. There are three questions that you should ask yourself whenever looking at a correlation plot:

1) How strong is the relation? Regarding this, it is good to practise. Once you get an eye for the strength of a correlation, you become really fast in understanding relations in data. This may be your first step towards a rather intuitive understanding of a method. Having this kind of skill is essential for anyone interested in approximating facts through quantitative data. Obviously, the further the points scatter, the less they explain. If the points are distributed like stars in the sky, then the relation is probably not significant. If they show however any kind of relation, it is good to know the strength.

2) Is the relation positive or negative? Regarding this, relations can be positive or negative (or neutral). The stronger the estimate of a relation is, the more this relation may matter, some may argue. Of course this is not entirely generalisable, but it is definitely true that a neutral relation only tells you that the relation does not matter. While this is trivial in itself, it is good to get an eye for the strength of estimates, and what they mean for the specific data being analysed. Even weaker relations may give important initial insights. In addition, the normative value of a positive or negative relation typically has strong implications, especially if both directions are theoretically possible. Therefore it is vital to be able to interpret the estimate of a correlation.

3) Does the relation change within parts of the data? Regarding this, the best advice is to look at the initial scatterplot, but also at the residuals. If the scattering of all points is more or less equal across the whole relation, then you may conclude that the errors are equally distributed across the relation. In reality, this is often not the case. Instead we often know less about one part of the data, and more about another part of the data. In addition to this, we often have a stronger relation across parts of the dataset, and a weaker relation across other parts of the dataset. These differences are important, as they hint at underlying influencing variables or factors that we have not understood yet. Becoming versatile in reading scatter plots is a key skill here, as it allows you to track biases and flaws in your dataset and analysis. This is probably the most advanced skill when it comes to reading a correlation plot.
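The third question can also be approached numerically. The sketch below fits a straight line by ordinary least squares, which is an assumption made here since the text does not say how the residuals are obtained, and compares the spread of the residuals in the lower and upper half of x. All numbers are invented and only mimic a pattern with wide scatter at low values and narrow scatter at high values.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented numbers, already sorted by x: wide scatter first, narrow scatter later.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [60, 22, 48, 15, 40, 20, 14, 11, 9, 8, 6, 5]

slope, intercept = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Split at the middle of x and compare how far the points scatter around the line.
half = len(x) // 2
spread_low = statistics.pstdev(residuals[:half])
spread_high = statistics.pstdev(residuals[half:])
print(f"residual spread, lower half of x: {spread_low:.1f}")
print(f"residual spread, upper half of x: {spread_high:.1f}")
```

A clearly larger spread in one half is a hint that the errors are not equally distributed and that another variable may be at work in that part of the data. It does not replace looking at the plot itself.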

Strengths & Challenges

Correlations test for mere relations, but do not depend on deductive reasoning. Hence correlations can be powerful both for inductive predictions and for the initial analysis of data without any underlying theoretical foundation. Yet, with the predictive power of correlations comes a great responsibility for the researchers who apply them, as it is tempting to infer causality purely from correlative results, even if there is no logical connection explaining the correlation. Economics and other fields have a long history of causal interpretation based on basically inductive correlative results. For more thoughts on the connection between correlations and causality, have a look at this entry: Causality and correlation.
Correlations are rather easy to apply, and most software allows you to create simple scatterplots that can then be analyzed using correlations. However, you need some minimal knowledge about data distribution, since for instance the Pearson correlation assumes data that is normally distributed.
There is an endless debate about which correlation coefficient value is high, and which one is low. In other words: how much does a correlation explain, and what is this worth? While this depends widely on the context, it is still remarkable that people keep discussing this. A strong relation can be trivial or wrong, while a weak relation can be an important scientific result. Above all, even a lack of a statistical relation between two variables is already a statistical result.

Normativity

While it is tempting to find causality in correlations, this is potentially difficult, because correlations indicate statistical relations, but not causal explanations, which is a subtle difference. Diverse disciplines - among them economics, psychology and ecology - are widely built on correlative analysis, yet do not always urge caution in the interpretation of correlations.
Another normative problem of correlations is rooted in so-called statistical fishing. With more and more data becoming available, there is an increasing risk that certain correlations are significant just by chance, for which there is a corrective procedure available called Bonferroni correction. However, this is hardly ever applied, and since p-value-driven statistics are increasingly viewed critically, the resulting correlations should be seen as an initial form of a mostly inductive analysis, no more, but also no less. With some practice, p-value-driven statistics can be a robust tool to compare statistical relations in continuous data.
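As a minimal sketch of the Bonferroni correction mentioned above: the significance level is divided by the number of tests, and each p-value is compared against this stricter threshold. The p-values below are invented, the function name is hypothetical, and the significance level of 0.05 is a common convention assumed here.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the corrected threshold and, per test, whether it stays significant."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]

# Invented p-values from ten hypothetical correlation tests on one dataset.
p_values = [0.001, 0.004, 0.012, 0.03, 0.04, 0.049, 0.2, 0.35, 0.6, 0.9]

uncorrected = sum(p <= 0.05 for p in p_values)
threshold, flags = bonferroni_significant(p_values)

print(f"significant without correction: {uncorrected}")
print(f"Bonferroni threshold: {threshold:.4f}, still significant: {sum(flags)}")
```

With these numbers, six tests pass the uncorrected level but only two remain after the correction, which illustrates how quickly apparent findings can shrink once the number of tests is taken into account.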

Outlook

Correlations are among the foundational pillars of frequentist statistics. Nonetheless, with science engaging in more complex designs and analyses, correlations will become increasingly less important. As a robust workhorse for initial analysis, however, they will remain a good starting point for many datasets. Time will tell whether other approaches - such as Bayesian statistics and machine learning - will ultimately become more abundant. Correlations may benefit from a clear comparison to results based on Bayesian statistics.

Key Publications

Hazewinkel, Michiel, ed. (2001). Correlation (in statistics). Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers.

Further Information

If you want to practise recognizing whether a correlation is weak or strong, I recommend spending some time on this website, where you can guess correlation coefficients based on graphs: http://guessthecorrelation.com/

The correlation coefficient: A very detailed and vivid article

The relationship of temperature in Celsius and Fahrenheit: Several examples of interpreting the correlation coefficient

How to read scatter plots

Employment in Agriculture: A detailed database

Kendall's Tau & Spearman's Rank: Two examples for other forms of correlation

Strength of Correlation Plots: Some examples

History of antibiotics: An example for findings when using the inductive approach

Pearson's correlation coefficient: Many examples

Pearson correlation: A quick explanation

Pearson's r Correlation: An example calculation

The author of this entry is Henrik von Wehrden.