[[File:Quan dedu indu indi syst glob past pres futu.png|thumb|right|[[Design Criteria of Methods|Method Categorisation:]]<br>
'''Quantitative''' - Qualitative<br>
'''Deductive''' - '''Inductive'''<br>
'''Individual''' - '''System''' - '''Global'''<br>
'''Past''' - '''Present''' - '''Future''']]
'''In short:''' Correlation analysis examines the statistical relationship between two continuous variables. For R examples on Correlations, please refer to [[Correlation Plots]].
== Background ==
[[File:Correlation.png|400px|thumb|right|'''SCOPUS hits per year for Correlations until 2020.''' Search terms: 'Correlation' in Title, Abstract, Keywords. Source: own.]]

Karl Pearson is considered the founding father of mathematical statistics; hence it is no surprise that one of the central methods in statistics - testing the relationship between two continuous variables - was invented by him at the turn of the 20th century (see Karl Pearson's "Notes on regression and inheritance in the case of two parents" from 1895). His contribution built on work by Francis Galton and Auguste Bravais. With more data becoming available and the demand for an "exact science" as part of industrialization and the rise of modern science, the Pearson correlation paved the road to modern statistics. While other approaches such as the t-test or the Analysis of Variance ([[ANOVA]]) by Pearson's arch-enemy Fisher demanded an experimental design, the correlation simply required data with a continuous measurement level. Hence it appealed to the demand for an analysis that could be conducted based solely on measurements, as in engineering, or on counts, as in economics, without being preoccupied too deeply with the reasoning why the variables correlate. '''Pearson recognized the predictive power of his discovery, and correlation analysis became one of the most abundantly used statistical approaches in disciplines as diverse as economics, ecology, psychology and the social sciences.''' Later came regression analysis, which implies a causal link between two continuous variables. This makes it different from a correlation, where two variables are related, but not necessarily causally linked. This article focuses on correlation analysis and only touches upon regressions; for more, please refer to the entry on [[Regression Analysis]].
== What the method does ==
Correlation analysis examines the relationship between two [[Data formats|continuous variables]] and tests whether that relationship is statistically significant. For this, correlation analysis takes both the sample size and the strength of the relationship between the two variables into account. The so-called ''correlation coefficient'' indicates the strength of the relationship and ranges from -1 to 1. A coefficient close to 0 indicates a weak correlation. A coefficient close to 1 indicates a strong positive correlation, and a coefficient close to -1 indicates a strong negative correlation.
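
To make this tangible, here is a minimal sketch in base R (the language also used in [[Correlation Plots]]); the built-in swiss dataset merely serves as example data:
<syntaxhighlight lang="R" line>
# Minimal sketch: correlation coefficient and significance test in base R.
# The built-in 'swiss' dataset only serves as example data.
data(swiss)
cor(swiss$Fertility, swiss$Examination)       # the coefficient alone, about -0.65
cor.test(swiss$Fertility, swiss$Examination)  # adds the t-value, degrees of freedom and p-value
</syntaxhighlight>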
Correlations can be applied to all kinds of quantitative continuous data from all spatial and temporal scales and from diverse methodological origins, including [[Survey]]s and census data, ecological measurements, economic measurements, GIS and more. Correlations are also used in both inductive and deductive approaches. This versatility makes correlation analysis one of the most frequently used quantitative methods to date.
'''There are different forms of correlation analysis.''' The Pearson correlation is usually applied to normally distributed data; its significance test builds on the [https://365datascience.com/students-t-distribution/ Student's t-distribution]. Alternative correlation measures like [https://www.statisticssolutions.com/kendalls-tau-and-spearmans-rank-correlation-coefficient/ Kendall's tau and Spearman's rho] are usually applied to variables that are not normally distributed. I recommend you just look them up, and keep as a rule of thumb that Spearman's rho is the most robust correlation measure when it comes to non-normally distributed data.
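
A minimal sketch of how to switch between these three measures in R - base R's cor() takes them as its method argument:
<syntaxhighlight lang="R" line>
# Minimal sketch: the three common correlation measures in base R.
# Kendall's tau and Spearman's rho are rank-based and therefore
# more robust for non-normally distributed data.
data(swiss)
x <- swiss$Fertility
y <- swiss$Agriculture
cor(x, y, method = "pearson")   # the default
cor(x, y, method = "kendall")
cor(x, y, method = "spearman")
</syntaxhighlight>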
==== Calculating Pearson's correlation coefficient r ====
The formula to calculate [https://www.youtube.com/watch?v=2B_UW-RweSE a Pearson correlation coefficient] is fairly simple. You just need to keep in mind that you have two variables or samples, called x and y, and their respective means (m).

[[File:Bildschirmfoto 2020-05-02 um 09.46.54.png|400px|center|thumb|This is the formula for calculating the Pearson correlation coefficient r.]]
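
Written out, this is the standard Pearson formula, with <math>m_x</math> and <math>m_y</math> as the means of the two variables:

<math>r = \frac{\sum_{i=1}^{n} (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_{i=1}^{n} (x_i - m_x)^2} \cdot \sqrt{\sum_{i=1}^{n} (y_i - m_y)^2}}</math>

Translating this one-to-one into R and comparing it against the built-in function is a useful check - a minimal sketch:
<syntaxhighlight lang="R" line>
# Minimal sketch: computing Pearson's r by hand and checking it
# against the built-in cor() function.
data(swiss)
x <- swiss$Fertility
y <- swiss$Examination
mx <- mean(x)
my <- mean(y)
r_manual <- sum((x - mx) * (y - my)) /
  sqrt(sum((x - mx)^2) * sum((y - my)^2))
r_manual   # identical to ...
cor(x, y)  # ... the built-in result
</syntaxhighlight>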
 
__NOTOC__
=== Conducting and reading correlations ===
There are some core questions related to the application and reading of correlations. These can be of interest whenever you have the correlation coefficient at hand - for example, in statistical software - or when you see a correlation plot.

'''1) Is the relationship between two variables positive or negative?''' If one variable increases and the other one increases, too, we have a positive ("+") correlation. This is also true if both variables decrease. For instance, greater height comes with a significant increase in [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3534609/ body weight]. If, on the other hand, one variable increases while the other decreases, the correlation is negative ("-"): for example, the relationship between 'pizza eaten' and 'pizza left' is negative. The more pizza slices are eaten, the fewer slices are left. The direction of a relationship tells you a lot about how two variables might be logically connected. The normative value of a positive or negative relation typically has strong implications, especially if both directions are theoretically possible. Therefore it is vital to be able to interpret the direction of a correlative relationship.
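
The pizza example can even be computed directly: because the total number of slices is fixed, the two counts are perfectly negatively correlated. A minimal sketch, assuming eight slices:
<syntaxhighlight lang="R" line>
# Minimal sketch: with a fixed total of 8 slices, 'eaten' and 'left'
# are perfectly negatively related.
eaten <- 0:8
left  <- 8 - eaten
cor(eaten, left)   # exactly -1
</syntaxhighlight>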
'''2) Is the correlation coefficient small or large?''' It can range from -1 to +1 and is an important measure when we evaluate the strength of a statistical relationship. Data points may scatter widely in a [[Correlation_Plots#Scatter_Plot|scatter plot]], or there may be a rather linear relationship - and everything in between. An example of a perfect positive correlation (with a correlation coefficient ''r'' of +1) is the relationship between temperature in [[To_Rule_And_To_Measure#Celsius_vs_Fahrenheit_vs_Kelvin|Celsius and Fahrenheit]]. This should not be surprising, since Fahrenheit is defined as 32 + 1.8 × Celsius. Their relationship is therefore perfectly linear, which results in such a strong correlation coefficient. We can thus say that 100% of the variation in temperature in Fahrenheit is explained by the temperature in Celsius.
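
This is easy to verify - a minimal sketch in R with a made-up range of temperatures:
<syntaxhighlight lang="R" line>
# Minimal sketch: a perfectly linear relationship yields r = +1.
celsius    <- -10:40             # made-up example temperatures
fahrenheit <- 32 + 1.8 * celsius
cor(celsius, fahrenheit)         # exactly 1
</syntaxhighlight>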
On the other hand, you might encounter two variables whose data points scatter widely across the plot, without any significant relationship to be found. The correlation coefficient ''r'' might be around 0.1 or 0.2. Here, you can assume that there is no strong relationship between the two variables, and that one variable does not explain the other.

Some may argue that the stronger the correlation coefficient of a relation, the more that relation matters. If the points are distributed like stars in the sky, the relationship is probably neither significant nor interesting. Of course this is not entirely generalisable, but it is definitely true that a neutral relation only tells you that the relation does not matter. At the same time, even weaker relations may give important initial insights into the data, and if two variables show any kind of relation, it is good to know its strength. By practising to quickly grasp the strength of a correlation, you become fast at understanding relationships in data. This skill is essential for anyone interested in approximating facts through quantitative data.
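
One way to practise this is to simulate variable pairs with a known correlation strength and inspect the scatter plots. A minimal sketch, assuming standard normal variables, where <math>y = r \cdot x + \sqrt{1 - r^2}\, e</math> has population correlation r with x:
<syntaxhighlight lang="R" line>
# Minimal sketch: simulate variable pairs with a chosen population
# correlation r, then compare the target with the sample estimate.
set.seed(42)
n <- 200
x <- rnorm(n)
for (r in c(0.1, 0.5, 0.9)) {
  y <- r * x + sqrt(1 - r^2) * rnorm(n)
  plot(x, y, main = paste("target r =", r,
                          "| sample r =", round(cor(x, y), 2)))
}
</syntaxhighlight>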
'''3) What does the relationship between two variables explain?''' This is already an advanced skill, and is rather related to regression analysis. If you have looked at the strength of a correlation and its direction, you are generally good to go. But sometimes, these measures change in different parts of the data.

To illustrate this, let us look at the example of the percentage of people working in [https://ourworldindata.org/employment-in-agriculture?source=post_page--------------------------- Agriculture] within individual countries. Across the world, countries with a low income (<5000 Dollar/year) show a high variability in agricultural employment: half of the population of Chad works in agriculture, while in Zimbabwe it is only 10%. However, at an income above 15000 Dollar/year, there is hardly any variance in the percentage of people that work in agriculture: it is always very low. If you plotted this, you would see that the data points are broadly spread for the lower x-values (with x as the income), but follow a more linear pattern in the higher income areas. There are reasons for this, and there are probably one or several variables that explain this variability. Maybe other factors have a stronger influence on the percentage of farmers in lower income groups than in higher ones, where income is a good predictor.

A correlation analysis helps us identify such variances in the data relationship, and we should look at correlation coefficients and the direction of the relationship for different parts of the data. We often have a stronger relation across some parts of the dataset, and a weaker relation across other parts. These differences are important, as they hint at underlying influencing variables or factors that we do not understand yet.
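
A minimal sketch of this idea with simulated data - the variable names and numbers are purely illustrative, not the real employment data:
<syntaxhighlight lang="R" line>
# Minimal sketch with simulated (illustrative!) data: the strength of a
# relationship can differ between parts of a dataset. Here the relation
# is noisy at low x-values and tight at high x-values.
set.seed(1)
x <- runif(300, 0, 30000)                  # e.g. income in Dollar/year
noise <- ifelse(x < 5000, 20, 2)           # much more scatter at low incomes
y <- pmax(0, 60 - x / 500 + rnorm(300, sd = noise))  # e.g. % working in agriculture
cor(x[x < 5000], y[x < 5000])              # weak relation in the noisy part
cor(x[x >= 15000], y[x >= 15000])          # strong (negative) relation in the tight part
</syntaxhighlight>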
 
  
[[File:Bildschirmfoto 2019-10-18 um 10.38.48.png|400px|thumb|center|'''This graph shows the positive correlation between global nutrition and life expectancy.''' Source: Gapminder]]
[[File:Bildschirmfoto 2019-10-18 um 10.51.34.png|400px|thumb|center|'''By comparison, life expectancy and agricultural land show no correlation - which obviously makes sense.''' Source: Gapminder]]
[[File:Bildschirmfoto 2019-10-18 um 10.30.35.png|400px|thumb|center|'''Income does have an impact on how much CO2 a country emits.''' Source: Gapminder]]
  
=== A quick introduction to regression lines ===
As you can see, '''correlation analysis is first and foremost a matter of identifying ''if'' and ''how'' two variables are related.''' We do not necessarily assume that we can predict the value of one variable based on the value of the other - we only see how they are related. A correlation is often shown in a scatter plot: the x-axis is one variable, the y-axis the other. You can see this in the example below. Then, a line is fitted to the data. This line - the "regression line" - represents the correlation coefficient. It is the line that best approximates all data points, in the sense that it minimises the overall squared distance between the line and the points. If all data points lie exactly on the line, we have a correlation of +1 or -1 (depending on the direction of the line). The further the data points are from the line, the closer the correlation coefficient is to 0, and the less meaningful the correlation is.
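
A minimal sketch of a scatter plot with a fitted regression line in base R, again using the built-in swiss dataset as example data:
<syntaxhighlight lang="R" line>
# Minimal sketch: a scatter plot with a least-squares regression line.
data(swiss)
plot(swiss$Fertility, swiss$Examination,
     xlab = "Fertility", ylab = "Examination")
abline(lm(Examination ~ Fertility, data = swiss), col = "red")  # the regression line
cor(swiss$Fertility, swiss$Examination)                         # r, about -0.65
</syntaxhighlight>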
  
[[File:Correlation coefficient examples.png|600px|thumb|center|'''Examples for the correlation coefficient.''' Source: Wikipedia, Kiatdd, CC BY-SA 3.0]]
<br>
'''It is however important to know two things:'''<br>
1) Do not confuse the slope of this line (the 'regression coefficient') - i.e. how many units y changes per unit change in x along the regression line - with the correlation coefficient. They are not the same, and this often leads to confusion. The regression coefficient of the line can easily be 5 or 10, but the correlation coefficient will always be between -1 and +1, as the sketch below illustrates.
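
A minimal sketch with simulated data makes the difference tangible - both simulated variables below have a regression slope of roughly 10, but very different correlation coefficients:
<syntaxhighlight lang="R" line>
# Minimal sketch: the regression slope and the correlation coefficient
# measure different things. Both y1 and y2 have a slope of about 10,
# but only y1 is tightly coupled to x.
set.seed(7)
x  <- rnorm(100)
y1 <- 10 * x + rnorm(100, sd = 1)    # little noise
y2 <- 10 * x + rnorm(100, sd = 50)   # a lot of noise
coef(lm(y1 ~ x))["x"]; cor(x, y1)    # slope about 10, r close to +1
coef(lm(y2 ~ x))["x"]; cor(x, y2)    # slope still about 10, r much closer to 0
</syntaxhighlight>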
  
2) Regressions only really make sense if there is some kind of causal explanation for the relationship. We can fit a regression line to any pair of correlated variables, but we might end up suggesting a causal relationship where there really is none. As an example, have a look at the correlation below. There is no regression line here, but the visualisation implies that there is some connection, right? However, it does not really make sense that the divorce rate in Maine and margarine consumption are related, even though their correlation coefficient is obviously quite high! So you should always question correlations, and ask yourself which kinds of variables are tested for their relationship, and whether you can derive meaningful results from doing so.

[[File:860-header-explainer-correlationchart.jpg|500px|thumb|center|'''Correlations can be deceitful'''. Source: [http://www.tylervigen.com/spurious-correlations Spurious Correlations]]]
  
 
== Strengths & Challenges ==
* Correlation analysis can be a powerful tool both for inductive reasoning, which works without a theoretical foundation, and for deductive reasoning, which is based on theory. This makes it versatile and has enabled new discoveries as well as the support of existing theories.
* The versatility of the method extends across all spatial and temporal scales and basically any discipline that uses continuous data. This makes it clear why correlation analysis has become such a workhorse for many researchers over time, and why it is so prevalent in public debates.
* Correlations are rather easy to apply, and most software can produce simple scatter plots that can then be analyzed using correlations. However, you need some minimal knowledge about data distributions, since for instance the Pearson correlation assumes normally distributed data.
 
  
  
 
== Normativity ==
* With the power of correlations comes a great responsibility for the researcher. It can be tempting to infer causality and a logical relationship between two variables purely from the results of correlations. Economics and other fields have a long history of causal interpretation based on observed associations from the results of correlation analyses. However, researchers should always question whether there is a plausible connection between two variables, even if - or especially when - the correlation seems so clear. A regression analysis, which allows for the prediction of data beyond what can be observed, should only be done if there is a logical underlying connection. Keep in mind that regression = correlation + causality. For more thoughts on the connection between correlations and causality, have a look at this entry: [[Causality and correlation]].
* Another normative problem of correlations is rooted in so-called statistical fishing. With more and more data becoming available, there is an increasing chance that certain correlations are significant merely by chance, for which there is a corrective procedure called the [https://www.youtube.com/watch?v=HLzS5wPqWR0 Bonferroni correction] (see the sketch below this list). However, it is seldom applied. Today, p-value-driven statistics are increasingly viewed critically, and the resulting correlations should be seen as no more than an initial form of a mostly inductive analysis. With some practice, p-value-driven statistics can be a robust tool to compare statistical relations in continuous data, but more complex methods may be useful to better understand the relationships in the data.
* There is an endless debate about what constitutes a meaningful, strong correlation. Yet this depends widely on the context and the field of research - for some disciplines, topics, or research questions, a correlation of +0.4 may be meaningful, while it is mostly irrelevant in others. It is a matter of experience and contextualisation how much meaning we infer from correlation coefficients. Furthermore, finding no correlation between two variables is an important statistical result, too.
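
A minimal sketch of the Bonferroni correction in base R - the p-values are made-up stand-ins for, say, the results of three separate cor.test() calls:
<syntaxhighlight lang="R" line>
# Minimal sketch: Bonferroni-adjusting p-values from multiple tests.
# The p-values are made-up examples, e.g. from three cor.test() calls.
p_raw <- c(0.004, 0.020, 0.049)
p.adjust(p_raw, method = "bonferroni")  # each p multiplied by 3, capped at 1
</syntaxhighlight>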
  
  
 
== Outlook ==
Correlations are among the foundational pillars of frequentist statistics. Nonetheless, with science engaging in ever more complex designs and analyses, correlations will gradually become less important. As a robust workhorse for initial analyses, however, they will remain a good starting point for many datasets. Time will tell whether other approaches - such as [[Bayesian Inference|Bayesian statistics]] and [[Machine Learning|machine learning]] - will ultimately become more abundant. Correlations may benefit from a clear comparison to results based on Bayesian statistics. Until then, we should all be aware of the possibilities and limits of correlations, and of what they can - and cannot - tell us about data and its underlying relationships.
  
  
 
== Key Publications ==
Hazewinkel, Michiel, ed. (2001). ''Correlation (in statistics)''. Encyclopedia of Mathematics. Springer Science+Business Media B.V. / Kluwer Academic Publishers.

Babbie, Earl (2016). ''The Practice of Social Research''. 14th ed. Boston: Cengage Learning.

Neuman, W. Lawrence (2014). ''Social Research Methods: Qualitative and Quantitative Approaches''. 7th ed. Pearson.
== Further Information ==
* If you want to practise recognizing whether a correlation is weak or strong, I recommend spending some time on this website, where you can guess correlation coefficients based on graphs: http://guessthecorrelation.com/
* [https://online.stat.psu.edu/stat501/lesson/1/1.7 The relationship of temperature in Celsius and Fahrenheit]: Several examples of interpreting the correlation coefficient
* [https://www.mathbootcamps.com/reading-scatterplots/ How to read scatter plots]
* [https://ourworldindata.org/employment-in-agriculture?source=post_page--------------------------- Employment in Agriculture]: A detailed database

[[Category:Statistics]]

The [[Table of Contributors|authors]] of this entry are Henrik von Wehrden and Christopher Franz.