== Data distribution ==
 
[https://www.youtube.com/watch?v=bPFNxD3Yg6U Data distribution] is the most basic and also a fundamental step of analysis for any given data set. On the other hand, data distribution encompasses some of the most complex concepts in statistics, including a diversity of concepts that translate further into many different steps of analysis. Consequently, without [https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/ understanding the basics of data distribution], it is next to impossible to understand any statistics down the road. Data distribution can be seen as the [https://www.statisticshowto.datasciencecentral.com/probability-distribution/ fundamentals], and we shall often return to these when building statistics further.
  
===The normal distribution===
 
[[File:Bell curve deviation.jpg|thumb|500px|left|'''This is an ideal bell curve with the typical deviation in per cent.''' The σ sign (sigma) stands for standard deviation: within the range of -1 to +1 σ you have about 68.2% of your [[Glossary|data]]. Within -2 to +2 σ you have about 95.4% of the data, and so on.]]
 
 
How wonderful, it is truly a miracle how almost everything that can be measured seems to be following the normal distribution. Overall, the normal distribution is not only the most abundantly occurring, but also the [https://www.maa.org/sites/default/files/pdf/upload_library/22/Allendoerfer/stahl96.pdf earliest distribution] that was known. It follows the premise that most values in any given dataset cluster around a mean value, and only small amounts of the data are found at the extremes.
  
'''Most phenomena we can observe follow a normal distribution.''' The fact that many do not want this to be true is, I think, associated with the fact that it makes us assume that the world is not complex, which is counterintuitive to many. While I believe that the world can be complex, there are many natural laws that explain many phenomena we investigate. The Gaussian [https://www.youtube.com/watch?v=mtbJbDwqWLE normal distribution] is such an example. [https://studiousguy.com/real-life-examples-normal-distribution/ Most things] that can be measured in any sense (length, weight etc.) are normally distributed, meaning that if you measure many different items of the same thing, the data follows a normal distribution.
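The percentages in the bell curve figure can be checked with simulated data. The following is only a minimal sketch: the body heights are made up (mean of 170 and standard deviation of 8 are illustrative assumptions, not measured values), and the variable names are chosen for this example.

<syntaxhighlight lang="R" line>
# Simulate 100,000 body heights; mean and sd are illustrative assumptions
set.seed(42)
heights <- rnorm(100000, mean = 170, sd = 8)

# Distance of each value from the mean, expressed in standard deviations
z <- abs(heights - mean(heights)) / sd(heights)

# Share of values within 1 and within 2 standard deviations of the mean
within_1sd <- mean(z <= 1)
within_2sd <- mean(z <= 2)

within_1sd  # close to 0.68
within_2sd  # close to 0.95
</syntaxhighlight>

With real measurements the shares will rarely match this well, but for large samples of normally distributed data they should be in the same range.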
  
The easiest example is [https://statisticsbyjim.com/basics/normal-distribution/ tallness of people]. While there is a gender difference in terms of height, all people that would identify as e.g. females have a certain height. Most have a different height from each other, yet there are almost always many of a mean height, and few very small and few very tall females within a given population. There are of course exceptions, for instance due to selection biases. The members of a professional basketball team would for instance follow a selection [[Bias in statistics|bias]], as these would ideally need to be tall. Within the normal population, people's height follows the normal distribution. The same holds true for weight, and many other things that can be measured.
<br/>
[[File:Gauss Normal Distribution.png|thumb|400px|center|'''Since it was discovered by Gauss, it is only logical that you can find the normal distribution even on a 10 DM bill.''']]
 
  
==== Sample size matters ====
[[File:NormalDistributionSampleSize.png|thumb|500px|right|'''Sample size matters.''' As these five plots show, bigger samples will more likely show a normal distribution.]]
  
 
Most things in their natural state follow a normal distribution. If somebody tells you that something is not normally distributed, this person is either very clever or not very clever. A [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3915399/ small sample] can prevent you from finding a normal distribution. '''If you weigh five people you will hardly find a normal distribution, as the sample is obviously too small.''' While it may seem like a magic trick, it is actually true that many phenomena that can be measured will follow the normal distribution, at least when your sample is large enough. Consequently, much of probabilistic statistics is built on the normal distribution.
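To see the effect of sample size for yourself, you can draw samples of different sizes from the same normal distribution and compare the histograms. This is a sketch with simulated body weights; the mean of 70 kg, the standard deviation of 10 kg and the chosen sample sizes are assumptions for illustration only.

<syntaxhighlight lang="R" line>
set.seed(1)

# Sample sizes to compare (illustrative choice)
sample_sizes <- c(5, 50, 500, 5000)

# Draw one sample per size from the same simulated population of body weights
samples <- lapply(sample_sizes, function(n) rnorm(n, mean = 70, sd = 10))

# Plot the four histograms in a 2 x 2 grid
par(mfrow = c(2, 2))
for (i in seq_along(samples)) {
  hist(samples[[i]], main = paste("n =", sample_sizes[i]), xlab = "weight (kg)")
}
par(mfrow = c(1, 1))
</syntaxhighlight>

The histogram for five people will hardly resemble a bell curve, while the larger samples approach it more and more closely.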
 
 
[[File:GapminderWeightRanked.png|thumb| This graphic from gapminder.org shows you how male body mass index is ranked within all countries of the world. Germany by the way is number 32.]]
 
  
==== Why some distributions are skewed ====
[[File:SkewedDistribution.png|thumb|500px|right|'''Data can be skewed.''' These graphs show you how distributions can differ according to mode, median and mean of the displayed data.]]
The most abundant reason for a deviation from the normal distribution is us. We changed the planet and ourselves, creating effects that may change everything, up to the normal distribution. Take [https://link.springer.com/content/pdf/10.1186/1471-2458-12-439.pdf weight]. Today the human population shows a very complex pattern in terms of weight distribution across the globe, and there are many reasons why the weight distribution does not follow a normal distribution. There is no such thing as a normal weight, but studies from indigenous communities show a normal distribution in the weight found in their populations. Within our wider world, this is clearly different. Yet before we bash the Western diet, please remember that never before in the history of humans did we have a more steady stream of calories, which is not all bad.

'''Distributions can have different [https://www.youtube.com/watch?v=XSSRrVMOqlQ skews].''' A symmetrical distribution without skew is basically the normal distribution or bell curve that you can see in the picture. But distributions can also be skewed to the left or to the right, depending on how mode, median and mean differ. For the symmetrical normal distribution they are of course all the same, but for a right-skewed distribution the order is typically mode < median < mean, and for a left-skewed distribution it is typically the reverse.
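The relation between median and mean in skewed data can be illustrated with a short simulation. This is a sketch only: the exponential distribution is used here as one convenient example of right-skewed data, and the variable names are made up for this example.

<syntaxhighlight lang="R" line>
set.seed(7)

# Right-skewed example data from an exponential distribution (illustrative choice)
x <- rexp(10000, rate = 1)

mean_x <- mean(x)
median_x <- median(x)

# In right-skewed data the long tail pulls the mean above the median
mean_x > median_x

# Histogram with the median (solid line) and the mean (dashed line)
hist(x, breaks = 50, main = "Right-skewed data")
abline(v = c(median_x, mean_x), lty = c(1, 2))
</syntaxhighlight>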
  
See [https://sustainabilitymethods.org/index.php/Simple_Tests#Tests_for_normal_distribution Tests for normal distribution] to learn how to check if the data is normally distributed.
 
  
==== Detecting the normal distribution ====
[[File:Car Accidents Barplot 2.jpg|thumb|400px|left|'''This is a time series visualized through barplots.''']]
[[File:Car Accidents Histogram 2.jpg|thumb|400px|left|'''This is the same data as a histogram.''']]
[[File:Car Accidents Boxplot 2.jpg|thumb|400px|left|'''And this is the data as a boxplot.''' You can see that the data is roughly normally distributed because the whiskers and the quarters have nearly the same length.]]
'''But when is data normally distributed?''' And how can you recognize it when you have a [[Barplots, Histograms and Boxplots|boxplot]] in front of you? Or a histogram? The best way to learn it is to look at it. Always remember the ideal picture of the bell curve (you can see it above), especially if you look at histograms. If the histogram of your data shows a long tail to either side, or has multiple peaks, your data is not normally distributed. The same is the case if your boxplot's whiskers are largely uneven.

You can also use the Shapiro-Wilk test to check for normal distribution. If the test returns insignificant results (p-value > 0.05), the data does not deviate significantly from a normal distribution, and we can assume normal distribution.
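As a rough illustration of the Shapiro-Wilk test, you can compare a simulated normal sample with a clearly skewed one. The data below is simulated and the parameters are arbitrary assumptions; only <syntaxhighlight lang="R" inline>shapiro.test()</syntaxhighlight> itself is the base R function referred to above.

<syntaxhighlight lang="R" line>
set.seed(123)

# One sample from a normal distribution, one from a right-skewed distribution
normal_sample <- rnorm(200, mean = 50, sd = 5)
skewed_sample <- rexp(200, rate = 0.1)

# shapiro.test() returns a list; the p-value is stored in $p.value
p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

p_normal  # a value above 0.05 means normality is not rejected
p_skewed  # a value below 0.05 indicates a deviation from normality
</syntaxhighlight>

Keep in mind that a p-value above 0.05 is no proof of a normal distribution. It only means that the test found no significant deviation, which happens easily in small samples.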
 
  
 
This barplot (at the left) represents the number of front-seat passengers that were killed or seriously injured per month from 1969 to 1984 in Great Britain. And here comes the magic trick: If you sort the monthly numbers from the lowest to the highest (and slightly lower the resolution, i.e. group them into bins), a roughly bell-shaped distribution emerges (histogram at the left).
  
'''If you would like to know how one can create the diagrams which you see here, this is the R code:'''
  
 
<syntaxhighlight lang="R" line>

# If you want some general information about the "Seatbelts" dataset, at which we will have a look, you can use the ?-function.
# As "Seatbelts" is a dataset in R, you can receive a lot of information here. You can see all datasets available in R by typing data().

?Seatbelts

# to have a look at the dataset "Seatbelts" you can use several commands

## str() to know what data type "Seatbelts" is (e.g. a Time-Series, a matrix, a dataframe...)
str(Seatbelts)

## use show() or just type the name of the dataset ("Seatbelts") to see the table and all data it's containing
show(Seatbelts)
# or
Seatbelts

## summary() to have the most crucial information for each variable: minimum/maximum value, median, mean...
summary(Seatbelts)

# As you saw when you used the str() function, "Seatbelts" is a Time-Series, which makes it hard to work with it. We should change it into a dataframe (as.data.frame()). We will also name the new dataframe "seat", which is more handy to work with.

seat <- as.data.frame(Seatbelts)

# To choose a single variable of the dataset, we use the '$' operator. If we want a barplot with all front-seat passengers, we type:

barplot(seat$front)

# For a histogram:

hist(seat$front)

</syntaxhighlight>
  
==== The QQ-Plot ====
[[File:Different distributions.png|thumb|right| We found this great overview by [http://people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf Aswath Damodaran] ]]
[[File:Data caterpillar.png|thumb|right|1. Growth of caterpillars in relation to tannin content in food]]
The command <syntaxhighlight lang="R" inline>qqnorm</syntaxhighlight> (or <syntaxhighlight lang="R" inline>qqplot</syntaxhighlight> for comparing two samples) will return a Quantile-Quantile plot. This plot allows for a visual inspection of how your model residuals behave in relation to a normal distribution. On the y-axis there are the standardised residuals and on the x-axis the theoretical quantiles. The simple answer is: if your data points are on the diagonal line you are fine, you have normal errors, and you can stop reading here. If you want to know more about the theory behind this, please continue.

Residuals are the difference between your response variable and the fitted values.
<br>
<br>
'''Residuals = response variable - fitted values'''
<br>
<br>
For a regression analysis this would be the difference between your data points and the regression line.
The standardised residuals depend on the model function you are applying.

In the following example, the standardised residuals are the residuals divided by the standard deviation. Let's take the caterpillar data set as an example. On the right you can see the table with the data: growth of caterpillars in relation to tannin content of their diet. Below, we will discuss some correlation plots between these two factors.

[[File:Plot caterpillar.png|thumb|left|2. Plotting the data in an x-y plot already gives you an idea that growth probably depends on the tannin content.]]
[[File:Qqplot2.png|thumb|right|4. The qqplot for this model looks good. Here the points are mostly on the line, with point 4 and point 7 being slightly above and below the line. Still you would consider the residuals in this case to behave normally.]]
[[File:Plot regression.png|thumb|center|3. Plotted regression line of the regression model <syntaxhighlight lang="R" inline>lm(growth~tannin)</syntaxhighlight> for testing the relation between two factors]]

[[File:Qqplot notnomral.jpg|thumb|left|5. A gamma distribution, where the variance increases with the square of the mean.]]
[[File:Qqplot negbinom.jpg|thumb|center|6. A negative binomial distribution that is clearly not following a normal distribution. In other words, here the points are not on the line, and the visual inspection of this qqplot concludes that your residuals are not normally distributed.]]

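The steps described for the QQ-plot can be sketched in R as follows. Since the caterpillar data is not shipped with R, this example uses the built-in <syntaxhighlight lang="R" inline>cars</syntaxhighlight> dataset (speed and stopping distance) purely as a stand-in. The model and the variable names are therefore assumptions for illustration and not the model shown in the figures.

<syntaxhighlight lang="R" line>
# Simple linear regression on a built-in stand-in dataset
model <- lm(dist ~ speed, data = cars)

# Residuals = response variable - fitted values
res <- cars$dist - fitted(model)

# Standardised residuals as R computes them for a linear model
std_res <- rstandard(model)

# QQ-plot of the standardised residuals against the normal distribution
qqnorm(std_res)
qqline(std_res)
</syntaxhighlight>

If the points follow the line drawn by <syntaxhighlight lang="R" inline>qqline()</syntaxhighlight>, the residuals behave roughly normally. Calling <syntaxhighlight lang="R" inline>plot(model, which = 2)</syntaxhighlight> should give a comparable diagnostic plot.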
===Non-normal distributions===
'''Sometimes the world is [https://www.statisticshowto.com/probability-and-statistics/non-normal-distributions/ not normally distributed].''' At a closer examination, this makes perfect sense under the specific circumstances. It is therefore necessary to understand which [https://www.isixsigma.com/tools-templates/normality/dealing-non-normal-data-strategies-and-tools/ reasons] exist why data is not normally distributed.
  
==== The Poisson distribution ====
[[File:Bildschirmfoto 2020-04-08 um 12.05.28.png|thumb|500px|'''This picture shows you several possible Poisson distributions.''' They differ according to the lambda, the rate parameter.]]
[https://www.youtube.com/watch?v=BbLfV0wOeyc Things that can be counted] are often [https://www.britannica.com/topic/Poisson-distribution not normally distributed], but are instead skewed to the right. While this may seem curious, it actually makes a lot of sense. Take an example that coffee-drinkers may like. '''How many people do you think drink one or two cups of coffee per day? Quite many, I guess.''' How many drink 3-4 cups? Fewer people, I would say. Now how many drink 10 cups? Only a few, I hope. A similar and maybe more healthy example could be found in sports activities. How many people do 30 minutes of sport per day? Quite many, maybe. But how many do 5 hours? Only very few. In phenomena that can be counted, such as sports activities in minutes per day, most people will tend to a lower amount of minutes, and few to a high amount of minutes.

Now here comes the funny surprise. Transform data that follows a [https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459 Poisson distribution] with the decadic logarithm (log), and it will typically come close to the normal distribution. Hence skewed data can often be transformed to match the normal distribution. While many people refrain from this, it actually may make sense in such examples as [https://sustainabilitymethods.org/index.php/Is_the_world_linear%3F island biogeography]. Discovered by MacArthur & Wilson, it is a prominent example of how the log of the numbers of species and the log of island size are closely related. While this is one of the fundamental basics of ecology, a statistician would have preferred the use of the Poisson distribution.
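How the rate parameter lambda shapes a Poisson distribution can be explored with simulated counts. This is a minimal sketch; the three lambda values and the sample size are arbitrary assumptions chosen for illustration.

<syntaxhighlight lang="R" line>
set.seed(2020)

# Three rate parameters to compare (illustrative choice)
lambdas <- c(1, 4, 10)

# Simulate 10,000 counts for each lambda
counts <- lapply(lambdas, function(l) rpois(10000, lambda = l))

# For a Poisson distribution, mean and variance are both close to lambda
poisson_means <- sapply(counts, mean)
poisson_vars <- sapply(counts, var)
poisson_means
poisson_vars

# Small lambda: strongly right-skewed, larger lambda: more symmetric
par(mfrow = c(1, 3))
for (i in seq_along(lambdas)) {
  barplot(table(counts[[i]]), main = paste("lambda =", lambdas[i]))
}
par(mfrow = c(1, 1))
</syntaxhighlight>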
  
===== Example for a log transformation of a Poisson distribution =====
[[File:Poisson Education small.png|thumb|400px|left]]
[[File:Poisson Education log small.png|thumb|400px|left]]

One example for skewed data can be found in the R data set “swiss”, which contains data about socio-economic indicators of about 50 provinces in Switzerland in 1888. The variable we would like to look at is “Education”, which shows how many men in the army (in %) have an education level beyond primary school.

As you can see when you look at the first diagram, in 30 provinces no more than 10 percent of the men received education beyond primary school.
 
<syntaxhighlight lang="R" line>

# we will work with the swiss dataset.
# to obtain a histogram of the variable Education, you type

hist(swiss$Education)

# you transform the data series with the natural logarithm by the use of log()

log_edu <- log(swiss$Education)

# the Shapiro-Wilk test checks the transformed data for normal distribution

shapiro.test(log_edu)

# and as the p-value is higher than 0.05, we can assume that log_edu is normally distributed

</syntaxhighlight>
  
 
====The Pareto distribution====
[[File:Bildschirmfoto 2020-04-08 um 12.28.46.png|thumb|300px|'''The Pareto distribution can also be applied when we are looking at how wealth is spread across the world.''']]

'''Did you know that most people wear 20 % of their clothes 80 % of their time?''' This observation can be described by the [https://www.youtube.com/watch?v=EAynHZE-lK4 Pareto distribution]. For many phenomena that describe proportions within a given population, you often find that few make a lot, and many make little. Unfortunately this is often the case for workloads, and we shall hope to change this. For such proportions the [https://www.statisticshowto.com/pareto-distribution/ Pareto distribution] is quite relevant. Consequently, it is rooted in [https://www.pragcap.com/the-pareto-principle-and-wealth-inequality/ income statistics]. Many people have a small to average income, and few people have a large income. This is what makes this distribution so important for economics, and also for sustainability science.
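The 80/20 pattern can be reproduced with simulated incomes. This is a sketch under stated assumptions: base R has no built-in Pareto generator, so the values are drawn by inverse transform sampling, and the shape parameter of about 1.16 is chosen because it corresponds to an 80/20 split in theory. All names and units are hypothetical.

<syntaxhighlight lang="R" line>
set.seed(99)

n <- 100000
shape <- log(5) / log(4)  # about 1.16, the shape that corresponds to the 80/20 rule
scale <- 1                # smallest possible income (hypothetical unit)

# Inverse transform sampling of Pareto distributed incomes
income <- scale / runif(n)^(1 / shape)

# Share of the total income that is held by the richest 20 %
sorted_income <- sort(income, decreasing = TRUE)
top20_share <- sum(sorted_income[1:(n / 5)]) / sum(sorted_income)
top20_share
</syntaxhighlight>

Because the distribution has a very heavy tail, the share varies noticeably between simulations, but the richest fifth should hold by far the largest part of the total.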
=== Visualizing data: Boxplots ===
A nice way to visualize a data set is to draw a [[Barplots,_Histograms_and_Boxplots#Boxplots|boxplot]]. You get a rough overview of how the data is distributed, and moreover you can see at a glance whether it could be normally distributed. The same is true for [[Barplots,_Histograms_and_Boxplots#Histograms|histograms]], but we will focus on the boxplot for now. For more information on both these forms of data visualisation, please refer to the entry on [[Barplots, Histograms and Boxplots]].

'''What are the components of a boxplot and what do they represent?'''
[[File:Boxplot.png|frameless|500px|right]]
 
 
The '''median''' marks the exact middle of your data, which is something different from the mean. If you imagine a series of random numbers, e.g. 3, 5, 7, 12, 26, 34, 40, the median would be 12.

But what if your data series comprises an even number of numbers, like 1, 6, 19, 25, 26, 55? You take the mean of the two numbers in the middle (19 and 25), which is 22, and hence 22 is your median.
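Both median examples can be checked quickly in R. The variable names are chosen for this sketch only.

<syntaxhighlight lang="R" line>
# The two series from the text
odd_series <- c(3, 5, 7, 12, 26, 34, 40)
even_series <- c(1, 6, 19, 25, 26, 55)

median(odd_series)   # 12, the value in the middle
median(even_series)  # 22, the mean of 19 and 25

# The mean of the first series is different from its median
mean(odd_series)
</syntaxhighlight>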
The space between the lower quartile line and the upper quartile line (the box) is called the interquartile range ('''IQR'''), which is important to define the length of the '''whiskers'''. The data points which are not in the range of the whiskers are called '''outliers''', which could e.g. be a hint that they are due to measuring errors. To define the limit of the upper whisker, you take the value of the upper quartile and add the product of 1.5 * IQR. The whisker then reaches to the most extreme data point that still lies within this limit.
  
[[File:Boxplot Boxplot Text 2.jpg|thumb|400px|right|'''The boxplot for the series of data:''' 6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95]]
  
 
'''Sticking to our previous example:'''
 
The IQR is the range between the lower (14) and the upper quartile (87), therefore 73.
 
 
Multiply 73 by 1.5 and add it to the value of the upper quartile: 87 + 109.5 = 196.5
 
<syntaxhighlight lang="R" line>
#boxplot for our random series of numbers 6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95

boxplot.example<-c(6,7,14,15,21,43,76,81,87,89,95)
</syntaxhighlight>
  
==Simple data visualisation==
 
 
====Scatter Plot====
 
'''Description'''
 
Scatter plots can be useful for showing the relationship between two things,  because they allow you to encode data simultaneously on a horizontal x‐axis and vertical y‐axis to see whether and what relationship exists.
 
''- Cole Nussbaumer Knaflic - Storytelling with Data''
 
 
You can create scatter plots if you have a pair of continuous (or numeric) data.
 
  
'''Examples in R'''
  
'''Example 1: Basic Scatter Plot'''
The basic scatter plot that we will create is based on a dataset that comes built-in with R, called <syntaxhighlight lang="R" inline>trees</syntaxhighlight>.
 
 
 
The data set contains data on the girth, height and the volume of different trees.
 
 
 
We will first plot a basic scatter plot and then customize it.
 
 
 
'''Structure of the Data'''
 
The data frame for <syntaxhighlight lang="R" inline>trees</syntaxhighlight> dataset looks like this:
 
 
 
{| class="wikitable"
 
|-
 
! Girth !! Height !! Volume
 
|-
 
| 8.3|| 70|| 10.3
 
|-
 
| 8.6|| 65|| 10.3
 
|-
 
| 8.8 || 63 || 10.2
 
|-
 
| ...|| ...|| ...
 
|}
 
Here, the data for all the columns are numeric. So, no further data transformation is necessary.
 
 
 
'''R Code to Plot the Data'''
 
<syntaxhighlight lang="R" line>
 
# look at the data
 
head(trees)
 
 
 
# Plot a basic scatter plot
 
plot(x = trees$Girth, y = trees$Height)
 
</syntaxhighlight>
 
 
 
Result in R
 
[[File:Basic Scatter Plot.png.png|This is a basic scatter plot made using R.]]
 
 
 
'''Example 2: Better Scatter Plot'''
 
In this section, we will take the plot from the previous example and customize it by changing the shape and color of the points, and by adding a title and x- and y-axis labels to the plot.
 
 
 
R code to plot the chart
 
<syntaxhighlight lang="R">
 
# look at the data
 
head(trees)
 
 
 
# Create a scatter plot with labels and colors
 
plot(x=trees$Girth, y=trees$Height, # choose the x- and y-values
 
    pch=16,                        # choose how points look on the plot
 
    col='blue',                    # choose the color of the points
 
    main='Scatter Plot of Girth and Height of Trees', # main header of the plot
 
    xlab='Tree girth', ylab='Tree height')            # x- and y-axis labels
 
</syntaxhighlight>
 
 
 
Result in R
 
[[File:Better Scatter Plot.png|Minor customizations make the plot look more professional and understandable.]]
 
 
 
Minor customizations make the plot look more professional and understandable.
 
 
 
'''Related Links'''
 
* [[Histogram]]
 
* density plot
 
* box plot
 
 
 
====Bar chart====
 
'''Description'''
 
(Also known as: column chart)
 
 
 
A bar chart displays quantitative values for different categories. The chart comprises line marks (bars) – not rectangular areas – with the size attribute (length or height) used to represent the quantitative value for each category.
 
''- Andy Kirk - Data Visualization''
 
 
 
'''General Structure of Bar Chart'''
 
 
 
[[File:Bar chart structure.png|This figure shows the structure of a bar chart.]]
 
 
 
'''Example in R'''
 
 
 
We will first plot the bar chart shown in the section above.
 
The basic bar chart that we will plot will be based on a dataset built-in to R called <syntaxhighlight lang="R" inline>mtcars</syntaxhighlight>. The data set contains data on specifications of different cars. One such specification is the number of gears a given car's transmission has. 
 
We will first create a summary table that contains the number of cars for a given count of gears. Then, we will use that table to create the plot.
 
 
 
Structure of the Data
 
The table that contains information about the frequency of cars for a given number of gears looks like this:
 
 
 
{| class="wikitable"
 
|-
 
! gears !! freq
 
|-
 
| 3|| 15
 
|-
 
| 4|| 12
 
|-
 
| 5|| 5
 
|}
 
 
 
Here, the data in the <syntaxhighlight lang="R" inline>gears</syntaxhighlight> column are categories, and the data in the <syntaxhighlight lang="R" inline>freq</syntaxhighlight> column are numeric.
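
The summary table above can be reproduced in a few lines. This is a minimal sketch: the column names <syntaxhighlight lang="R" inline>gears</syntaxhighlight> and <syntaxhighlight lang="R" inline>freq</syntaxhighlight> are set by hand to match the table, and <syntaxhighlight lang="R" inline>gear_table</syntaxhighlight> is a name introduced here for illustration.

<syntaxhighlight lang="R">
# count the cars per number of gears and turn the result into a data frame
gear_table <- as.data.frame(table(mtcars$gear))

# rename the default columns (Var1, Freq) to match the table above
names(gear_table) <- c("gears", "freq")

gear_table
</syntaxhighlight>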
 
 
 
'''Example 1: Basic Bar Chart'''
 
 
 
R code to plot the chart
 
 
 
<syntaxhighlight lang="R" line>
 
# get the data
 
gears <- table(mtcars$gear)
 
 
 
# Plot a basic bar chart with a title and labels
 
barplot(gears,
 
        main = "Frequency of Vehicles of each Gear Type",  # title of the plot
 
        xlab = "Number of Gears", ylab = "Number of Cars")  # labels of the plot
 
</syntaxhighlight>
 
 
 
Result in R
 
This is what the output in R looks like.
 
 
 
[[File:Bar Chart.png]]
 
 
 
'''Related Links'''
 
* [[Histogram]]
 
* [[Scatter plot]]
 
 
 
====Line chart====
 
'''Description'''
 
A line chart shows how quantitative values for different categories have changed over time. They are typically structured around a temporal x-axis with equal intervals from the earliest to latest point in time. Quantitative values are plotted using joined-up lines that effectively connect consecutive points positioned along a y-axis. The resulting slopes formed between the two ends of each line provide an indication of the local trends between points in time. As this sequence is extended to plot all values across the time frame it forms an overall line representative of the quantitative change over time story for a single categorical value. 
 
 
 
Multiple categories can be displayed in the same view, each represented by a unique line. Sometimes a point (circle/dot) is also used to substantiate the visibility of individual values. The lines used in a line chart will generally be straight. However, sometimes curved line interpolation may be used as a method of estimating values between known data points. This approach can be useful to help emphasise a general trend. While this might slightly compromise the visual accuracy of discrete values if you already have approximations, this will have less impact.
 
 
 
''(Note- the description was based on a book by Andy Kirk named "Data Visualization")''
 
 
 
'''Examples in R'''
 
 
 
We will first plot the line chart shown in the section above.
 
 
 
The basic line chart that we will plot will be based on a built-in dataset called <syntaxhighlight lang="R" inline>EuStockMarkets</syntaxhighlight>. The data set contains data on the closing stock prices of different European stock indices over the years 1991 to 1998.
 
 
 
To make things easier, we will first transform the built-in dataset into a data frame object. Then, we will use that data frame to create the plot.
 
 
 
Structure of the Data
 
The table that contains information about the different market indices looks like this:
 
 
 
{| class="wikitable"
 
|-
 
! DAX !! SMI !! CAC !! FTSE
 
|-
 
| 1628.75|| 1678.1 || 1772.8 || 2443.6
 
|-
 
| 1613.63|| 1688.5 || 1750.5 || 2460.2
 
|-
 
| 1606.51|| 1678.6 || 1718.0 || 2448.2
 
|-
 
| ... || ... || ... || ...
 
|}
 
 
 
Here, the data for all the columns are numeric.
 
 
 
'''Example 1: Basic Line Chart'''
 
This line chart shows the development of the <syntaxhighlight lang="R" inline>DAX</syntaxhighlight> index from the table in the previous section.
 
 
R code to plot the chart
 
 
 
<syntaxhighlight lang="R" line>
 
# read the data as a data frame
 
eu_stocks <- as.data.frame(EuStockMarkets)
 
 
 
# Plot a basic line chart
 
plot(eu_stocks$DAX,  # simply select a stock index
 
    type='l')      # choose 'l' for line chart
 
</syntaxhighlight>
 
 
 
Result in R
 
 
 
[[File:Simple line chart.png]]
 
 
 
As you can see, the plot is very simple. We can enhance the way this plot looks by making a few tweaks as shown in the section below.
 
 
 
'''Example 2: Better Looking Line Chart'''
 
Here, we will plot the DAX index again as we did in Example 1. However, the plot will be enhanced to be more informative and aesthetically pleasing.
 
 
 
R code to plot the chart
 
<syntaxhighlight lang="R">
 
# get the data
 
eu_stocks <- as.data.frame(EuStockMarkets)
 
 
 
# Plot a customized line chart
 
plot(eu_stocks$DAX, # select the data
 
    type='l',      # choose 'l' for line chart
 
    col='blue',    # choose the color of the line
 
    lwd = 2,      # choose the line width
 
    main = 'Line Chart of DAX Index (1991-1998)',        # title of the plot
 
    xlab = 'Time (1991 to 1998)', ylab = 'Closing value (index points)') # x- and y-axis labels
 
</syntaxhighlight>
 
 
 
Result in R
 
 
 
[[File:Line chart.png]]
 
 
 
You can see that this plot looks much more informative and attractive.
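
Note that converting the time series to a data frame drops the time index, so the x-axis in the plots above shows observation numbers rather than years. The following is a hedged sketch of one way to keep the time axis, using the base R function <syntaxhighlight lang="R" inline>time()</syntaxhighlight>; the names <syntaxhighlight lang="R" inline>years</syntaxhighlight> and <syntaxhighlight lang="R" inline>dax</syntaxhighlight> are introduced here for illustration.

<syntaxhighlight lang="R">
# time() returns the time index of a time series in decimal years
years <- as.numeric(time(EuStockMarkets))

# select the DAX column of the original time series object
dax <- as.numeric(EuStockMarkets[, "DAX"])

# Plot the index against the actual time axis
plot(x = years, y = dax,
     type = 'l',                                    # choose 'l' for line chart
     main = 'Line Chart of DAX Index (1991-1998)',  # title of the plot
     xlab = 'Year', ylab = 'DAX (closing value)')   # x- and y-axis labels
</syntaxhighlight>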
 
 
 
'''Related Links'''
 
* [[Scatter plot]]
 
* [[Stacked line chart]]
 
* [[Box plot]]
 
 
 
====Histogram====
 
'''Description'''
 
A histogram displays the frequency and distribution for a range of quantitative groups. Whereas bar charts compare quantities for different categories, a histogram technically compares the number of observations across a range of value ‘bins’ using the size of lines/bars (if the bins relate to values with equal intervals) or the area of rectangles (if the bins have unequal value ranges) to represent the quantitative counts. With the bins arranged in meaningful order (that effectively form ordinal groupings) the resulting shape formed reveals the overall pattern of the distribution of observations.
 
 
 
''- Andy Kirk - Data Visualization''
 
 
 
'''General Structure of Histogram'''
 
 
 
[[File:Histogram structure.png|This is how a histogram looks.]]
 
 
 
'''Examples in R'''
 
 
 
We will first plot the histogram shown in the general structure section above.
 
 
 
The basic histogram that we will plot will be based on a built-in dataset called <syntaxhighlight lang="R" inline>cars</syntaxhighlight>. This data set contains data on stopping distance of different cars at different speeds.
 
 
 
Since both the values are numeric, we don't need to transform the data in any way in order to plot a histogram.
 
 
 
Structure of the Data
 
The table that contains information about the stopping distance of different cars at a given speed looks like this:
 
 
 
{| class="wikitable"
 
|-
 
! speed !! dist
 
|-
 
| 4|| 2
 
|-
 
| 4|| 10
 
|-
 
| 7|| 4
 
|-
 
| 7|| 22
 
|-
 
| 8|| 16
 
|-
 
| 9|| 10
 
|-
 
| ...|| ...
 
|}
 
 
 
Here, the data for both <syntaxhighlight lang="R" inline>speed</syntaxhighlight> and <syntaxhighlight lang="R" inline>dist</syntaxhighlight> columns are numeric.
 
 
 
'''Example 1: Basic Histogram'''
 
(with <syntaxhighlight lang="R" inline>speed</syntaxhighlight> variable)
 
 
 
R code to plot the chart
 
 
 
<syntaxhighlight lang="R" line>
 
# data that we are going to use
 
View(cars)
 
 
 
# Plot a basic histogram
 
hist(cars$speed,
 
    main = "Histogram for speed of cars", # main title
 
    xlab = "Speed") # x-axis label
 
</syntaxhighlight>
 
 
 
Result in R
 
 
 
[[File:Simple Histogram.png]]
 
 
 
'''Example 2: Better looking Histogram'''
 
(with <syntaxhighlight lang="R" inline>dist</syntaxhighlight> variable)
 
 
 
R code to plot the chart
 
<syntaxhighlight lang="R">
 
# data that we are going to use

View(cars)



# Plot a customized histogram
 
hist(cars$dist,
 
    breaks = 15, # define the number of bins you want in the histogram
 
    col = 'seagreen', # define the color of the bars in the histogram
 
    main = "Histogram for stopping distance of cars", # main title
 
    xlab = "Stopping Distance") # x-axis label
 
</syntaxhighlight>
 
 
 
Result in R
 
 
 
[[File:Better Histogram.png|This is a better looking histogram.]]
 
 
 
'''Related Links'''
 
* [[Scatter plot]]
 
* density plot
 
* box plot
 
  
=== More forms of data distribution ===
Of course, there are more types of data distribution. We found this great overview by [http://people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf Aswath Damodaran], which helps you investigate the type of distribution in your data. [[File:Different distributions.png|frameless|1000px|center| '''A guide to detecting the right distribution.''' Source: [http://people.stern.nyu.edu/adamodar/pdfiles/papers/probabilistic.pdf Aswath Damodaran]]]
  
 
==External links==
 
 
 
====Videos====
 
 
 
[https://www.youtube.com/watch?v=bPFNxD3Yg6U Data Distribution]: A crash course
 
 
[https://www.youtube.com/watch?v=9TDjifpGj-k Bayes theorem]: A detailed explanation
 
  
[https://www.youtube.com/watch?v=FlIiYdHHpwU F test]: An example calculation
 
  
 
====Articles====
 
 
 
[https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/ Probability Distributions]: 6 common distributions you should know
 
  
 
[http://www.oecd.org/statistics/compare-your-income.htm Compare your income]: A tool by the OECD
 
 
[http://www.sthda.com/english/wiki/f-test-compare-two-variances-in-r F test]: An example in R
 
 
----
 
 
[[Category:Statistics]]
 

Latest revision as of 13:44, 13 June 2021

Data distribution

Data distribution is the most basic and also a fundamental step of analysis for any given data set. On the other hand, data distribution encompasses the most complex concepts in statistics, thereby including also a diversity of concepts that translates further into many different steps of analysis. Consequently, without understanding the basics of data distribution, it is next to impossible to understand any statistics down the road. Data distribution can be seen as the fundamentals, and we shall often return to these when building statistics further.

The normal distribution

This is an ideal bell curve with the typical deviation in per cent. The σ sign (sigma) stands for standard deviation: within the range of -1 to +1 σ you have about 68.2% of your data. Within -2 to +2 σ you have 95.4% of the data and so on.
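
These shares can be checked with the cumulative distribution function of the normal distribution. The following is a minimal sketch in R, assuming a standard normal distribution; coverage is a helper name introduced here for illustration. The exact share within one σ rounds to 68.3%, so a value of 68.2% presumably results from adding two rounded halves of 34.1%.

<syntaxhighlight lang="R">
# share of a normal distribution that lies within k standard deviations of the mean
coverage <- function(k) pnorm(k) - pnorm(-k)

round(100 * coverage(1), 1)  # 68.3
round(100 * coverage(2), 1)  # 95.4
round(100 * coverage(3), 1)  # 99.7
</syntaxhighlight>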

How wonderful, it is truly a miracle how almost everything that can be measured seems to follow the normal distribution. Overall, the normal distribution is not only the most frequently occurring distribution, but also one of the earliest that was known. It follows the premise that most values in any given dataset cluster around a mean value, and only a small share of the data is found at the extremes.

Most phenomena we can observe follow a normal distribution. The fact that many do not want this to be true is, I think, associated with the fact that it makes us assume that the world is not complex, which is counterintuitive to many. While I believe that the world can be complex, there are many natural laws that explain many of the phenomena we investigate. The Gaussian normal distribution is one such example. Most things that can be measured in any sense (length, weight etc.) are normally distributed, meaning that if you measure many different items of the same kind, the data follows a normal distribution.

The easiest example is the height of people. While there is a gender difference in terms of height, all people who identify as e.g. female have a certain height. Most differ in height from each other, yet there are almost always many of about mean height, and few very small and few very tall females within a given population. There are of course exceptions, for instance due to selection biases. The members of a professional basketball team would for instance reflect a selection bias, as these players are ideally tall. Within the general population, people’s height follows the normal distribution. The same holds true for weight, and many other things that can be measured.

Discovered by Gauss, it is only fitting that you can find the normal distribution even on the 10 DM banknote.


Sample size matters

Sample size matters. As these five plots show, bigger samples will more likely show a normal distribution.

Most things in their natural state follow a normal distribution. If somebody tells you that something is not normally distributed, this person is either very clever or not very clever. A small sample can prevent you from finding a normal distribution. If you weigh five people you will hardly find a normal distribution, as the sample is obviously too small. While it may seem like a magic trick, it is actually true that many phenomena that can be measured will follow the normal distribution, at least when your sample is large enough. Consequently, much of probabilistic statistics is built on the normal distribution.
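
The effect of sample size can be tried out with simulated data. The sketch below draws a small and a large sample from the same normal distribution; the mean of 170 and the standard deviation of 10 (think of body height in cm) as well as the seed are illustrative assumptions, not values from a real study.

<syntaxhighlight lang="R">
set.seed(42)  # arbitrary seed, only used to make the example reproducible

# two samples from the same normal distribution, differing only in size
small_sample <- rnorm(5, mean = 170, sd = 10)
large_sample <- rnorm(5000, mean = 170, sd = 10)

# compare the shapes side by side
par(mfrow = c(1, 2))
hist(small_sample, main = "n = 5", xlab = "Height (cm)")
hist(large_sample, main = "n = 5000", xlab = "Height (cm)")
par(mfrow = c(1, 1))
</syntaxhighlight>

With only five values the histogram rarely resembles a bell curve, while the large sample usually does.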


Why some distributions are skewed

Data can be skewed. These graphs show you how distributions can differ according to mode, median and mean of the displayed data.

The most common reason for a deviation from the normal distribution is us. We changed the planet and ourselves, creating effects that may change everything, up to the normal distribution. Take weight. Today the human population shows a very complex pattern in terms of weight distribution across the globe, and there are many reasons why the weight distribution does not follow a normal distribution. There is no such thing as a normal weight, but studies from indigenous communities show a normal distribution in the weight found in their populations. Within our wider world, this is clearly different. Yet before we bash the Western diet, please remember that never before in the history of humans did we have a more steady stream of calories, which is not all bad.

Distributions can have different skews. A symmetrical distribution without skew is basically the normal distribution or bell curve that you can see in the picture. But distributions can also be skewed to the left or to the right, depending on how mode, median and mean differ. For the symmetrical normal distribution they are of course all the same, but for a right-skewed distribution the order is mode < median < mean, and for a left-skewed distribution it is typically the reverse (mean < median < mode).
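
The ordering of mode, median and mean in a right-skewed distribution can be illustrated with a handful of numbers. The series below is made up for this purpose; since R has no built-in function for the statistical mode, the sketch simply takes the most frequent value from a frequency table.

<syntaxhighlight lang="R">
# a small right-skewed series (made up for illustration)
x <- c(1, 2, 2, 2, 3, 3, 4, 5, 9, 14)

# most frequent value, taken from the frequency table
mode_x <- as.numeric(names(which.max(table(x))))

mode_x     # 2
median(x)  # 3
mean(x)    # 4.5
</syntaxhighlight>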


Detecting the normal distribution

This is a time series visualized through barplots.
This is the same data as a histogram.
And this is the data as a boxplot. You can see that the data is approximately normally distributed because the whiskers and the quartiles have nearly the same length.

But when is data normally distributed? And how can you recognize it when you have a boxplot in front of you? Or a histogram? The best way to learn it is to look at it. Always remember the ideal picture of the bell curve (you can see it above), especially if you look at histograms. If the histogram of your data shows a long tail to either side, or has multiple peaks, your data is not normally distributed. The same is the case if your boxplot's whiskers are largely uneven.

You can also use the Shapiro-Wilk test to check for normal distribution. If the test returns a non-significant result (p-value > 0.05), there is no evidence against normality, and we can assume a normal distribution.
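
As a minimal sketch, the test can be applied to the front-seat passenger data from the "Seatbelts" dataset that ships with R. The names front and result are introduced here for illustration, and 0.05 is the conventional threshold rather than a fixed rule.

<syntaxhighlight lang="R">
# front-seat passengers killed or seriously injured
front <- as.data.frame(Seatbelts)$front

# Shapiro-Wilk test for normality
result <- shapiro.test(front)
result$p.value

# TRUE means the test found no evidence against a normal distribution
result$p.value > 0.05
</syntaxhighlight>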

This barplot (on the left) represents the number of front-seat passengers that were killed or seriously injured each month from 1969 to 1984 in Great Britain. And here comes the magic trick: if you sort the monthly numbers from the lowest to the highest (and slightly lower the resolution), a normal distribution emerges (see the histogram).

If you would like to know how one can create the diagrams which you see here, this is the R code:

# If you want some general information about the "Seatbelts" dataset, which we will have a look at, you can use the ?-function.
# As "Seatbelts" is a dataset that ships with R, you can get a lot of information here. You can see all datasets available in R by typing data().

?Seatbelts
     
# to have a look at the dataset "Seatbelts" you can use several commands
  
## str() to know what data type "Seatbelts" is (e.g. a Time-Series, a matrix, a dataframe...)
str(Seatbelts)
        
## use show() or just type the name of the dataset ("Seatbelts") to see the table and all data it's containing
show(Seatbelts)
# or
Seatbelts
      
## summary() to have the most crucial information for each variable: minimum/maximum value, median, mean...
summary(Seatbelts)

     
# As you saw when you used the str() function, "Seatbelts" is a time series, which makes it harder to work with. We change it into a data frame with as.data.frame() and name the new data frame "seat", which is handier to work with.
  
seat<-as.data.frame(Seatbelts)
     
# To choose a single variable of the dataset, we use the '$' operator. If we want a barplot with all front-seat passengers
# who were killed or seriously injured:
     
barplot(seat$front)
     
# For a histogram:
     
hist(seat$front)
  
## To change the resolution of the histogram, you can use the "breaks" argument of hist(), which suggests
## into how many bins the data should be divided
     
hist(seat$front, breaks = 30)
hist(seat$front, breaks = 100)

# For a boxplot:
     
boxplot(seat$front)

The QQ-Plot

1. Growth of caterpillars in relation to tannin content in food

A Quantile-Quantile plot (QQ-plot) allows for a visual inspection of how your model residuals behave in relation to a normal distribution. In R, you get it by calling plot() on a fitted model or with qqnorm() and qqline(); the command qqplot() compares two samples against each other. On the y-axis there are the standardised residuals and on the x-axis the theoretical quantiles. The simple answer is: if your data points lie on the line, you are fine, you have normal errors, and you can stop reading here. If you want to know more about the theory behind this, please continue. Residuals are the differences between your response variable and the fitted values.

Residuals = response variable - fitted values

For a regression analysis this would be the difference of your data points to the regression line. The standardised residuals depend on the model function you are applying.

In the following example, the standardised residuals are the residuals divided by the standard deviation. Let's take the caterpillar data set as an example. On the right you can see the table with the data: growth of caterpillars in relation to tannin content of their diet. Below, we will discuss some correlation plots between these two factors.

2. Plotting the data in an x-y plot already gives you an idea that growth probably depends on the tannin content.
3. Plotted regression line of the regression model lm(growth~tannin) for testing the relation between two factors.
4. The qqplot for this model looks good. Here the points are mostly on the line, with point 4 and point 7 being slightly above and below the line. Still you would consider the residuals in this case to behave normally.
5. A gamma distribution, where the variance increases with the square of the mean.
6. A negative binomial distribution that is clearly not following a normal distribution. In other words, here the points are not on the line, and the visual inspection of this qqplot concludes that your residuals are not normally distributed.
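
The steps described in this section can be sketched in R as follows. The caterpillar data is not a built-in dataset, so the sketch uses the built-in dataset cars as a stand-in; this substitution and the names model, res and std_res are assumptions made for illustration. Note that rstandard() also accounts for the leverage of each point, so it is slightly more refined than dividing the residuals by a single standard deviation.

<syntaxhighlight lang="R">
# 'cars' ships with R and stands in for the caterpillar data
model <- lm(dist ~ speed, data = cars)

# residuals = response variable - fitted values
res <- cars$dist - fitted(model)

# this matches what R returns via residuals()
all.equal(unname(res), unname(residuals(model)))

# standardised residuals as computed by R for linear models
std_res <- rstandard(model)

# QQ-plot of the standardised residuals against the normal distribution
qqnorm(std_res)
qqline(std_res)
</syntaxhighlight>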

Non-normal distributions

Sometimes the world is not normally distributed. On closer examination, this makes perfect sense under the specific circumstances. It is therefore necessary to understand the reasons why data may not be normally distributed.

The Poisson distribution

This picture shows you several possible Poisson distributions. They differ according to lambda, the rate parameter.

Things that can be counted are often not normally distributed, but are instead skewed to the right. While this may seem curious, it actually makes a lot of sense. Take an example that coffee-drinkers may like. How many people do you think drink one or two cups of coffee per day? Quite many, I guess. How many drink 3-4 cups? Fewer people, I would say. Now how many drink 10 cups? Only a few, I hope. A similar and maybe more healthy example could be found in sports activities. How many people do 30 minutes of sport per day? Quite many, maybe. But how many do 5 hours? Only very few. For phenomena that can be counted, such as sports activities in minutes per day, most people will tend towards a lower number of minutes, and few towards a high number of minutes.

Now here comes the funny surprise. Transform data that follows a Poisson distribution with a logarithm, e.g. the decadic logarithm, and it will often approximately follow the normal distribution. Hence skewed data can often be transformed to match the normal distribution. While many people refrain from this, it actually may make sense in such examples as island biogeography. Discovered by MacArthur & Wilson, it is a prominent example of how the log of the number of species and the log of island size are closely related. While this is one of the fundamental basics of ecology, a statistician would have preferred the use of the Poisson distribution.

Example for a log transformation of a Poisson distribution

One example for skewed data can be found in the R data set “swiss”. It contains data about socio-economic indicators of 47 French-speaking provinces in Switzerland around 1888. The variable we would like to look at is “Education”, which shows how many men in the army (in %) have an education level beyond primary school. As you can see when you look at the first diagram, in about 30 provinces at most 10 percent of the men received education beyond primary school.

To obtain a normal distribution (which is useful for many statistical tests), we can use the natural logarithm.

If you would like to know how to conduct an analysis like the one shown in the figures, we provide the code right below:

# we will work with the swiss dataset.
# to obtain a histogram of the variable Education, you type

hist(swiss$Education)

# you transform the data series with the natural logarithm by the use of log()

log_edu<-log(swiss$Education)
hist(log_edu)

# to check whether the data is normally distributed, you can use the Shapiro-Wilk test

shapiro.test(log_edu)

# if the p-value is higher than 0.05, there is no evidence against a normal distribution of log_edu

The Pareto distribution

The Pareto distribution can also be applied when we are looking at how wealth is spread across the world.

Did you know that most people wear 20 % of their clothes 80 % of the time? This observation can be described by the Pareto distribution. For many phenomena that describe proportions within a given population, you often find that few account for a lot, and many account for little. Unfortunately this is often the case for workloads, and we shall hope to change this. For such proportions the Pareto distribution is quite relevant. Consequently, it is rooted in income statistics. Many people have a small to average income, and few people have a large income. This makes this distribution so important for economics, and also for sustainability science.
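
To get a feeling for such a distribution, one can simulate Pareto-distributed incomes. Base R has no Pareto sampler, so the sketch below uses inverse transform sampling; the shape alpha, the minimum x_min, the sample size and the seed are illustrative assumptions, not estimates from real income data.

<syntaxhighlight lang="R">
set.seed(1)  # arbitrary seed, only used to make the example reproducible

# inverse transform sampling: x = x_min / u^(1/alpha) with u uniform on (0, 1)
alpha <- 1.16  # shape parameter (assumption)
x_min <- 1     # smallest possible income (assumption)
income <- x_min / runif(10000)^(1 / alpha)

# share of the total income held by the top 20 % of the simulated population
top20 <- sort(income, decreasing = TRUE)[1:2000]
share_top20 <- sum(top20) / sum(income)
share_top20

# the distribution is so skewed that it is easier to inspect on a log scale
hist(log10(income), main = "Simulated Pareto incomes", xlab = "log10(income)")
</syntaxhighlight>

Because the distribution has a very heavy tail, the simulated share varies noticeably from run to run.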


Visualizing data: Boxplots

A nice way to visualize a data set is to draw a boxplot. You get a rough overview of how the data is distributed, and moreover you can tell at a glance whether it is roughly normally distributed. The same is true for histograms, but we will focus on the boxplot for now. For more information on both these forms of data visualisation, please refer to the entry on Barplots, Histograms and Boxplots.


What are the components of a boxplot and what do they represent?


The median marks the exact middle of your data, which is something different than the mean. If you imagine a series of random numbers, e.g. 3, 5, 7, 12, 26, 34, 40, the median would be 12. But what if your data series comprises an even number of numbers, like 1, 6, 19, 25, 26, 55? You take the mean of the numbers in the middle, which is 22 and hence 22 is your median.
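
Both examples can be checked with the built-in functions median() and mean(). This is a short sketch using the two series from the text:

<syntaxhighlight lang="R">
# odd number of values: the median is the value in the middle
median(c(3, 5, 7, 12, 26, 34, 40))  # 12

# even number of values: the median is the mean of the two middle values
median(c(1, 6, 19, 25, 26, 55))     # 22

# the mean of the first series is clearly different from its median
mean(c(3, 5, 7, 12, 26, 34, 40))    # about 18.14
</syntaxhighlight>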

The box of the boxplot is bounded by the lower and the upper quartile. Each quarter contains, as the name says, a quarter of the data points. To define the quartiles, you split the data set into two halves at the median and again calculate the median of each half. In a random series of numbers (6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95) your median is 43, your lower quartile is 14 and your upper quartile is 87.

The space between the lower quartile line and the upper quartile line (the box) is called the interquartile range (IQR), which is needed to define the length of the whiskers. The data points which are not in the range of the whiskers are called outliers, which could e.g. be a hint that they are due to measurement errors. To define the end of the upper whisker, you take the value of the upper quartile and add the product of 1.5 * IQR.

The boxplot for the series of data: 6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95


Sticking to our previous example: The IQR is the range between the lower (14) and the upper quartile (87), therefore 73. Multiply 73 by 1.5 and add it to the value of the upper quartile: 87 + 109.5 = 196.5

For the lower whisker, the procedure is nearly the same. Again, you use the product of 1.5 * IQR, but this time you subtract this value from the lower quartile. This gives the limit of your lower whisker: 14 - 109.5 = -95.5

And as there are no values outside of the range of our whiskers, we have no outliers. Furthermore, the whiskers do not extend to the limits which we calculated above, but instead mark the most extreme data points within these limits.
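
The manual calculation above can be retraced step by step in R. This is a sketch of the approach described in the text (quartiles as medians of the two halves), not of the interpolation that R's own quantile functions use by default; the variable names are introduced here for illustration.

<syntaxhighlight lang="R">
x <- c(6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95)

# quartiles as medians of the two halves, leaving out the overall median
med   <- median(x)           # 43
lower <- median(x[x < med])  # 14
upper <- median(x[x > med])  # 87

# interquartile range and whisker limits
iqr <- upper - lower              # 73
upper_limit <- upper + 1.5 * iqr  # 196.5
lower_limit <- lower - 1.5 * iqr  # -95.5

# values beyond the limits would be drawn as outliers; here there are none
x[x < lower_limit | x > upper_limit]
</syntaxhighlight>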

#boxplot for our random series of numbers 6, 7, 14, 15, 21, 43, 76, 81, 87, 89, 95

boxplot.example<-c(6,7,14,15,21,43,76,81,87,89,95)
summary(boxplot.example)

# minimum = 6
# maximum = 95
# mean = 48.55
# median = 43
# 1Q = 14.5
# 3Q = 84
# don't worry about the difference between our calculated quartile values above and the values calculated by R. By default, R interpolates between neighbouring data points when computing quartiles, whereas the approach we introduced above uses the medians of the two halves. Both are valid conventions.

# with this information we can calculate the interquartile range
IQR(boxplot.example)
# IQR = 69.5

# lastly we can visualize our boxplot using this command
boxplot(boxplot.example)


If you want to learn more about Boxplots, check out the entry on Histograms and Boxplots. Histograms are also very useful when attempting to detect the type of distribution in your data.

For more on data visualisation, check out the Introduction to statistical figures.

More forms of data distribution

Of course, there are more types of data distribution. We found this great overview by Aswath Damodaran, which helps you investigate the type of distribution in your data.

A guide to detecting the right distribution. Source: Aswath Damodaran

External links

Videos

Data Distribution: A crash course

The normal distribution: An explanation

Skewness: A quick explanation

The Poisson distribution: A mathematical explanation

The Pareto Distribution: Some real life examples

The Boxplot: A quick example

Probability: An Introduction

Bayes theorem: A detailed explanation


Articles

Probability Distributions: 6 common distributions you should know

Distributions: A list of Statistical Distributions

Normal Distribution: The History

The Normal Distribution: Detailed Explanation

The Normal Distributions: Real Life Examples

The Normal Distribution: A word on sample size

The weight of nations: How body weight is distributed across the world

Non normal distributions: A list

Reasons for non normal distributions: An explanation

Different distributions: An overview by Aswath Damodaran, S.61

The Poisson Distribution: The history

The Poisson Process: A very detailed explanation with real life examples

The Pareto Distribution: An explanation

The pareto principle and wealth inequality: An example from the US

History of Probability: An Overview

Frequentist vs. Bayesian Approaches in Statistics: A comparison

Bayesian Statistics: An example from the wizarding world

Probability and the Normal Distribution: A detailed presentation

Compare your income: A tool by the OECD


The author of this entry is Henrik von Wehrden.