Difference between revisions of "Histograms and Boxplots"

From Sustainability Methods
Line 39: Line 39:
  
 
==Boxplots==
 
==Boxplots==
Boxplots are used to illustrate the scatter in the data and good at showing the distribution of data points around the median for several groups of sets of data. A boxplot displays the [https://en.wikipedia.org/wiki/Five-number_summary five number summary] — [https://en.wikipedia.org/wiki/Sample_maximum_and_minimum minimum], [https://mathworld.wolfram.com/Quartile.html first quartile], [https://www.investopedia.com/terms/m/median.asp median], [https://mathworld.wolfram.com/Quartile.html third quartile], and [https://en.wikipedia.org/wiki/Sample_maximum_and_minimum maximum] — of a set of data. Another great thing about boxplots is that they display outliers, value points more than 1.5 times the range above the 75th percentile and more than 1.5 times the range below the 25th percentile. Hence, one would already argue that a boxplot illustrates more details than a histogram. To understand boxplots better, let’s compare boxplots and histograms visually using an example from [https://r4ds.had.co.nz/index.html R for Data Science] book, [https://r4ds.had.co.nz/exploratory-data-analysis.html#cat-cont chapter 7.51]:
+
The box plot is a diagram suited for showing the distribution of continuous, univariate data, by visualizing the following six characteristics of a dataset: minimum, first quartile, median, third quartile, maximum and outliers. The compact and space-saving design of box plots allows the comparison of more than one dataset, which is why they have an edge over other diagrams for continuous data, like histograms. A good example for comparative visualisation of both, boxplots and histograms, can be found in the [https://r4ds.had.co.nz/index.html R for Data Science] book, [https://r4ds.had.co.nz/exploratory-data-analysis.html#cat-cont chapter 7.51]:
 
[[File:Screenshot 2021-03-29 at 12.42.17.png|700px|frameless|center]]
 
[[File:Screenshot 2021-03-29 at 12.42.17.png|700px|frameless|center]]
 +
 +
=== Components of a boxplot ===
 +
Prerequisite: The values of the dataset should be sorted in ascending order.
 +
 +
* '''Minimum:'''
 +
Lowest value of the dataset (outliers excluded)
 +
* '''First quartile:'''
 +
Value which seperates the lower 25% from the upper 75% of the values of the dataset
 +
* '''Median:'''
 +
Middle value of the dataset
 +
* '''Third quartile:'''
 +
Value which seperates the upper 25% from the lower 75% of the values of the dataset
 +
* '''Interquartile range:'''
 +
Range between first and third quartile
 +
* '''Maximum:'''
 +
Highest value of the dataset (outliers excluded)
 +
* '''Whiskers:'''
 +
Lines ranging from minimum to first quartile and from third quartile to maximum
 +
* '''Outliers:'''
 +
Abnormal values represented by circles beyond the whiskers
 +
 +
=== How to plot a boxplot in R? ===
 +
==== Single box plot ====
 +
[[File:meanofozoneonrooseveltislandfrommaytosep1973.png|250px|thumb|right|Fig.3]]
 +
<syntaxhighlight lang="R" line>
 +
#Fig.3
 +
boxplot(airquality$Ozone,
 +
        main = "Mean of Ozone on Roosevelt Island from May to Sep 1973 ",
 +
        xlab="Ozone",
 +
        ylab="Parts per Billion",
 +
        boxwex = 0.5,    # defines width of box
 +
        las = 1,          # flips labels on y-axis into horizontal position
 +
        col="red",        # defines colour of box
 +
        border = "black"  # turns frame and median of box to black
 +
        )
 +
</syntaxhighlight>
 +
 +
By default, box plots are plotted vertically. It can be flipped into a horizontal position, by passing the argument '''horizontal''' and setting it to '''TRUE'''. Furthermore, the box can be equipped with a '''notch''', by passing the argument notch and setting it to '''TRUE'''.
 +
 +
==== Multiple box plots with interaction ====
 +
[[File:weightsofchicksgivendifferentdietsfor6weeks.png|250px|thumb|right|Fig.4]]
 +
<syntaxhighlight lang="R" line>
 +
#Fig.4
 +
boxplot(chickwts$weight ~ chickwts$feed,   
 +
        las = 1,
 +
        main = "Weights of chicks given different diets for 6 weeks",
 +
        xlab = "Feed",
 +
        ylab = "Weight in grams",
 +
        col = "red",
 +
        border = "black"
 +
        )
 +
#creates interaction between weight and feed column of dataset
 +
</syntaxhighlight>
 +
 +
 +
==== Sorting multiple box plots ====
 +
Multiple box plots can be sorted according to their characteristics. The following box plot shows the plot from above, sorted in ascending order by the median.
 +
[[File:sortedweightsofchicksgivendifferentdietsfor6weeks.png|250px|thumb|right|Fig.5]]
 +
<syntaxhighlight lang="R" line>
 +
#Fig.5
 +
par(mfrow = c(1, 1))   
 +
# sets the plot window to a one by one matrix
 +
medians <- reorder(chickwts$feed, chickwts$weight, median)
 +
# sorts columns feed and weight according to the median 
 +
boxplot(chickwts$weight ~ medians, las = 1, main = "Weights of chicks given different diets for 6 weeks", xlab = "Feed")
 +
# plots interaction of weight and medians   
 +
</syntaxhighlight>
  
 
====Boxplot with notches====
 
====Boxplot with notches====
Line 50: Line 117:
 
where ‘IQR’ is the interquartile range, and ‘n’ is the replication per sample. Notches are based on assumptions of asymptotic normality of the median and roughly equal sample sizes for two medians being compared and are said to be rather insensitive to the underlying distribution of the samples. The idea is to give roughly a 95% confidence interval for the difference if two medians. ([https://www.wiley.com/en-us/The+R+Book%2C+2nd+Edition-p-9781118448960 The R Book, p213])
 
where ‘IQR’ is the interquartile range, and ‘n’ is the replication per sample. Notches are based on assumptions of asymptotic normality of the median and roughly equal sample sizes for two medians being compared and are said to be rather insensitive to the underlying distribution of the samples. The idea is to give roughly a 95% confidence interval for the difference if two medians. ([https://www.wiley.com/en-us/The+R+Book%2C+2nd+Edition-p-9781118448960 The R Book, p213])
  
[[File:boxplotswithnwithoutnotches.png|400px|thumb|left|Fig.3]]
+
[[File:boxplotswithnwithoutnotches.png|400px|thumb|left|Fig.6]]
 
Here are the beaver2 data that was used earlier for plotting histograms:
 
Here are the beaver2 data that was used earlier for plotting histograms:
  
 
<syntaxhighlight lang="R" line>
 
<syntaxhighlight lang="R" line>
#Fig.3
+
#Fig.6
 
  par(mfrow = c(1,2))
 
  par(mfrow = c(1,2))
 
   boxplot(beaver2$temp[1:38],beaver2$temp[39:100], names = c("in rest", "in move"), ylab = "Body temperature", main = "Boxplot without notches")
 
   boxplot(beaver2$temp[1:38],beaver2$temp[39:100], names = c("in rest", "in move"), ylab = "Body temperature", main = "Boxplot without notches")
Line 61: Line 128:
  
 
Because the boxes do not overlap, it can be assumed that the difference in the median of the two data samples will be highly significant. However, the same cannot be said for the dataset plotted in the boxplots below.
 
Because the boxes do not overlap, it can be assumed that the difference in the median of the two data samples will be highly significant. However, the same cannot be said for the dataset plotted in the boxplots below.
[[File:Boxplotquartiles.png|400px|thumb|left|Fig.4]]
+
[[File:Boxplotquartiles.png|400px|thumb|left|Fig.7]]
 
Below example JohnsonJohnson dataset in R contains quarterly earnings (dollars) per Johnson & Johnson shares share between 1960 and 1980.
 
Below example JohnsonJohnson dataset in R contains quarterly earnings (dollars) per Johnson & Johnson shares share between 1960 and 1980.
  
 
<syntaxhighlight lang="R" line>
 
<syntaxhighlight lang="R" line>
#Fig.4
+
#Fig.7
 
JnJ <- as.data.frame(type.convert(.preformat.ts(JohnsonJohnson)))
 
JnJ <- as.data.frame(type.convert(.preformat.ts(JohnsonJohnson)))
 
boxplot(JnJ,notch = T)
 
boxplot(JnJ,notch = T)
Line 73: Line 140:
 
'''The odd-looking boxplots'''
 
'''The odd-looking boxplots'''
 
The JohnsonJohnson boxplot also illustrates the notches’ curious behaviour in the boxes of the Qtr1 and Qtr2. The notches extended above 75th percentile and/or below the 25th percentile when the sample sizes are small and/or the within-sample variance is high. The notches act this way to warn of the likely invalidity of the test ([https://www.wiley.com/en-us/The+R+Book%2C+2nd+Edition-p-9781118448960 The R Book, p214]).
 
The JohnsonJohnson boxplot also illustrates the notches’ curious behaviour in the boxes of the Qtr1 and Qtr2. The notches extended above 75th percentile and/or below the 25th percentile when the sample sizes are small and/or the within-sample variance is high. The notches act this way to warn of the likely invalidity of the test ([https://www.wiley.com/en-us/The+R+Book%2C+2nd+Edition-p-9781118448960 The R Book, p214]).
 +
<br>
 +
<br>
 +
<br>
 +
=== Interpretation ===
 +
* '''Symmetry:'''
 +
If a dataset is symmetric, the median is approximately in the middle of the box.
 +
 +
* '''Skewness:'''
 +
If a dataset is skewed, the median is closer to one end of the box than to the other. If the median is closer to the lower (or left) end of the box, the data is considered to be right-skewed. If the median is closer to the upper (or right) end of the box, the data is considered to be left-skewed.
 +
There are several ways to exactly measure the skewness of data, such as the Bowley-Galton skewness or Pearson's coefficient of skewness. Unfortunately, depending on the characteristics of the dataset, different skewness measures might give different or even contradictory results.
 +
A hands-on approach for R is included in the further readings below.
 +
 +
* '''Compactness of data sections:'''
 +
If one side of the box is larger than the other, it means that the values in this section are spread over a wider range. If it is smaller than the other one, the values in this section are more densely distributed.
 +
 +
*'''Distance and number of outliers:'''
 +
The number of outliers and their distance from the rest of the data can be easily read from the diagram.
 +
 +
* '''Significance of difference between two datasets:'''
 +
In case of a normal distribution, the significance of the difference between the data of two box plots can be roughly estimated: If the box of one box plot is higher or lower than the median of another box plot, the difference is likely to be significant. Further investigation is recommended.
  
 
== Conclusion ==
 
== Conclusion ==

Revision as of 17:44, 21 April 2021

Note: This entry revolves specifically around Histograms and Boxplots. For more general information on quantitative data visualisation, please refer to Introduction to statistical figures. For more info on Data distributions, please refer to the entry on Data distribution.


In short: Both histograms and boxplots are commonly used in (descriptive) statistics to visualise the distribution of numerical data. A histogram shows the distribution of a single numerical variable, whereas boxplots are typically used with two variables: the numerical response variable on the Y axis and the categorical explanatory variable on the X axis.

Fig.1

Histograms

A histogram is a graphical display of data using bars (also called buckets or bins) of different height, where each bar groups numbers into ranges. Histograms reveal a lot of useful information about numerical data with a single explanatory variable. Histograms are used for getting a sense about the distribution of data, its median, and skewness.

#Fig.1
 par(mfrow = c(2,1))
  hist(beaver2$temp[1:38], main = "Body temperature of a beaver (in rest)", xlab = "Body Temperature in Celsius", breaks = 5)
  hist(beaver2$temp[39:100], main = "Body temperature of a beaver (in move)", xlab = "Body Temperature in Celsius", breaks = 5)

The two histograms are plotted from the “beaver2” dataset and illustrate how a beaver’s body temperature changes when it starts moving. Both histograms roughly resemble the bell-shaped curve of a normal distribution. We can see the beaver’s body temperature rise from approximately 37 degrees to approximately 38 degrees Celsius.
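
The shift between the two histograms can also be checked numerically. The following is a minimal sketch that assumes the same split as in the plotting code (rows 1 to 38 at rest, rows 39 to 100 on the move); the variable names are chosen here for illustration only.

```r
# Split the temperatures the same way as in the histograms:
# rows 1-38 are treated as "at rest", rows 39-100 as "on the move".
temp_rest <- beaver2$temp[1:38]
temp_move <- beaver2$temp[39:100]

# Centre of each distribution
group_means   <- c(rest = mean(temp_rest),   move = mean(temp_move))
group_medians <- c(rest = median(temp_rest), move = median(temp_move))

print(round(group_means, 2))
print(round(group_medians, 2))
```

If the two histograms are read correctly, the mean and median of the second group should be noticeably higher than those of the first.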


Identifying and interpreting histograms

Histograms vs. bar charts: Histograms are different from bar charts, and one should not confuse them. A histogram does not have gaps between the bars, but a bar chart does. Histograms have the response variable on the X-axis, divided into bins, and the Y-axis shows the frequency (or the probability density) of each bin. In contrast, a bar chart shows separate categories on one axis and a value for each category (for example a count or a mean) on the other.

Fig.2

Patterns: Histograms display well how data is distributed. For instance, a symmetric, unimodal pattern in a histogram suggests a normal distribution. Likewise, right-skewed and left-skewed patterns in histograms display the skewness of data, that is, the asymmetry of the distribution of data around the mean. See “Histogram: Study the Shape” to learn more about histogram patterns.

#Fig.2
 hist(beaver2$temp, main = "Bimodal Histogram: Body temperature of a beaver (in rest & in move)", xlab = "Body Temperature in Celsius", breaks = 12)

If the whole beaver2 dataset is plotted in one histogram, it shows a bimodal pattern, as the sample points cluster around two different means: the temperature of the beaver at rest and on the move.

Number and width of the bars (bins): Histograms can become confusing depending on how the bin boundaries are placed. As The R Book (p231) puts it: “Wide bins produce one picture, narrow bins produce a different picture, unequal bins produce confusion.” The choice of the number and width of bins can heavily influence a histogram’s appearance, just as the choice of bandwidth can heavily influence the appearance of a kernel density estimate. Therefore, it is suggested that all bins have the same width and that the number of bins is selected carefully to best display the pattern in the data.
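
To see how much the bin width matters, the same data can be binned twice. This is only a sketch: the bin widths of 0.5 and 0.1 degrees and the variable names are arbitrary choices made for illustration, not recommendations.

```r
x <- beaver2$temp

# Equal-width breaks spanning the whole data range (whole degrees on both ends)
lower <- floor(min(x))
upper <- ceiling(max(x))
wide_breaks   <- seq(lower, upper, by = 0.5)   # few, wide bins
narrow_breaks <- seq(lower, upper, by = 0.1)   # many, narrow bins

# plot = FALSE returns the bin counts without drawing anything
wide   <- hist(x, breaks = wide_breaks,   plot = FALSE)
narrow <- hist(x, breaks = narrow_breaks, plot = FALSE)

print(wide$counts)
print(narrow$counts)

# Draw both versions one above the other for comparison
par(mfrow = c(2, 1))
hist(x, breaks = wide_breaks,   main = "Bin width 0.5", xlab = "Body Temperature in Celsius")
hist(x, breaks = narrow_breaks, main = "Bin width 0.1", xlab = "Body Temperature in Celsius")
```

Both versions contain exactly the same observations; only the grouping changes, which is why the two pictures can look quite different.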


Boxplots

The box plot is a diagram suited for showing the distribution of continuous, univariate data by visualizing the following six characteristics of a dataset: minimum, first quartile, median, third quartile, maximum and outliers. The compact and space-saving design of box plots allows the comparison of more than one dataset, which is why they have an edge over other diagrams for continuous data, like histograms. A good example of a comparative visualisation of both boxplots and histograms can be found in the R for Data Science book, chapter 7.5.1:

Screenshot 2021-03-29 at 12.42.17.png

Components of a boxplot

Prerequisite: The values of the dataset should be sorted in ascending order.

  • Minimum:

Lowest value of the dataset (outliers excluded)

  • First quartile:

Value which separates the lower 25% from the upper 75% of the values of the dataset

  • Median:

Middle value of the dataset

  • Third quartile:

Value which separates the upper 25% from the lower 75% of the values of the dataset

  • Interquartile range:

Range between first and third quartile

  • Maximum:

Highest value of the dataset (outliers excluded)

  • Whiskers:

Lines ranging from minimum to first quartile and from third quartile to maximum

  • Outliers:

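Unusually extreme values, represented by circles beyond the whiskers. By default, R's boxplot() treats values that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile as outliers.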
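
The components listed above can be computed by hand. The following sketch uses the airquality ozone data and assumes the default outlier rule of boxplot() (1.5 times the interquartile range); the variable names are chosen here for illustration. Note that boxplot() works with "hinges", which can differ slightly from the quartiles returned by quantile().

```r
x <- airquality$Ozone[!is.na(airquality$Ozone)]  # drop missing values

q1  <- unname(quantile(x, 0.25))   # first quartile
med <- median(x)                   # median
q3  <- unname(quantile(x, 0.75))   # third quartile
iqr <- q3 - q1                     # interquartile range

# Fences: values beyond them are drawn as outliers (1.5 is boxplot()'s default)
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]

# Whisker ends: the most extreme values that are not outliers
whisker_min <- min(x[x >= lower_fence])
whisker_max <- max(x[x <= upper_fence])

print(c(whisker_min, q1, med, q3, whisker_max))
print(outliers)

# boxplot.stats() returns the values that boxplot() actually draws
bs <- boxplot.stats(x)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # outliers
```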

How to plot a boxplot in R?

Single box plot

Fig.3
#Fig.3
boxplot(airquality$Ozone,
        main = "Mean of Ozone on Roosevelt Island from May to Sep 1973",
        xlab="Ozone",
        ylab="Parts per Billion",
        boxwex = 0.5,     # defines width of box
        las = 1,          # flips labels on y-axis into horizontal position
        col="red",        # defines colour of box
        border = "black"  # turns frame and median of box to black
        )

By default, box plots are plotted vertically. They can be flipped into a horizontal position by passing the argument horizontal and setting it to TRUE. Furthermore, the box can be given a notch by passing the argument notch and setting it to TRUE.
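
As a minimal sketch, both arguments can be combined on the ozone data from Fig.3. The axis label and colours are simply carried over from the example above; bp is a name chosen here for the returned statistics.

```r
# Same data as in Fig.3, drawn horizontally and with a notch
bp <- boxplot(airquality$Ozone,
              horizontal = TRUE,   # box runs from left to right
              notch = TRUE,        # draws a 'waist' around the median
              xlab = "Parts per Billion",
              col = "red",
              border = "black")

# boxplot() invisibly returns the statistics it used for drawing
print(bp$stats)
```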

Multiple box plots with interaction

Fig.4
#Fig.4
boxplot(chickwts$weight ~ chickwts$feed,     
        las = 1, 
        main = "Weights of chicks given different diets for 6 weeks", 
        xlab = "Feed", 
        ylab = "Weight in grams",
        col = "red",
        border = "black"
        )
# the formula weight ~ feed draws one box of weights for each type of feed


Sorting multiple box plots

Multiple box plots can be sorted according to their characteristics. The following box plot shows the plot from above, sorted in ascending order by the median.

Fig.5
#Fig.5
par(mfrow = c(1, 1))     
# sets the plot window to a one by one matrix
medians <- reorder(chickwts$feed, chickwts$weight, median)
# reorders the levels of feed by the median weight within each feed type
boxplot(chickwts$weight ~ medians, las = 1, main = "Weights of chicks given different diets for 6 weeks", xlab = "Feed")
# plots weight against the reordered feed factor
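
To see what reorder() does in the sorted plot, the level order can be inspected directly. This is a small sketch with variable names chosen for illustration; the descending variant via the negated weights is an addition that is not part of the example above.

```r
# Median weight per feed type, in the factor's original (alphabetical) order
med_by_feed <- tapply(chickwts$weight, chickwts$feed, median)
print(med_by_feed)

# reorder() leaves the data untouched and only changes the order of the levels
feed_sorted <- reorder(chickwts$feed, chickwts$weight, median)
print(levels(chickwts$feed))
print(levels(feed_sorted))

# Negating the weights sorts the boxes in descending order of the median instead
feed_desc <- reorder(chickwts$feed, -chickwts$weight, median)
print(levels(feed_desc))
```

Because boxplot() draws the boxes in the order of the factor levels, changing the level order is enough to sort the plot.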

Boxplot with notches

Simple boxplots are not very useful for illustrating whether the median values of different groups or sets of data are significantly different. Therefore, we use boxplots with notches to show the distribution of data points around the median and to see whether or not the median values differ between the groups. The notches are drawn as a ‘waist’ on either side of the median and are intended to give a rough impression of the significance of the differences between two medians. Boxes in which the notches do not overlap are likely to prove to have significantly different medians under an appropriate statistical test (The R Book, p213).

The size of the notch increases with the magnitude of the interquartile range and declines with the square root of replication:

Notchformula1.png

where ‘IQR’ is the interquartile range, and ‘n’ is the replication per sample, i.e. the number of observations in each sample. Notches are based on assumptions of asymptotic normality of the median and roughly equal sample sizes for the two medians being compared, and are said to be rather insensitive to the underlying distribution of the samples. The idea is to give roughly a 95% confidence interval for the difference between two medians. (The R Book, p213)
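
The notch limits can be reproduced by hand. The sketch below assumes the formula given in the R documentation of boxplot.stats(), where the notches extend to median ± 1.58 × IQR / √n, with the IQR measured between the hinges of the box; the variable names are chosen for illustration.

```r
x <- beaver2$temp[1:38]            # beaver at rest

bs  <- boxplot.stats(x)
med <- bs$stats[3]                 # median
iqr <- bs$stats[4] - bs$stats[2]   # distance between the hinges (height of the box)
n   <- bs$n                        # number of non-missing observations

# Notch limits computed by hand
notch_manual <- med + c(-1.58, 1.58) * iqr / sqrt(n)

print(notch_manual)
print(bs$conf)                     # notch limits as computed by R
```

A larger IQR widens the notch, while a larger sample narrows it, which matches the description above.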

Fig.6

Here is the beaver2 data that was used earlier for plotting histograms:

#Fig.6
 par(mfrow = c(1,2))
  boxplot(beaver2$temp[1:38],beaver2$temp[39:100], names = c("in rest", "in move"), ylab = "Body temperature", main = "Boxplot without notches")
  boxplot(beaver2$temp[1:38],beaver2$temp[39:100], names = c("in rest", "in move"), ylab = "Body temperature", notch = TRUE, main = "Boxplot with notches")

Because the boxes do not overlap, it can be assumed that the difference between the medians of the two samples will be highly significant. However, the same cannot be said for the dataset plotted in the boxplots below.

Fig.7

The JohnsonJohnson dataset in R, used in the example below, contains the quarterly earnings (in dollars) per Johnson & Johnson share between 1960 and 1980.

#Fig.7
JnJ <- as.data.frame(type.convert(.preformat.ts(JohnsonJohnson)))
boxplot(JnJ, notch = TRUE)

Warning message in bxp(list(stats = structure(c(0.61, 1.16, 2.79, 6.93, 14.04, 0.63, : “some notches went outside hinges ('box'): maybe set notch=FALSE”

The odd-looking boxplots: The JohnsonJohnson boxplot also illustrates the notches’ curious behaviour in the boxes of Qtr1 and Qtr2. The notches extend above the 75th percentile and/or below the 25th percentile when the sample sizes are small and/or the within-sample variance is high. The notches act this way to warn of the likely invalidity of the test (The R Book, p214).
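
Whether a notch leaves its box can be checked without plotting. The following sketch groups the earnings by quarter with cycle() instead of building the JnJ data frame, which is an alternative chosen here for brevity; earnings and notch_outside are illustrative names.

```r
# Group the quarterly earnings by quarter (1-4); cycle() gives each value's quarter
earnings <- split(as.numeric(JohnsonJohnson), as.integer(cycle(JohnsonJohnson)))

# For each quarter: does a notch limit fall outside the box (the hinges)?
notch_outside <- sapply(earnings, function(x) {
  bs <- boxplot.stats(x)
  bs$conf[1] < bs$stats[2] || bs$conf[2] > bs$stats[4]
})

print(sapply(earnings, length))  # observations per quarter
print(notch_outside)
```

Quarters flagged TRUE are the ones for which boxplot() raises the "some notches went outside hinges" warning.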


Interpretation

  • Symmetry:

If a dataset is symmetric, the median is approximately in the middle of the box.

  • Skewness:

If a dataset is skewed, the median is closer to one end of the box than to the other. If the median is closer to the lower (or left) end of the box, the data is considered to be right-skewed. If the median is closer to the upper (or right) end of the box, the data is considered to be left-skewed. There are several ways to exactly measure the skewness of data, such as the Bowley-Galton skewness or Pearson's coefficient of skewness. Unfortunately, depending on the characteristics of the dataset, different skewness measures might give different or even contradictory results. A hands-on approach for R is included in the further readings below.

  • Compactness of data sections:

If one side of the box is larger than the other, it means that the values in this section are spread over a wider range. If it is smaller than the other one, the values in this section are more densely distributed.

  • Distance and number of outliers:

The number of outliers and their distance from the rest of the data can be easily read from the diagram.

  • Significance of difference between two datasets:

In case of a normal distribution, the significance of the difference between the data of two box plots can be roughly estimated: If the box of one box plot is higher or lower than the median of another box plot, the difference is likely to be significant. Further investigation is recommended.
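
The Bowley-Galton skewness mentioned under "Skewness" can be sketched in a few lines. It is computed from the quartiles as (Q3 + Q1 − 2 × median) / (Q3 − Q1), so it measures how far the median sits from the middle of the box. The function name bowley_skewness is a helper defined here for illustration, not a base R function.

```r
# Bowley-Galton (quartile) skewness: ranges from -1 to 1; 0 means the median
# sits exactly in the middle of the box, positive values indicate right skew.
bowley_skewness <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE))
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}

print(bowley_skewness(c(1, 2, 3, 4, 5)))    # symmetric data
print(bowley_skewness(c(1, 2, 4, 8, 16)))   # long tail towards large values
print(bowley_skewness(airquality$Ozone))    # ozone data from Fig.3
```

As noted above, other skewness measures may give different results for the same data, so this value should be read as one indication among several.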

Conclusion

We can conclude that both histograms and boxplots are great tools for displaying how our values are distributed. As stated in Introduction to statistical figures, boxplots, compared to histograms, show the variance of continuous data across different factor levels and are a solid graphical representation of the Analysis of Variance. One could see box plots as less informative than a histogram or kernel density estimate, but they take up less space and are better at comparing distributions between several groups or sets of data.


Further links & reading material

Practice more

If you do not have R installed or if you cannot run it on your computer, you can run R code online with Snippets and follow the examples in the resources below.

Cookbook for R

  1. Histogram and density plot in base R
  2. Boxplots in base R
  3. Plotting distributions with ggplot2

The author of this entry is Ilkin Bakhtiarov.