Difference between revisions of "Introduction to statistical figures"

From Sustainability Methods
 
'''In short:''' This entry introduces you to the most relevant forms of [[Glossary|data]] visualisation, and links to dedicated entries on specific visualisation forms with R examples.
 
== Basic forms of data visualisation ==
 
'''This section introduces you to the most relevant forms of data visualisation.''' For R examples, please check the section below.
__TOC__
 
 
 
 
The simplest way to represent count data is the '''barplot'''. Barplots are somewhat simplistic if they contain only one level of information, such as three groups and their abundance, and become more informative if they contain two levels of information, as in stacked barplots. The values can be shown as either absolute numbers or proportions, which may make a dramatic difference for the analysis or interpretation.

'''Correlation plots''' ('xyplots') are the next staple in statistical graphics and are most often the graphical representation of a correlation. Often a regression is added as well, to show effect strength and variance; fitting a [[Regression Analysis|regression]] line is frequently the most important visual aid to showcase the trend. Point size or color can add a further level of information, which makes this a really powerful tool, although one needs to keep a keen eye on the relation between correlation and causality. Such plots may also show fluctuations in data over time, revealing trends within the data as well as harmonic patterns.

'''Boxplots''' are the last in what I would call the trinity of statistical figures. They show the variance of continuous data across different factor levels. While histograms reveal more details and information, boxplots are a solid graphical representation of the Analysis of Variance. A rule of thumb is that if one box lies entirely above or below the median (the black line) of the other box, the difference may be significant.

[[File:Xyplot.png|250px|thumb|left|'''A Correlation plot.''' The line shows the regression, the dots are the data points.]]
[[File:Boxplot3.png|250px|thumb|right|'''Boxplots.''']]
[[File:2Barplots.png|420px|thumb|center|'''Barplots.''' The left diagram shows absolute, the right one relative Barplots.]]

[[File:Histogram structure.png|300px|thumb|right|'''A Histogram.''']]
A '''histogram''' is a graphical display of data using bars (also called buckets or bins) of different heights, where each bar groups numbers into a range. Histograms can reveal a lot of useful information about numerical data with a single explanatory variable, and are used to get a sense of the distribution of the data, its median, and its skewness.
  
Simple '''pie charts''' are not really ideal, as they camouflage the real proportions of the data they show. '''Venn diagrams''' are a simple way to compare 2-4 groups and their overlaps, allowing for multiple hits. Larger sets of connections can be represented by a '''bipartite plot''' if the levels fall within two groups. If multiple interconnections exist, a '''structural equation model''' representation is valuable for more deductive approaches, while more inductive approaches can be shown by '''circular network plots''' (also known as [[Chord Diagram|chord diagrams]]).
[[File:Introduction to Statistical Figures - Venn Diagram example.png|200px|thumb|left|'''A Venn Diagram showing the number of articles in a systematic review that revolve around one or more of three topics.''' Source: Partelow et al. 2018. A Sustainability Agenda for Tropical Marine Science.]]
[[File:Introduction to Statistical Figures - Bipartite Plot example.png|300px|thumb|right|'''A bipartite plot showing the affiliation of publication authors and the region where a study was conducted.''' Source: Brandt et al. 2013. A review of transdisciplinary research in sustainability science.]]
[[File:Introduction to Statistical Figures - Structural Equation Model.png|400px|thumb|center|'''A piecewise structural equation model quantifying hypothesized relationships between economic and technological power, military strength, biophysical reserves and net imports of resources as well as trade in value added per exported resource item in global trade in 2015.''' Source: Dorninger et al. 2021. Global patterns of ecologically unequal exchange: Implications for sustainability in the 21st century.]]
  
  
 
Multivariate data can principally be shown in three ways: '''ordination plots''', '''cluster diagrams''' or '''network plots'''. Ordination plots encapsulate such diverse approaches as decorana plots, principal component analysis plots, or results from non-metric multidimensional scaling. Typically, the two most important axes are shown, and additional information can be added post hoc. While these plots show continuous patterns, cluster dendrograms show the grouping of data and are often helpful to reveal hierarchical structures. Network plots show diverse interactions between different parts of the data. While they can have an underlying statistical analysis embedded, such network plots are often graphical representations rather than statistical tests.

[[File:Introduction to Statistical Figures - Ordination example.png|450px|thumb|left|'''An Ordination plot (Principal Component Analysis) in which analyzed villages (colored abbreviations) in Transylvania are located according to their natural capital assets alongside two main axes, explaining 50% and 18% of the variance.''' Source: Hanspach et al 2014. A holistic approach to studying social-ecological systems and its application to southern Transylvania.]]
[[File:Introduction to Statistical Figures - Circular Network Plots.png|530px|thumb|center|'''A circular network plot showing how sub-topics of social-ecological processes were represented in articles assessed in a systematic review. The proportion of the circle represents a topic's importance in the research, and the connections show if topics were covered alongside each other.''' Source: Partelow et al. 2018. A sustainability agenda for tropical marine science.]]
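
The ordination and cluster plots described above can be sketched in base R. This is a minimal illustration only: it assumes that the built-in <syntaxhighlight lang="R" inline>mtcars</syntaxhighlight> dataset stands in for real multivariate data, and it uses a principal component analysis as one of several possible ordination methods.

<syntaxhighlight lang="R">
# Assumption: the built-in mtcars dataset stands in for real multivariate data
pca <- prcomp(mtcars, scale. = TRUE)       # principal component analysis on scaled variables
explained <- pca$sdev^2 / sum(pca$sdev^2)  # share of variance explained per axis

# Ordination plot: the first two principal components
plot(pca$x[, 1], pca$x[, 2],
     xlab = paste0("PC1 (", round(100 * explained[1]), "%)"),
     ylab = paste0("PC2 (", round(100 * explained[2]), "%)"),
     main = "Ordination plot (PCA) of mtcars")
text(pca$x[, 1], pca$x[, 2], labels = rownames(mtcars), cex = 0.6, pos = 3)

# Cluster dendrogram: hierarchical clustering on the same scaled data
clusters <- hclust(dist(scale(mtcars)))
plot(clusters, main = "Cluster dendrogram of mtcars", xlab = "", sub = "")
</syntaxhighlight>

The axis labels report the share of variance that each of the two axes explains, which is the kind of information given in the caption of the ordination example above.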
  
 
'''Descriptive Infographics''' can be a fantastic way to summarise general information. A lot of information can be packed into one figure: basically all single-variable information that is either proportional or absolute can be presented like this. It can be tricky if the number of categories is very high, in which case a miscellaneous category could be added to part of the infographic. Infographics are a fine [[Glossary|art]], since the balance of information and aesthetics demands a high level of experience, a clear understanding of the data, and knowledge of the deeper design of graphical representation.

 
== How to visualize data in R ==
 
'''The following overview includes all forms of data visualisation that we consider important.''' <br>
Based on your data, have a look at which forms of visualisation might be relevant for you. Hover over an individual visualisation type to see its name, together with a quick example of what this kind of visualisation might be helpful for. '''By clicking, you will be redirected to a dedicated entry with exemplary R code.'''<br>
Tip: If you are unsure whether you have qualitative or quantitative data, have a look at the entry on [[Data formats]]. Keep in mind: categorical (qualitative) data that is counted in order to visualise each category's occurrence is not quantitative (= numeric) data. It is still qualitative data that has merely been transformed into count data. So the visualisations on the left do indeed display some kind of quantitative information, but the underlying data was always qualitative.

<imagemap>Image:Statistical Figures Overview 27.05.png|1050px|frameless|center|
circle 120 201 61 [[Big problems for later|Factor analysis]]
circle 312 201 61 [[Venn Diagram|Venn Diagram, e.g. variables TREE SPECIES IN EUROPE, TREE SPECIES IN ASIA, TREE SPECIES IN AMERICA as three colors, with joint species in the overlaps]]
circle 516 190 67 [[Venn Diagram|Venn Diagram, e.g. variables TREE SPECIES IN EUROPE, TREE SPECIES IN ASIA as two colors, with joint species in the overlaps]]
circle 718 178 67 [[Stacked Barplots|Stacked Barplot, e.g. count data of different species (colors) for the variable TREES]]
circle 891 179 67 [[Barplots, Histograms and Boxplots#Barplots|Barplot, e.g. different kinds of trees (x) as count data (y) for the variable TREES]]
circle 1318 184 67 [[Barplots, Histograms and Boxplots#Histograms|Histogram, e.g. the variable EXAM POINTS as count data (y) per interval (x)]]
circle 1510 187 67 [[Correlation_Plots#Line_chart|Line Chart, e.g. TIME (x) and BITCOIN VALUE (y)]]
circle 1689 222 67 [[Bubble Plots|Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), POPULATION (bubble size)]]
circle 1896 238 67 [[Big problems for later|Ordination, e.g. numeric variables (AGE, INCOME, HEIGHT) are transformed into Principal Components (x & y) along which data points are arranged and explained]]
circle 202 326 67 [[Treemap|Treemap, e.g. FORESTS (colors) and count data of the included species (rectangles)]]
circle 410 323 67 [[Stacked Barplots|Simple Stacked Barplot, e.g. different species of trees (absolute count data per color) for the variables TREES in ASIA, TREES IN AMERICA, TREES IN AFRICA, TREES IN EUROPE (x)]]
circle 608 295 67 [[Stacked Barplots|Proportions Stacked Barplot, e.g. relative count data (y) of different species (colors) for the variables TREES in ASIA & TREES IN EUROPE (x)]]
circle 812 277 67 [[Pie Charts|Pie Chart, e.g. different kinds of trees (relative count data per color) for the variable TREE SPECIES]]
circle 1015 308 67 [[Barplots, Histograms and Boxplots#Boxplots|Boxplot, e.g. TREE HEIGHT (y) for beeches]]
circle 1228 287 67 [[Kernel density plot|Kernel Density Plot, e.g. count data (y) of EXAM POINTS per point (x)]]
circle 1422 294 67 [[Correlation_Plots#Scatter_Plot|Scatter Plot, e.g. RUNNER ENERGY LEVEL (y) per KILOMETERS (x)]]
circle 1574 379 67 [[Big problems for later|Heatmap with lines]]
circle 1788 401 67 [[Correlation_Plots#Correlogram|Correlogram, e.g. the CORRELATION COEFFICIENT (shade) for each pair of the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER, WEIGHT]]
circle 297 441 67 [[Wordcloud|Wordcloud]]
circle 516 434 67 [[Big problems for later|Spider Plot, e.g. relative count data of different species (shape) for the variables TREES IN EUROPE (green), TREES IN ASIA (blue), TREES IN AMERICA (red)]]
circle 710 402 67 [[Stacked Barplots|Simple Stacked Barplot, e.g. absolute count data (y) of different species (colors) for the variables TREES in ASIA & TREES IN EUROPE (x)]]
circle 1323 410 67 [[Regression Analysis#Simple linear regression in R|Linear Regression Plot, e.g. INCOME (y) per AGE (x)]]
circle 392 558 67 [[Chord Diagram|Chord Diagram, e.g. count data of FAVORITE SNACKS (colors) with the connections connecting shared favorites]]
circle 621 521 67 [[Sankey Diagrams|Sankey Diagram, e.g. count data of FAVORITE SNACKS (colors) with the connections connecting shared favorites (if connections are valued: 3 variables)]]
circle 853 496 67 [[Barplots, Histograms and Boxplots#Boxplots|Boxplot, e.g. different TREE SPECIES (x) and TREE HEIGHT (y)]]
circle 1014 521 67 [[Stacked Area Plot|Stacked Area Plot, e.g. INCOME (x) and count data (y) of BOUGHT ITEMS (colors) (if y is EXPENSES: three variables)]]
circle 1174 502 67 [[Kernel density plot|Kernel Density Plot, e.g. count data (y) of EXAM POINTS IN MATHS (blue) and EXAM POINTS in HISTORY (green) per point (x)]]
circle 1438 521 67 [[Big problems for later|Multiple Regression, e.g. INCOME (y) per AGE (x) in different COUNTRIES]]
circle 1657 554 67 [[Clustering Methods|Cluster Analysis, e.g. car data points are grouped by their similarity according to the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER, WEIGHT]]
circle 517 648 67 [[Sankey Diagrams|Sankey Diagram, e.g. count data of VOTER PREFERENCES (colors) with movements from Y1 to Y2 to Y3]]
circle 755 621 67 [[Stacked Barplots|Simple Stacked Barplot, e.g. absolute count data (y) of different species (colors) for the variables TREES in ASIA & TREES IN EUROPE (x), with numeric PHYLOGENETIC DIVERSITY (bar width)]]
circle 912 679 67 [[Barplots, Histograms and Boxplots#Boxplots|Boxplot, e.g. TREE SPECIES (x), TREE HEIGHT (y), COUNTRIES (colors)]]
circle 1095 693 67 [[Bubble Plots|Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), COUNTRY (bubble color)]]
circle 1267 645 67 [[Heatmap|Heatmap, e.g. TREE SPECIES (x) with FERTILIZER BRAND (y) and HEIGHT (colors)]]
circle 1509 696 67 [[Big problems for later|Network Plot, e.g. calculated connection strength (line width) between actors (nodes) based on LOCAL PROXIMITY, RATE OF INTERACTION, AGE, CASH FLOWS (nodes may be categorical)]]
circle 622 812 67 [[Clustering Methods|Cluster Analysis, e.g. car data points are grouped by their similarity according to the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER and categorical BRAND]]
circle 733 759 67 [[Bubble Plots|Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), COUNTRY (bubble color), POPULATION SIZE (bubble size)]]
circle 782 875 67 [[Big problems for later|Factor analysis]]
circle 1271 825 67 [[Big problems for later|Structural Equation Plot]]
circle 1394 771 67 [[Big problems for later|Ordination, e.g. numeric and categorical variables (AGE, INCOME, HEIGHT, PROFESSION) are transformed into Principal Components (x & y) along which data points are arranged and explained]]
</imagemap>

'''The following section provides some examples and R code for simple forms of data visualisation.''' For more info on data formats, please refer to [[Data formats]].

=== Barplot ===
'''Description'''<br/>
A barplot (also known as 'bar chart' or 'column chart') displays quantitative values for different categories. The chart consists of line marks (bars), not rectangular areas, whose size (length or height) represents the quantitative value for each category.

[[File:Bar chart structure.png|This figure shows the structure of a bar chart.]]

'''R Code'''<br/>
 
We will plot a basic bar chart based on a dataset built into R called <syntaxhighlight lang="R" inline>mtcars</syntaxhighlight>. The data set contains data on the specifications of different cars. One such specification is the number of gears a given car's transmission has.

We will first create a summary table that contains the number of cars for a given count of gears. Then, we will use that table to create the plot.

The table that contains information about the frequency of cars for a given number of gears looks like this:

{| class="wikitable"
|-
! gears !! freq
|-
| 3 || 15
|-
| 4 || 12
|-
| 5 || 5
|}

Here, the values in the <syntaxhighlight lang="R" inline>gears</syntaxhighlight> column are categories, and the values in the <syntaxhighlight lang="R" inline>freq</syntaxhighlight> column are numeric.
 
 
<syntaxhighlight lang="R" line>
# get the data: count the cars for each number of gears
gears <- table(mtcars$gear)

# Plot a basic bar chart with a title and labels
barplot(gears,
        main = "Frequency of Vehicles of each Gear Type",  # title of the plot
        xlab = "Number of Gears", ylab = "Number of Cars")  # labels of the plot
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Bar Chart.png]]
 
 
For more on Barplots, please refer to the entry on [[Stacked Barplots]].
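
As noted in the overview, a barplot can carry a second level of information and can show either absolute numbers or proportions. The following is a minimal sketch of both variants. It assumes, purely for illustration, that the number of gears within each cylinder class of <syntaxhighlight lang="R" inline>mtcars</syntaxhighlight> is the second level of interest.

<syntaxhighlight lang="R">
# Assumption: gears within cylinder classes serve as the two levels of information
counts <- table(mtcars$gear, mtcars$cyl)   # rows: gears, columns: cylinders
shares <- prop.table(counts, margin = 2)   # proportions within each cylinder class

par(mfrow = c(1, 2))                       # two plots side by side
barplot(counts, legend.text = TRUE,        # stacked bars with absolute counts
        main = "Absolute numbers",
        xlab = "Number of Cylinders", ylab = "Number of Cars")
barplot(shares,                            # stacked bars with proportions
        main = "Proportions",
        xlab = "Number of Cylinders", ylab = "Share of Cars")
par(mfrow = c(1, 1))                       # reset the plotting layout
</syntaxhighlight>

In the proportional version every bar has the same height, so differences in group size disappear and only the composition of each group remains visible.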
 
 
 
=== Scatter Plot ===
 
''' Description '''<br/>
 
Scatter plots are useful for showing the relationship between two variables, because they encode data simultaneously on a horizontal x-axis and a vertical y-axis, which lets you see whether a relationship exists and of what kind.

You can create scatter plots if you have a pair of continuous (numeric) variables.
 
 
 
''' R Code '''<br/>
 
First, we will create a basic scatter plot based on a dataset that comes built into R, called <syntaxhighlight lang="R" inline>trees</syntaxhighlight>. This data set contains data on the girth, height and volume of different trees.

The <syntaxhighlight lang="R" inline>trees</syntaxhighlight> data frame looks like this:

{| class="wikitable"
|-
! Girth !! Height !! Volume
|-
| 8.3 || 70 || 10.3
|-
| 8.6 || 65 || 10.3
|-
| 8.8 || 63 || 10.2
|-
| ... || ... || ...
|}

Here, all the columns are numeric, so no further data transformation is necessary.
 
 
<syntaxhighlight lang="R" line>
# Plot a basic scatter plot
plot(x = trees$Girth, y = trees$Height)
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Basic Scatter Plot.png.png|This is a basic scatter plot made using R.]]
 
 
Now, we can take this plot and customize it by changing the shape and color of the points, and by adding a title and x- and y-axis labels to the plot.
 
 
<syntaxhighlight lang="R">
# Create a scatter plot with labels and colors
plot(x = trees$Girth, y = trees$Height,  # choose the x- and y-values
     pch = 16,                           # choose how points look on the plot
     col = 'blue',                       # choose the color of the points
     main = 'Scatter Plot of Girth and Height of Trees',  # main header of the plot
     xlab = 'Tree girth', ylab = 'Tree height')           # x- and y-axis labels
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Better Scatter Plot.png|Minor customizations make the plot look more professional and understandable.]]
 
 
These minor customizations make the plot look more professional and understandable.
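
Since a fitted regression line is often the most important visual aid in such a plot, the scatter plot can be extended as follows. This is only a sketch: it assumes that a simple linear model of height on girth is a reasonable description of the <syntaxhighlight lang="R" inline>trees</syntaxhighlight> data, which would need to be checked before drawing any conclusions.

<syntaxhighlight lang="R">
# Fit a simple linear regression of height on girth
model <- lm(Height ~ Girth, data = trees)

plot(Height ~ Girth, data = trees, pch = 16, col = 'blue',
     main = 'Girth and Height of Trees with Regression Line',
     xlab = 'Tree girth', ylab = 'Tree height')
abline(model, col = 'red', lwd = 2)   # add the fitted regression line

# Strength of the linear association between the two variables
cor(trees$Girth, trees$Height)
</syntaxhighlight>

Keep in mind that the line only describes an association and says nothing about causality.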
 
 
=== Line chart ===
 
''' Description '''<br/>
 
A line chart shows how quantitative values for different categories have changed over time. It is typically structured around a temporal x-axis with equal intervals from the earliest to the latest point in time. Quantitative values are plotted as joined-up lines that connect consecutive points positioned along a y-axis. The slope of each line segment indicates the local trend between two points in time. Extended across the whole time frame, the segments form an overall line that represents the change over time for a single category.

Multiple categories can be displayed in the same view, each represented by a unique line. Sometimes a point (circle/dot) is added to make individual values more visible. The lines in a line chart are generally straight. However, curved line interpolation is sometimes used as a way of estimating values between known data points, which can help emphasise a general trend. This might slightly compromise the visual accuracy of discrete values, but the impact is smaller if the values are already approximations.
 
 
 
''' R Code '''<br/>
 
We will first plot a basic line chart based on a built-in dataset called <syntaxhighlight lang="R" inline>EuStockMarkets</syntaxhighlight>. The data set contains the closing prices of different European stock indices over the years 1991 to 1998.

To make things easier, we will first transform the built-in dataset into a data frame object. Then, we will use that data frame to create the plot.

The table that contains information about the different market indices looks like this:

{| class="wikitable"
|-
! DAX !! SMI !! CAC !! FTSE
|-
| 1628.75 || 1678.1 || 1772.8 || 2443.6
|-
| 1613.63 || 1688.5 || 1750.5 || 2460.2
|-
| 1606.51 || 1678.6 || 1718.0 || 2448.2
|-
| ... || ... || ... || ...
|}

Here, all the columns are numeric.

The following line chart shows how the <syntaxhighlight lang="R" inline>DAX</syntaxhighlight> index from the table above developed over time.
 
 
<syntaxhighlight lang="R" line>
# read the data as a data frame
eu_stocks <- as.data.frame(EuStockMarkets)

# Plot a basic line chart
plot(eu_stocks$DAX,  # simply select a stock index
     type = 'l')     # choose 'l' for line chart
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Simple line chart.png]]
 
 
As you can see, the plot is very simple. We can enhance the way this plot looks by making a few tweaks, making it more informative and aesthetically pleasing.
 
 
<syntaxhighlight lang="R">
# get the data
eu_stocks <- as.data.frame(EuStockMarkets)

# Plot a customized line chart
plot(eu_stocks$DAX,  # select the data
     type = 'l',     # choose 'l' for line chart
     col = 'blue',   # choose the color of the line
     lwd = 2,        # choose the line width
     main = 'Line Chart of DAX Index (1991-1998)',          # title of the plot
     xlab = 'Time (1991 to 1998)', ylab = 'Closing price')  # x- and y-axis labels
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Line chart.png]]
 
 
You can see that this plot looks much more informative and attractive.
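
The description above mentions that multiple categories can be shown in the same view, each as its own line. A minimal sketch of this, assuming that all four indices of <syntaxhighlight lang="R" inline>EuStockMarkets</syntaxhighlight> should share one panel, could look as follows. The colors are arbitrary choices made for this illustration.

<syntaxhighlight lang="R">
# Assumption: one color per index, chosen here only for illustration
index_colors <- c('blue', 'darkgreen', 'red', 'orange')

# EuStockMarkets is a multivariate time series, so the x-axis shows the actual years
plot(EuStockMarkets,
     plot.type = 'single',   # draw all series in one panel
     col = index_colors, lwd = 2,
     main = 'European Stock Indices (1991-1998)',
     xlab = 'Time', ylab = 'Closing price')
legend('topleft', legend = colnames(EuStockMarkets),
       col = index_colors, lwd = 2)
</syntaxhighlight>

Plotting the time series object directly, instead of a data frame column, keeps the time information, so the x-axis is labelled with years rather than with row numbers.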
 
 
 
=== ggplot ===
 
'''Description'''<br/>
 
'''R Code'''<br/>
 
COMING SOON
 
 
=== Boxplot ===
 
''' Description '''<br/>
 
Boxplots show data as a box on an x-y scale, with information on the distribution of the data as well as on outliers. For more information on boxplots, please refer to the entry on [[Histograms and Boxplots]].
 
 
''' R Code '''<br/>
 
[[File:meanofozoneonrooseveltislandfrommaytosep1973.png|250px|frameless|right]]
 
<syntaxhighlight lang="R" line>
boxplot(airquality$Ozone,
        main = "Mean of Ozone on Roosevelt Island from May to Sep 1973",
        xlab = "Ozone",
        ylab = "Parts per Billion",
        boxwex = 0.5,     # defines width of box
        las = 1,          # flips labels on y-axis into horizontal position
        col = "red",      # defines colour of box
        border = "black"  # turns frame and median of box to black
        )
</syntaxhighlight>
 
 
By default, boxplots are plotted vertically. They can be flipped into a horizontal position by passing the argument '''horizontal''' and setting it to '''TRUE'''. Furthermore, the box can be given a '''notch''' by passing the argument '''notch''' and setting it to '''TRUE'''.
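
The two arguments mentioned above can be tried out as follows. The second call is an additional sketch that shows one box per factor level. It assumes that the <syntaxhighlight lang="R" inline>Month</syntaxhighlight> column of <syntaxhighlight lang="R" inline>airquality</syntaxhighlight> is a sensible grouping variable for this purpose.

<syntaxhighlight lang="R">
# Horizontal, notched boxplot of the same ozone values
boxplot(airquality$Ozone,
        horizontal = TRUE,   # flip the box into a horizontal position
        notch = TRUE,        # add a notch around the median
        main = "Ozone on Roosevelt Island, May to Sep 1973",
        xlab = "Parts per Billion",
        col = "red")

# One box per month, using the formula interface (missing values are dropped)
by_month <- boxplot(Ozone ~ Month, data = airquality,
                    main = "Ozone per Month",
                    xlab = "Month", ylab = "Parts per Billion",
                    las = 1, col = "red")
</syntaxhighlight>

The grouped version makes it possible to compare the boxes of different months against each other's medians, following the rule of thumb described in the overview.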
 
 
 
===Histogram===
 
''' Description '''<br/>
 
A histogram is a graphical display of data using bars (also called buckets or bins) of different heights, where each bar groups numbers into a range. Histograms reveal a lot of useful information about numerical data with a single explanatory variable, and are used to get a sense of the distribution of the data, its median, and its skewness. For more on histograms, please refer to the entry on [[Histograms and Boxplots]].
 
 
[[File:Histogram structure.png|This is what a histogram looks like.]]
 
 
 
''' R Code '''<br/>
 
We will first plot a histogram based on a built-in dataset called <syntaxhighlight lang="R" inline>cars</syntaxhighlight>. This data set contains data on the speed and stopping distance of different cars.

The table that contains information about the stopping distance of different cars at a given speed looks like this:

{| class="wikitable"
|-
! speed !! dist
|-
| 4 || 2
|-
| 4 || 10
|-
| 7 || 4
|-
| 7 || 22
|-
| 8 || 16
|-
| 9 || 10
|-
| ... || ...
|}

Here, both the <syntaxhighlight lang="R" inline>speed</syntaxhighlight> and <syntaxhighlight lang="R" inline>dist</syntaxhighlight> columns are numeric. Therefore, we don't need to transform the data in any way in order to plot a histogram.

The first example uses the <syntaxhighlight lang="R" inline>speed</syntaxhighlight> variable.
 
 
<syntaxhighlight lang="R" line>
# first rows of the data that we are going to use
head(cars)

# Plot a basic histogram
hist(cars$speed,
     main = "Histogram for speed of cars",  # main title
     xlab = "Speed")                        # x-axis label
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Simple Histogram.png]]
 
 
We can now make this look better, this time using the <syntaxhighlight lang="R" inline>dist</syntaxhighlight> variable.
 
 
<syntaxhighlight lang="R">
# first rows of the data that we are going to use
head(cars)

# Plot a customized histogram
hist(cars$dist,
     breaks = 15,       # suggest the approximate number of bins in the histogram
     col = 'seagreen',  # define the color of the bars in the histogram
     main = "Histogram for stopping distance of cars",  # main title
     xlab = "Stopping Distance")                        # x-axis label
</syntaxhighlight>
 
 
'''Result in R'''
 
[[File:Better Histogram.png|This is a better looking histogram.]]
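
Since histograms are used to get a sense of the median and the skewness of the data, it can help to draw these directly into the plot. The following sketch is an illustration rather than a standard recipe. It assumes that a kernel density curve and vertical lines for the median and the mean are a useful addition for the <syntaxhighlight lang="R" inline>dist</syntaxhighlight> variable.

<syntaxhighlight lang="R">
# Histogram on the density scale, so that a density curve can be overlaid
hist(cars$dist,
     freq = FALSE,      # plot densities instead of counts
     breaks = 15, col = 'seagreen',
     main = "Stopping distance with median and density curve",
     xlab = "Stopping Distance")
lines(density(cars$dist), lwd = 2)                           # smoothed distribution
abline(v = median(cars$dist), col = 'red', lwd = 2)          # median (solid line)
abline(v = mean(cars$dist), col = 'blue', lwd = 2, lty = 2)  # mean (dashed line)

# Compare mean and median as a quick indication of skewness
mean(cars$dist) > median(cars$dist)
</syntaxhighlight>

If the mean lies clearly above the median, as the long tail on the right side of this histogram suggests, the distribution is skewed to the right.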
 
 
 
=== Venn diagrams ===
 
'''Description'''<br/>
 
Venn diagrams are a simple way to compare 2-4 groups and their overlaps, allowing for multiple hits. This kind of visualisation shows the logical relationships between data sets, each of which is represented by a circle. The overlaps indicate the elements that the data sets have in common.

''' R Code'''<br/>

Venn diagrams are most useful for visualising relationships between 2-3 data sets; with more groups, the diagram becomes difficult to read.

The VennDiagram package in R makes it possible to build Venn diagrams with its function <syntaxhighlight lang="R" inline>venn.diagram()</syntaxhighlight>.
 
[[File:Insects Vann diagram1.png|250px|frameless|right]]
 
<syntaxhighlight lang="R">
# First download and install the VennDiagram package (only needed once)
install.packages("VennDiagram")
library(VennDiagram)

# Let's generate three datasets. Each set represents a sample
# of identified insect species. The first sample includes 70 species,
# the second sample includes 63 species, and the third one includes
# 86 species.
sample1 <- paste0("species_", sample(1:100, 70, replace = FALSE))
sample2 <- paste0("species_", sample(1:100, 63, replace = FALSE))
sample3 <- paste0("species_", sample(1:100, 86, replace = FALSE))

# Now let's create a Venn diagram visualising these three data sets and the number of items
# they have in common. The file with your Venn diagram will be saved on your hard disk,
# so you can view and use it separately.
venn.diagram(
  x = list(sample1, sample2, sample3),
  category.names = c("Sample 1", "Sample 2", "Sample 3"),
  filename = "Insects_Vann_diagram1.PNG",
  fill = c("red", "green", "blue"),
  alpha = c(0.5, 0.5, 0.5),
  cex = 2,
  cat.fontface = 2,
  lwd = 2
)
</syntaxhighlight>
 
  
 
=== Other statistical figures ===
 
For further data types and visualisation exploration, we have found the website [https://www.data-to-viz.com/#connectedscatter "From data to Viz"] to be extremely helpful when choosing an appropriate data visualisation. You can select the type of data you have (numeric, categoric, or both), and click through the exemplified figures. There are also R code examples.

'''Further visualisation examples on this Wiki:'''
* [[Clustering Methods|Clustering methods]]
* [[Big problems for later|Ordination and Network Analysis]]
* [[Chord Diagram]]
* [[Kernel density plot]]

== Graphical etiquette – a rough guide how to make scientific quantitative figures ==
 
 
There is an almost uncountable number of scientific figures out there. While diversity is great, it can be overwhelming to decide which graphic to create from your data, and which other types of figures could be beneficial. In addition, there are some general norms and conventions to consider when making graphics. Let us start with these. While these are mere suggestions and reflect my own style and experience, they may serve as a basis for reflection for others.
  
 
'''1) Only show data in a figure that contains enough information to justify a figure'''<br>
Many barplots I have seen contain only two values. Could these not be fitted into a subordinate clause? Such a graphic is downright trivial, and wastes a lot of space. The same is true for a pie chart, only worse, because it can look like Pac-Man, and pie charts are generally misleading. Hence consider how many values you want to show, and whether these really justify the journal space.
  
  
'''2) One should not show plots that violate the statistics'''<br>
A common example is the barplot. There can be good reasons to show barplots with error bars; however, the majority of data is better represented by a boxplot. Boxplots show more values, and are sensible for data that contains a range of integers. A typical borderline case is the [[Likert Scale]], which is often shown in boxplots. While this is not entirely wrong, it is a bit strange, as the scale contains five values, and a boxplot is constructed of at least five parts. Another example is the standard [https://www.statisticshowto.com/lowess-smoothing/ loess line] fitted in a ggplot. If you show non-linear statistics, there should be a reason for it. Some of these trends show nothing at all, and follow no assumption besides adding some graphical interest. Scientific figures are, however, not only about aesthetics, but also about soundness and validity.
  
  
'''3) Avoid empty spaces in plots. After all, your graphics are no Zen garden'''<br>
Speaking of aesthetics, your plots should be balanced, and should not contain open spaces. Try to set the x and y axes in a way that avoids empty space. Where empty space remains, it could be ideal for a legend, or maybe some other additional information. Often people use weird short names in the legend, and then explain them in the figure caption. This should be avoided at all costs. Try to create legends that are self-explanatory. Make sure that the density of the information being shown is balanced, at least as long as the data and analysis allow for this. If figures contain a lot of open space, you can still try to make them smaller, and maybe even combine several of them in a panel plot. In certain graphics, such as clear-cut correlations, empty space is hard to avoid. This is a good indicator that maybe the figure is not needed at all, but can instead be replaced by some relevant values (e.g. p-value, R<sup>2</sup>) in the text.
  
  
'''4) Use diverse, but also non-confusing colours'''<br>
I am colourblind, so I am in the privileged position to judge the diversity of colours in a plot. Actually, no test ever picked up any colour blindness, but I definitely have some problem there. Hence I would say that colours should be diverse, and ideally brightness and hue should both help to differentiate groups. Figures become problematic if there are too many colours, and I set this threshold for myself at around 8. In addition, consider using colours that have proven their value when combined. I can endorse wesanderson as a really cool package that gives you a pleasing combination of colours.
  
  
'''5) Label orientation goes a long way'''<br>
Most label orientations in most plots are wrong. Try to bring your labels, if space permits, into a horizontal orientation. Quite often labels are then too densely packed on the x-axis, in which case you could consider setting them at a 45° angle. As much as a 90° label orientation can offer a little stretch for the neck, it is also a one-sided exercise. Hence consider flipping the labels whenever possible.
  
  
'''6) Compose the figure right to its intended size'''<br>
One of the most common mistakes in the creation of figures is the proportion of the different parts of the figure. Often the axis labels are really small, but the heading is massive. Ideally, when designing a figure, one should consider the size at which the figure will be printed. If you know that a figure is only 6 × 6 centimetres, you need to make the labels sufficiently large and sparse to be readable and orderly. A larger figure has different proportions. Hence design your figures with the right composition in terms of the different text sizes.
  
  
'''7) Use one font only'''<br>
OK, font types are borderline religion, but still I guess we can all agree that one should only use one font in a figure. If you want to make a typesetter cry you may use one font with serifs and another one without serifs, but otherwise do not do that. It can make sense to use a narrow font (e.g. Arial Narrow) if you do not have enough space in your figure.
  
  
'''8) Occam's razor applies to scientific figures, too'''<br>
A good figure is balanced, and ideally contains as much information as possible, but not more. You can however also decide to make figures that contain more information, and are borderline like mandalas. This is really cool if you have complex information to show, and maybe the result basically says that it is complex. Other figures may be graspable in a split second, which is very cool if you want people to understand something quickly. In between is the Occam's razor sweet spot, where you need a few seconds with the figure, but then you kind of get it. To this end, let it be known that this sweet spot is different for different people.
  

Latest revision as of 12:31, 13 April 2022

In short: This entry introduces you to the most relevant forms of data visualisation, and links to dedicated entries on specific visualisation forms with R examples.

Basic forms of data visualisation

The easiest way to represent count information is the barplot. Barplots are a bit oversimplistic if they contain only one level of information, such as three groups and their abundance, and can be more advanced if they contain two levels of information, as in stacked barplots. These can be shown as either absolute numbers or proportions, which may make a dramatic difference for the analysis or interpretation.
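To make the difference between absolute and relative barplots concrete, here is a minimal base R sketch. The counts and the species and site names are made up purely for illustration.

```r
# Hypothetical counts of three tree species at two sites (made-up numbers).
counts <- matrix(c(12,  5,  3,
                   30, 40, 30),
                 nrow = 3,
                 dimnames = list(species = c("beech", "oak", "pine"),
                                 site    = c("A", "B")))

# Column-wise proportions: within each site the shares sum to 1.
props <- prop.table(counts, margin = 2)

# Two stacked barplots side by side: absolute counts and proportions.
op <- par(mfrow = c(1, 2))
barplot(counts, main = "Absolute", ylab = "Count", legend.text = TRUE)
barplot(props,  main = "Relative", ylab = "Proportion")
par(op)
```

Site B dominates the absolute plot simply because more trees were counted there, while the relative plot shows that beech makes up a larger share at site A.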

Correlation plots ('xyplots') are the next staple in statistical graphics and most often the graphical representation of a correlation. Often a regression is also fitted to show effect strength and variance. Fitting a regression line is often the most important visual aid to showcase the trend. Another level of information can be added through point size or colour, making this a really powerful tool, where one needs to keep a keen eye on the relation between correlation and causality. Such plots may also serve to show fluctuations in data over time, showing trends within data as well as harmonic patterns.
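As a rough sketch of such a plot in base R, the following uses the built-in `cars` dataset (chosen here only for convenience) to draw the data points and add a fitted regression line.

```r
# Built-in 'cars' data: speed (mph) and stopping distance (ft).
fit <- lm(dist ~ speed, data = cars)

plot(dist ~ speed, data = cars, pch = 19,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(fit, lwd = 2)   # fitted regression line

# Values that are typically reported alongside such a figure.
r  <- cor(cars$speed, cars$dist)   # correlation coefficient
r2 <- summary(fit)$r.squared       # share of variance explained
```

For a simple linear regression with one predictor, `r2` equals the square of `r`.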

Boxplots are the last in what I would call the trinity of statistical figures. Showing the variance of continuous data across different factor levels is what these plots are made for. While histograms reveal more details and information, boxplots are a solid graphical representation of the Analysis of Variance. A rule of thumb is that if one box is entirely higher or lower than the median (the black line) of the other box, the difference may be significant.
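The rule of thumb can be checked by eye or, as in this minimal sketch, by reading the box statistics directly. The built-in `PlantGrowth` dataset is used only as an example, and the comparison is a visual heuristic, not a replacement for the actual test.

```r
# Boxplot of plant weight across the three treatment groups.
boxplot(weight ~ group, data = PlantGrowth,
        xlab = "Treatment", ylab = "Dried plant weight")

# The same statistics without drawing. Rows are: lower whisker,
# lower hinge, median, upper hinge, upper whisker.
stats <- boxplot(weight ~ group, data = PlantGrowth, plot = FALSE)$stats
colnames(stats) <- levels(PlantGrowth$group)

# Rule of thumb: does the whole box of trt2 lie above the median of trt1?
box_above_median <- stats[2, "trt2"] > stats[3, "trt1"]

# The formal counterpart is an Analysis of Variance.
anova_table <- summary(aov(weight ~ group, data = PlantGrowth))
```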

A Correlation plot. The line shows the regression, the dots are the data points.
Boxplots.
Barplots. The left diagram shows absolute, the right one relative Barplots.


A Histogram.

A histogram is a graphical display of data using bars of different height, where each bar (also called a bucket or bin) groups numbers into a range. Histograms can help reveal a lot of useful information about numerical data with a single explanatory variable. They are used for getting a sense of the distribution of data, its median, and skewness.
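A minimal sketch of a histogram in base R, using simulated right-skewed data. The data, the seed and the number of bins are arbitrary assumptions made for illustration.

```r
set.seed(1)
x <- rexp(200, rate = 1)   # simulated right-skewed data

# 'breaks' is only a suggestion; R picks "pretty" bin boundaries.
h <- hist(x, breaks = 20, main = "Histogram", xlab = "Value")
abline(v = median(x), lwd = 2, lty = 2)   # mark the median

# A simple moment-based skewness: positive means a longer right tail.
skew <- mean((x - mean(x))^3) / sd(x)^3
```

In right-skewed data like this the mean lies above the median, which is visible in the plot as a long tail to the right of the dashed line.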

Simple pie charts are not really ideal, as they camouflage the real proportions of the data they show. Venn diagrams are a simple way to compare 2-4 groups and their overlaps, allowing for multiple hits. Larger co-connections can either be represented by a bipartite plot, if the levels are within two groups, or, if multiple interconnections exist, then a structural equation model representation is valuable for more deductive approaches, while rather inductive approaches can be shown by circular network plots (aka Chord Diagram).

A Venn Diagram showing the number of articles in a systematic review that revolve around one or more of three topics. Source: Partelow et al. 2018. A Sustainability Agenda for Tropical Marine Science.
A bipartite plot showing the affiliation of publication authors and the region where a study was conducted. Source: Brandt et al. 2013. A review of transdisciplinary research in sustainability science.
A piecewise structural equation model quantifying hypothesized relationships between economic and technological power, military strength, biophysical reserves and net imports of resources as well as trade in value added per exported resource item in global trade in 2015. Source: Dorninger et al. 2021. Global patterns of ecologically unequal exchange: Implications for sustainability in the 21st century.


Multivariate data can principally be shown by three kinds of graphical representation: ordination plots, cluster diagrams or network plots. Ordination plots may encapsulate such diverse approaches as decorana plots, principal component analysis plots, or results from a non-metric multidimensional scaling. Typically, the two most important axes are shown, and additional information can be added post hoc. While these plots show continuous patterns, cluster dendrograms show the grouping of data. These plots are often helpful to show hierarchical structures in data. Network plots show diverse interactions between different parts of the data. While these can have underlying statistical analysis embedded, such network plots are often more graphical representations than statistical tests.
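As a rough illustration of the first two of these, the following base R sketch draws an ordination plot (a principal component analysis) and a cluster dendrogram for the built-in `USArrests` dataset. The dataset and the average-linkage clustering are assumptions chosen for brevity.

```r
# Ordination: principal component analysis on scaled variables.
pca <- prcomp(USArrests, scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)   # share of variance per axis

# Show the first two axes and label each point with its row name.
plot(pca$x[, 1], pca$x[, 2], type = "n",
     xlab = sprintf("PC1 (%.0f%% of variance)", 100 * explained[1]),
     ylab = sprintf("PC2 (%.0f%% of variance)", 100 * explained[2]))
text(pca$x[, 1], pca$x[, 2], labels = rownames(USArrests), cex = 0.6)

# Clustering: hierarchical clustering on the same scaled data.
hc <- hclust(dist(scale(USArrests)), method = "average")
plot(hc, cex = 0.6, main = "Cluster dendrogram", xlab = "", sub = "")
```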

An Ordination plot (Principal Component Analysis) in which analyzed villages (colored abbreviations) in Transylvania are located according to their natural capital assets alongside two main axes, explaining 50% and 18% of the variance. Source: Hanspach et al 2014. A holistic approach to studying social-ecological systems and its application to southern Transylvania.
A circular network plot showing how sub-topics of social-ecological processes were represented in articles assessed in a systematic review. The proportion of the circle represents a topic's importance in the research, and the connections show if topics were covered alongside each other. Source: Partelow et al. 2018. A sustainability agenda for tropical marine science.

Descriptive Infographics can be a fantastic way to summarise general information. A lot of information can be packed in one figure, basically all single variable information that is either proportional or absolute can be presented like this. It can be tricky if the number of categories is very high, which is when a miscellaneous category could be added to a part of an infographic. Infographics are a fine art, since the balance of information and aesthetics demands a high level of experience, a clear understanding of the data, and knowledge in the deeper design of graphical representation.


Of course, there is more. While the figures introduced above represent a vast share of the visual representations of data that you will encounter, there are different forms that have not yet been touched upon. We have found the website "From data to Viz" to be extremely helpful when choosing an appropriate data visualisation. You can select the type of data you have (numeric, categoric, or both), and click through the exemplified figures. There are also R code examples.


How to visualize data in R

The following overview includes all forms of data visualisation that we consider important.
Based on your data, have a look at which forms of visualisation might be relevant for you. Just hover over the individual visualisation type and it will show you its name. It will also show you a quick example of what this kind of visualisation might be helpful for. By clicking, you will be redirected to a dedicated entry with exemplary R code.

Tip: If you are unsure whether you have qualitative or quantitative data, have a look at the entry on Data formats. Keep in mind: categorical (qualitative) data that is counted in order to visualise each category's occurrence, is not quantitative (= numeric) data. It's still qualitative data that is just transformed into count data. So the visualisations on the left do indeed display some kind of quantitative information, but the underlying data was always qualitative.

Figure: Statistical Figures Overview 27.05.png. The visualisation types shown in the overview, with their examples, are:
* Factor analysis
* Venn Diagram, e.g. variables TREE SPECIES IN EUROPE, TREE SPECIES IN ASIA, TREE SPECIES IN AMERICA as three colors, with joint species in the overlaps
* Venn Diagram, e.g. variables TREE SPECIES IN EUROPE, TREE SPECIES IN ASIA as two colors, with joint species in the overlaps
* Stacked Barplot, e.g. count data of different species (colors) for the variable TREES
* Barplot, e.g. different kinds of trees (x) as count data (y) for the variable TREES
* Histogram, e.g. the variable EXAM POINTS as count data (y) per interval (x)
* Line Chart, e.g. TIME (x) and BITCOIN VALUE (y)
* Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), POPULATION (bubble size)
* Ordination, e.g. numeric variables (AGE, INCOME, HEIGHT) are transformed into Principal Components (x & y) along which data points are arranged and explained
* Treemap, e.g. FORESTS (colors) and count data of the included species (rectangles)
* Simple Stacked Barplot, e.g. different species of trees (absolute count data per color) for the variables TREES IN ASIA, TREES IN AMERICA, TREES IN AFRICA, TREES IN EUROPE (x)
* Proportions Stacked Barplot, e.g. relative count data (y) of different species (colors) for the variables TREES IN ASIA & TREES IN EUROPE (x)
* Pie Chart, e.g. different kinds of trees (relative count data per color) for the variable TREE SPECIES
* Boxplot, e.g. TREE HEIGHT (y) for beeches
* Kernel Density Plot, e.g. count data (y) of EXAM POINTS per point (x)
* Scatter Plot, e.g. RUNNER ENERGY LEVEL (y) per KILOMETERS (x)
* Heatmap with lines
* Correlogram, e.g. the CORRELATION COEFFICIENT (shade) for each pair of the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER, WEIGHT
* Wordcloud
* Spider Plot, e.g. relative count data of different species (shape) for the variables TREES IN EUROPE (green), TREES IN ASIA (blue), TREES IN AMERICA (red)
* Simple Stacked Barplot, e.g. absolute count data (y) of different species (colors) for the variables TREES IN ASIA & TREES IN EUROPE (x)
* Linear Regression Plot, e.g. INCOME (y) per AGE (x)
* Chord Diagram, e.g. count data of FAVORITE SNACKS (colors) with the connections connecting shared favorites
* Sankey Diagram, e.g. count data of FAVORITE SNACKS (colors) with the connections connecting shared favorites (if connections are valued: 3 variables)
* Boxplot, e.g. different TREE SPECIES (x) and TREE HEIGHT (y)
* Stacked Area Plot, e.g. INCOME (x) and count data (y) of BOUGHT ITEMS (colors) (if y is EXPENSES: three variables)
* Kernel Density Plot, e.g. count data (y) of EXAM POINTS IN MATHS (blue) and EXAM POINTS IN HISTORY (green) per point (x)
* Multiple Regression, e.g. INCOME (y) per AGE (x) in different COUNTRIES
* Cluster Analysis, e.g. car data points are grouped by their similarity according to the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER, WEIGHT
* Sankey Diagram, e.g. count data of VOTER PREFERENCES (colors) with movements from Y1 to Y2 to Y3
* Simple Stacked Barplot, e.g. absolute count data (y) of different species (colors) for the variables TREES IN ASIA & TREES IN EUROPE (x), with numeric PHYLOGENETIC DIVERSITY (bar width)
* Boxplot, e.g. TREE SPECIES (x), TREE HEIGHT (y), COUNTRIES (colors)
* Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), COUNTRY (bubble color)
* Heatmap, e.g. TREE SPECIES (x) with FERTILIZER BRAND (y) and HEIGHT (colors)
* Network Plot, e.g. calculated connection strength (line width) between actors (nodes) based on LOCAL PROXIMITY, RATE OF INTERACTION, AGE, CASH FLOWS (nodes may be categorical)
* Cluster Analysis, e.g. car data points are grouped by their similarity according to the numeric variables KILOMETERS PER LITER, CYLINDERS, HORSEPOWER and categorical BRAND
* Bubble Plot, e.g. GDP (x), LIFE EXPECTANCY (y), COUNTRY (bubble color), POPULATION SIZE (bubble size)
* Factor analysis
* Structural Equation Plot
* Ordination, e.g. numeric and categorical variables (AGE, INCOME, HEIGHT, PROFESSION) are transformed into Principal Components (x & y) along which data points are arranged and explained


Other statistical figures

For further data types and visualisation exploration, we have found the website "From data to Viz" to be extremely helpful when choosing an appropriate data visualisation. You can select the type of data you have (numeric, categoric, or both), and click through the exemplified figures. There are also R code examples.

Further visualisation examples on this Wiki:
* Clustering methods
* Ordination and Network Analysis
* Chord Diagram
* Kernel density plot

Graphical etiquette – a rough guide how to make scientific quantitative figures

There is an almost uncountable number of scientific figures out there. While diversity is great, it can often be overwhelming to decide which graphics to make from your data, and which other types of figures could be beneficial. In addition, there are some general norms and conventions to consider when making graphics. Let us start with these. While these are mere suggestions, and reflect my own style and experience, they may serve as a basis for reflection for others.

1) Only show data in a figure that contains enough information to justify a figure
Many barplots I have seen contain only two values. Could these not be fitted into a subordinate clause? Such a graphic is downright trivial, and wastes a lot of space. The same is true for a pie chart, only worse, because it can look like Pac-Man, and pie charts are generally misleading. Hence consider how many values you want to show, and whether these really justify the journal space.


2) One should not show plots that violate the statistics
A common example is the barplot. There can be good reasons to show barplots with error bars; however, the majority of data is better represented by a boxplot. Boxplots show more values, and are sensible for data that contains a range of integers. A typical borderline case is the Likert Scale, which is often shown in boxplots. While this is not entirely wrong, it is a bit strange, as the scale contains five values, and a boxplot is constructed of at least five parts. Another example is the standard loess line fitted in a ggplot. If you show non-linear statistics, there should be a reason for it. Some of these trends show nothing at all, and follow no assumption besides adding some graphical interest. Scientific figures are, however, not only about aesthetics, but also about soundness and validity.
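One way to check whether a non-linear smoother is justified is to draw it next to the plain linear fit and see whether it tells a different story. This is a minimal base R sketch, with the built-in `cars` dataset standing in for real data.

```r
fit_lin   <- lm(dist ~ speed, data = cars)      # linear model
fit_loess <- loess(dist ~ speed, data = cars)   # local (non-linear) smoother

# Predict both fits on a fine grid within the observed range of speed.
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed),
                               length.out = 100))
pred_lin   <- predict(fit_lin, grid)
pred_loess <- predict(fit_loess, grid)

plot(dist ~ speed, data = cars, pch = 19,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
lines(grid$speed, pred_lin,   lwd = 2, lty = 1)
lines(grid$speed, pred_loess, lwd = 2, lty = 2)
legend("topleft", legend = c("linear", "loess"),
       lty = 1:2, lwd = 2, bty = "n")
```

If the two lines barely differ, the simpler linear fit is usually the more defensible one to show.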


3) Avoid empty spaces in plots. After all, your graphics are no Zen garden
Speaking of aesthetics, your plots should be balanced, and should not contain open spaces. Try to set the x and y axes in a way that avoids empty space. Where empty space remains, it could be ideal for a legend, or maybe some other additional information. Often people use weird short names in the legend, and then explain them in the figure caption. This should be avoided at all costs. Try to create legends that are self-explanatory. Make sure that the density of the information being shown is balanced, at least as long as the data and analysis allow for this. If figures contain a lot of open space, you can still try to make them smaller, and maybe even combine several of them in a panel plot. In certain graphics, such as clear-cut correlations, empty space is hard to avoid. This is a good indicator that maybe the figure is not needed at all, but can instead be replaced by some relevant values (e.g. p-value, R²) in the text.


4) Use diverse, but also non-confusing colours
I am colourblind, so I am in the privileged position to judge the diversity of colours in a plot. Actually, no test ever picked up any colour blindness, but I definitely have some problem there. Hence I would say that colours should be diverse, and ideally brightness and hue should both help to differentiate groups. Figures become problematic if there are too many colours, and I set this threshold for myself at around 8. In addition, consider using colours that have proven their value when combined. I can endorse wesanderson as a really cool package that gives you a pleasing combination of colours.
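As a sketch that needs no extra package, the following uses the "Okabe-Ito" palette that ships with base R (version 4.0 or later), which is a swapped-in choice and not the package endorsed above. The wesanderson package offers `wes_palette()` for the same purpose. The luminance weights are a common approximation and serve only as a rough check of brightness differences.

```r
# Eight colours from a palette designed to remain distinguishable
# for readers with colour vision deficiencies.
cols <- palette.colors(n = 8, palette = "Okabe-Ito")

# Approximate brightness (0 = black, 255 = white) of each colour.
lum <- colSums(col2rgb(cols) * c(0.299, 0.587, 0.114))

# Show the palette, with the colour names as labels.
op <- par(mar = c(7, 1, 1, 1))
barplot(rep(1, length(cols)), col = cols, names.arg = names(cols),
        las = 2, axes = FALSE)
par(op)
```

Colours with clearly different values in `lum` remain distinguishable even when hue is hard to tell apart.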


5) Label orientation goes a long way
Most label orientations in most plots are wrong. Try to bring your labels, if space permits, into a horizontal orientation. Quite often labels are then too densely packed on the x-axis, in which case you could consider setting them at a 45° angle. As much as a 90° label orientation can offer a little stretch for the neck, it is also a one-sided exercise. Hence consider flipping the labels whenever possible.
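Both options can be sketched in base R as follows. The species counts are made up, and the margin values and label offset are guesses that will need adjusting for other data.

```r
# Hypothetical counts with long category labels (made-up numbers).
counts <- c("Fagus sylvatica" = 14, "Quercus robur" = 9,
            "Pinus sylvestris" = 6, "Picea abies" = 11)

# Option 1: flip the plot, so that long labels can stay horizontal.
op <- par(mar = c(4, 8, 1, 1))   # wider left margin for the labels
barplot(counts, horiz = TRUE, las = 1, xlab = "Count")
par(op)

# Option 2: keep vertical bars and rotate the labels by 45 degrees.
op <- par(mar = c(7, 4, 1, 1))   # taller bottom margin for the labels
mids <- barplot(counts, axisnames = FALSE, ylab = "Count")
text(x = mids, y = par("usr")[3] - 0.4, labels = names(counts),
     srt = 45, adj = 1, xpd = TRUE)
par(op)
```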


6) Compose the figure right to its intended size
One of the most common mistakes in the creation of figures is the proportion of the different parts of the figure. Often the axis labels are really small, but the heading is massive. Ideally, when designing a figure, one should consider the size at which the figure will be printed. If you know that a figure is only 6 × 6 centimetres, you need to make the labels sufficiently large and sparse to be readable and orderly. A larger figure has different proportions. Hence design your figures with the right composition in terms of the different text sizes.
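One way to compose a figure at its final size is to open the graphics device with exactly these dimensions, so that text sizes are judged at print scale. A minimal sketch for the 6 × 6 centimetre case follows. The output path, point size and margins are assumptions to adapt.

```r
out <- file.path(tempdir(), "figure_6x6cm.pdf")   # hypothetical output path
cm  <- 1 / 2.54                                   # inches per centimetre

# pdf() expects inches; 'pointsize' sets the base text size in points.
pdf(out, width = 6 * cm, height = 6 * cm, pointsize = 8)
par(mar = c(3, 3, 0.5, 0.5),   # tight margins, no room for a heading
    mgp = c(1.8, 0.6, 0))      # pull axis titles and labels closer
plot(dist ~ speed, data = cars, pch = 19, cex = 0.6,
     xlab = "Speed (mph)", ylab = "Distance (ft)")
dev.off()
```

Opening the resulting file at 100% zoom shows the labels as they will appear in print.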


7) Use one font only
OK, font types are borderline religion, but still I guess we can all agree that one should only use one font in a figure. If you want to make a typesetter cry you may use one font with serifs and another one without serifs, but otherwise do not do that. It can make sense to use a narrow font (e.g. Arial Narrow) if you do not have enough space in your figure.


8) Occam's razor applies to scientific figures, too
A good figure is balanced, and ideally contains as much information as possible, but not more. You can however also decide to make figures that contain more information, and are borderline like mandalas. This is really cool if you have complex information to show, and maybe the result basically says that it is complex. Other figures may be graspable in a split second, which is very cool if you want people to understand something quickly. In between is the Occam's razor sweet spot, where you need a few seconds with the figure, but then you kind of get it. To this end, let it be known that this sweet spot is different for different people.


The author of this entry is Henrik von Wehrden.