Heatmap

Note: This entry focuses specifically on heatmaps. For more general information on quantitative data visualisation, please refer to Introduction to statistical figures.

In short: A heatmap is a graphical representation of data in which the individual numerical values are replaced by colored cells. In other words, it is a table with colors in place of numbers. As in a regular table, each column of a heatmap is a feature and each row is an observation.

Why use a heatmap?

Heatmaps are useful for getting an overall understanding of the data. A table of numbers can be hard to read, whereas colors are much easier to perceive. A heatmap therefore shows at a glance which values are larger or smaller than others and how the values are distributed overall.

Color assignment and normalization of data

Colors in a heatmap are assigned as follows: all the values of the table are ranked from highest to lowest and then divided into bins, and each bin is assigned a particular color. For small datasets, however, colors might be assigned based on the values themselves rather than on bins. Usually, higher values get a more intense or darker color and lower values a paler or lighter one, depending on the chosen color palette.
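The binning idea can be illustrated with a few lines of R. This is only a minimal sketch, assuming equal-width bins and a reversed heat.colors() palette; the values, the number of bins and the variable names are made up for illustration, and heatmap() may differ in its internal details.

```r
# Toy values to color (made up for illustration)
values <- c(3, 18, 7, 42, 25, 11)

# One color per bin; rev() puts the palest color first,
# so that larger values end up with more intense colors
n_bins  <- 4
palette <- rev(heat.colors(n_bins))

# Split the range of the values into equal-width bins
# and record the bin number of every value
bin <- cut(values, breaks = n_bins, labels = FALSE, include.lowest = TRUE)

# Look up the color of each value through its bin
data.frame(value = values, bin = bin, color = palette[bin])
```

Values that fall into the same bin receive exactly the same color, even if the numbers behind them differ.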

It is important to remember that the features in a dataset are often measured on different scales, so normalization (scaling) of the data is usually required. The goal of normalization is to bring the values of numeric rows and/or columns in the dataset to a common scale, without distorting differences in the ranges of values.

Normalization also affects how colors can be compared. If the data are not normalized, any value can be compared with any other by color across the whole heatmap. If the data are normalized, the color is assigned based on the relative values within a row or column. Each value can then be compared only with the others in its own row or column, because the same color in a different row or column does not stand for the same value or the same bin.
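To make the effect of normalization concrete, here is a minimal sketch, assuming that normalization means centering each column on its mean and dividing by its standard deviation (z-scores). It uses the base R function scale() on the built-in mtcars data; the variable names m and scaled are chosen only for this example.

```r
data("mtcars")
m <- as.matrix(mtcars)

# Center each column on its mean and divide by its standard deviation
scaled <- scale(m)

# Raw values of two features for the first three cars: very different scales
m[1:3, c("mpg", "hp")]

# The same cells after scaling: both columns are now comparable,
# but the numbers only describe the position within each column
round(scaled[1:3, c("mpg", "hp")], 2)
```

After scaling, every column has mean 0 and standard deviation 1, so a given color describes a relative position within a column rather than an absolute value.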

R Code

To build the heatmap we will use the heatmap() function and the mtcars dataset. Note that heatmap() only accepts a numeric matrix as the data to plot. We therefore need to check that our dataset contains only numbers and then convert it into a matrix with the as.matrix() function.

data("mtcars")
matcars <- as.matrix(mtcars)
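The check mentioned above is not shown in the original code, so the following is one possible way to do it, using only base R functions. The variable name all_numeric is introduced here for illustration.

```r
data("mtcars")

# TRUE only if every column of the data frame is numeric
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

matcars <- as.matrix(mtcars)

# The result should be a numeric matrix: one row per car, one column per feature
is.matrix(matcars) && is.numeric(matcars)
dim(matcars)
```

If a dataset contains non-numeric columns, as.matrix() returns a character matrix, which heatmap() cannot plot, so such columns would have to be dropped or recoded first.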

For a better representation, we also prepare full names for the columns, which are passed to the heatmap as column labels. This step is not mandatory, but it makes the heatmap easier to understand.

fullcolnames <- c("Miles per Gallon", "Number of Cylinders",
                  "Displacement", "Horsepower", "Rear Axle Ratio",
                  "Weight", "1/4 Mile Time", "Engine", "Transmission",
                  "Number of Gears", "Number of Carburetors")

Now we use the transformed dataset (matcars) to create the heatmap. The other arguments are explained below.

Fig.1

#Fig.1
heatmap(matcars, Colv = NA, Rowv = NA, 
        scale = "column", labCol = fullcolnames, 
        margins = c(11,5))

How to interpret a heatmap?

With the default color palette the interpretation is usually the following: the darker the color, the higher the corresponding value, and vice versa. For example, let’s look at the feature “Number of Carburetors”. We can see that Maserati Bora has the darkest color, hence it has the largest number of carburetors, followed by Ferrari Dino, which has the second-largest number. Other models, such as Fiat X1-9 or the Toyota models, have the lightest colors, which means that they have the lowest numbers of carburetors. This interpretation can be applied to every other column.

Explanation of used arguments

Colv = NA and Rowv = NA are used to remove the dendrograms from the columns and rows (this also suppresses the reordering of columns and rows that comes with them). A dendrogram is a diagram that shows the hierarchical relationship between objects, and it is drawn alongside the heatmap by default if the argument is not specified. The main reason for removing it here is that it is a different method of data visualisation, which is not mandatory for the heatmap representation and requires a separate article to review it fully.
scale = "column" is used to normalize the columns of the matrix (to absorb the variation between columns). As stated previously, normalization is needed because of the way the colors are assigned. In our dataset, the values of the features “Horsepower” and “Displacement” are much larger than the rest. Without normalization, these two columns would therefore be marked as roughly equally high and all the other columns as equally low. Normalizing means that we keep the relative values in each column but not the actual numbers. For the interpretation this means, for example, that the same color for the features “Miles per Gallon” and “Number of Cylinders” of Mazda RX4 does not mean that the actual values are the same or approximately the same (placed in the same bin). It only means that the relative values of these cells within their respective columns are the same or fall into the same bin.
margins is used to fit the column and row names into the graph. We used it here because the full column names are longer and did not fit into the default margins.
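To see why scale = "column" matters for this dataset, the heatmap can be drawn once without scaling. The sketch below is only an illustration; the variable name col_max is introduced here and is not part of the original code.

```r
data("mtcars")
matcars <- as.matrix(mtcars)

# Largest value in each column, sorted: disp and hp dwarf the other features
col_max <- sort(apply(matcars, 2, max), decreasing = TRUE)
col_max

# Same heatmap as in Fig.1, but without scaling and with the short column names
heatmap(matcars, Colv = NA, Rowv = NA, scale = "none")
```

In the unscaled plot, the contrast is concentrated in the “Displacement” and “Horsepower” columns, while the remaining columns look almost uniform.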

Coloring options for the heatmap

The choice of color for the heatmap is one of the most important aspects of creating an understandable and nice-looking representation of the data. If you do not specify the color (as in the example above) then the default color palette will be applied. However, you can use the argument col and choose from a wide variety of palettes for coloring your heatmap.

There are two ways to set a color palette for the heatmap:

The first option is to use one of the palettes that come with R: cm.colors(), heat.colors(), rainbow(), terrain.colors() or topo.colors()
The second option is to install a color palette package such as RColorBrewer
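Both options can be sketched as follows. The palette sizes (12 and 25 colors), the "Blues" palette and the variable names base_pal and blues are arbitrary choices for this example, and the second part only runs if RColorBrewer is installed.

```r
data("mtcars")
matcars <- as.matrix(mtcars)

# Option 1: a base R palette with 12 colors, reversed so that
# higher values get the more intense colors
base_pal <- rev(heat.colors(12))
heatmap(matcars, Colv = NA, Rowv = NA, scale = "column", col = base_pal)

# Option 2: an RColorBrewer palette, interpolated to 25 colors.
# If the package is missing, install it with install.packages("RColorBrewer")
if (requireNamespace("RColorBrewer", quietly = TRUE)) {
  blues <- colorRampPalette(RColorBrewer::brewer.pal(9, "Blues"))(25)
  heatmap(matcars, Colv = NA, Rowv = NA, scale = "column", col = blues)
}
```

The number of colors passed to col determines how many bins the values are divided into, so a longer palette gives finer color gradations.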

Additional materials

The author of this entry is Evgeniya Chetneva.