Apply, Lapply and Tapply
Latest revision as of 11:25, 3 March 2021
In short: The functions of the apply()-family are useful tools to go through slices of your data and repetitively perform operations on them. In more sophisticated terms, these functions offer a concise and convenient way of implementing the SPLIT-APPLY-COMBINE-strategy of data analysis, i.e. splitting up some data into smaller pieces, applying a function to each piece and combining the results.
Why use apply()-functions?
One main advantage of the apply()-functions is that they let you iterate over your data without writing explicit loops. Although loops in R are not necessarily much slower than the apply()-collection, their syntax is more verbose and repetitive. The various apply()-functions offer a convenient and simplified way of accessing specific elements of data.
More precisely, in contrast to regular loops, the apply()-functions take a top-down approach syntactically, making them easier to read and understand. Where loops start from the smallest element (`for (i in x) ...`), with the actual operation possibly nested deep within the loop, apply()-functions call the overlying structures first (matrices, lists, ...) before immediately *applying* a function that automatically runs over each element of the called object.
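The difference in reading direction can be illustrated with a minimal sketch. The small matrix m below is hypothetical and only serves as an example; both variants compute the column sums.

```r
# Hypothetical 2 x 3 matrix, used for illustration only
m <- matrix(1:6, nrow = 2, dimnames = list(c("a", "b"), c("x", "y", "z")))

# Bottom-up: the loop starts from the index j, and the actual
# operation (sum) is placed inside the loop body
col_sums_loop <- numeric(ncol(m))
names(col_sums_loop) <- colnames(m)
for (j in seq_len(ncol(m))) {
  col_sums_loop[j] <- sum(m[, j])
}

# Top-down: name the object, then the margin, then the function
col_sums_apply <- apply(m, 2, sum)

col_sums_loop
col_sums_apply
```

Both objects hold the same three values; the apply() version simply needs no pre-allocation and no index bookkeeping.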
Concerning speed, vectorized functions such as colMeans() still run faster than the equivalent call to an apply()-function, for example apply(x, 2, mean). The great benefit of the apply()-collection is its versatility, as you can pass any function, predefined or user-defined, to its members.
There is a whole range of apply()-functions, namely apply(), lapply(), sapply(), vapply(), tapply(), rapply() and mapply(). Which one you use depends mainly on the type of your input and on the output you want to get. This article focuses mainly on apply(), lapply() and tapply().
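As a rough sketch of this trade-off, the snippet below uses the built-in USPersonalExpenditure matrix. It shows that the vectorized and the apply() variant give the same column means, and that apply() also accepts a user-defined function for which no vectorized counterpart is assumed to exist. The names means_apply, means_vectorized and ranges are chosen for illustration.

```r
# Same result, two routes: generic apply() vs. specialised colMeans()
means_apply      <- apply(USPersonalExpenditure, 2, mean)
means_vectorized <- colMeans(USPersonalExpenditure)
all.equal(means_apply, means_vectorized)

# Versatility: a user-defined function (spread between the largest
# and the smallest expenditure category per year)
ranges <- apply(USPersonalExpenditure, 2, function(x) max(x) - min(x))
ranges
```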
apply()
The function apply() lets you iterate over the rows, columns or cells of a matrix and run a function on them, returning a vector, list or array as the output.
It has the following structure, taking three arguments:
apply(X, MARGIN, FUN, ...)
X is the input and should be a matrix or an array, while FUN can be any function that should be applied to the data. Be sure to pass the function's name without parentheses, e.g. sum rather than sum().
MARGIN specifies if the function should be run on each of the rows, columns or cells of the matrix. With MARGIN = 1, you iterate over the rows, with MARGIN = 2, you iterate over the columns, and with MARGIN = 1:2, you iterate over each cell.
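The effect of MARGIN is easiest to see from the shape of the result. The following sketch uses a hypothetical 2 x 3 matrix m; the variable names are chosen for illustration.

```r
m <- matrix(1:6, nrow = 2)            # hypothetical matrix: 2 rows, 3 columns

row_sums   <- apply(m, 1, sum)        # MARGIN = 1: one result per row
col_sums   <- apply(m, 2, sum)        # MARGIN = 2: one result per column
cell_roots <- apply(m, 1:2, sqrt)     # MARGIN = 1:2: one result per cell

length(row_sums)    # as many values as m has rows
length(col_sums)    # as many values as m has columns
dim(cell_roots)     # same dimensions as m
```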
Let's use the apply()-function to analyze an actual dataset, in this case, USPersonalExpenditure.
USPersonalExpenditure
## Output:
##
##                       1940   1945  1950 1955  1960
## Food and Tobacco    22.200 44.500 59.60 73.2 86.80
## Household Operation 10.500 15.500 29.00 36.5 46.20
## Medical and Health   3.530  5.760  9.71 14.0 21.10
## Personal Care        1.040  1.980  2.45  3.4  5.40
## Private Education    0.341  0.974  1.80  2.6  3.64
Let's say we want to find out the total expenditures for each year. We set MARGIN to 2 to iterate over the columns and apply the function "sum" to each of them.
apply(USPersonalExpenditure, MARGIN = 2, FUN = sum)
## Output:
##
##    1940    1945    1950    1955    1960
##  37.611  68.714 102.560 129.700 163.140
Likewise, we can look at the rows and compute the mean. Notice that you don't need to write down the names of the arguments if you give them in the default order.
apply(USPersonalExpenditure, 1, mean)
## Output:
##
##    Food and Tobacco Household Operation  Medical and Health       Personal Care
##              57.260              27.540              10.820               2.854
##   Private Education
##               1.871
Storing the result in a variable and looking at its class, we can see that in this case, apply() returns a (named) numeric vector.
mean_Exp <- apply(USPersonalExpenditure, 1, mean)
class(mean_Exp)
## Output:
## [1] "numeric"
In case you want to perform an action on each cell of the matrix, this is also possible. Let's say you want to convert the dollar values in USPersonalExpenditure to another currency, i.e. multiply each value by some exchange rate. Since apply() accepts any function, not just predefined ones, this can easily be done by writing a short anonymous function that takes each value as an argument and computes the new value. In fact, the ability to insert anonymous functions into apply()-structures is a major feature of the collection as a whole.
apply(USPersonalExpenditure, 1:2, function(x) {x * 0.82})
## Output:
##
##                         1940     1945    1950   1955    1960
## Food and Tobacco    18.20400 36.49000 48.8720 60.024 71.1760
## Household Operation  8.61000 12.71000 23.7800 29.930 37.8840
## Medical and Health   2.89460  4.72320  7.9622 11.480 17.3020
## Personal Care        0.85280  1.62360  2.0090  2.788  4.4280
## Private Education    0.27962  0.79868  1.4760  2.132  2.9848
As you can see, apply() returns another matrix this time instead of a vector. You might also use apply() with dataframes. Be aware, though, that apply() first converts the dataframe into a matrix, so all columns should have the same data type; otherwise they are coerced to one single type (for example, everything becomes character as soon as one column is a factor). Therefore, either exclude problematic columns first or use another function.
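The coercion can be observed with the built-in iris dataframe, which has four numeric columns and one factor column. This is a small sketch; the variable names are chosen for illustration.

```r
# apply() converts iris to a matrix first; because of the factor column
# Species, every column ends up as character
classes_via_apply <- apply(iris, 2, class)
classes_via_apply

# sapply() works column by column and keeps the original types
classes_via_sapply <- sapply(iris, class)
classes_via_sapply

# Excluding the problematic column first makes apply() safe again
numeric_means <- apply(iris[sapply(iris, is.numeric)], 2, mean)
numeric_means
```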
lapply()
lapply() is quite similar in its use to apply(), but whereas the latter iterates over elements of a matrix, lapply() iterates over elements of a list. The returned object is also always a list.
`lapply(X, FUN, ...)`
where X is the input list and FUN the applied function.
list_characters <- list("Steve", "Mary", 2, 4, "Lisa")
lapply(list_characters, is.character)
## Output:
##
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] FALSE
##
## [[4]]
## [1] FALSE
##
## [[5]]
## [1] TRUE
Again, you can write your own function that lapply() should take as an argument.
list_numbers <- list(24, 12, 9, 9, 22, 24, 7)
decode_func <- function(x) {letters[length(letters) - x + 1]}
decode <- lapply(list_numbers, decode_func)
unlist(decode)
## Output:
## [1] "c" "o" "r" "r" "e" "c" "t"
You can very well use lapply() with vectors, dataframes or matrices, but it will always transform them into lists. In the example, the columns of the dataframe infert are turned into elements of a list:
lapply(infert, class)
## Output:
##
## $education
## [1] "factor"
##
## $age
## [1] "numeric"
##
## $parity
## [1] "numeric"
##
## $induced
## [1] "numeric"
##
## $case
## [1] "numeric"
##
## $spontaneous
## [1] "numeric"
##
## $stratum
## [1] "integer"
##
## $pooled.stratum
## [1] "numeric"
lapply() can be particularly useful when you want to nest several operations, for example running a whole set of functions on each part of your data. This is far more convenient than using nested for-loops.
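One possible reading of this is sketched below: an outer lapply() walks over the columns of the built-in faithful dataframe, and an inner lapply() walks over a list of functions. The list funs and the name col_stats are assumptions made for this example.

```r
# Hypothetical set of summary functions to run on every column
funs <- list(mean = mean, median = median, sd = sd)

# Outer lapply(): one list element per column of faithful
# Inner lapply(): one value per function, flattened with unlist()
col_stats <- lapply(faithful, function(column) {
  unlist(lapply(funs, function(f) f(column)))
})

col_stats$waiting
```

The equivalent for-loop version would need two nested loops plus a pre-allocated container for the results.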
sapply()
In many situations, working with list data can be inconvenient. One solution to this problem would be to use sapply(), which will always try to simplify the result of lapply() into either a vector or a matrix.
sapply(infert, class)
## Output:
##
##      education            age         parity        induced           case    spontaneous
##       "factor"      "numeric"      "numeric"      "numeric"      "numeric"      "numeric"
##        stratum pooled.stratum
##      "integer"      "numeric"
As each of the output values above (the elements of the list that lapply() would return) has length one, sapply() returns a vector. If, on the other hand, the results are vectors that all share the same length greater than one, sapply() returns a matrix:
sapply(faithful, quantile)
## Output:
##
##      eruptions waiting
## 0%     1.60000      43
## 25%    2.16275      58
## 50%    4.00000      76
## 75%    4.45425      82
## 100%   5.10000      96
Should the output values have different lengths, sapply() cannot simplify. In the example below, the result for the $education column is a vector of length 3, whereas the others have length 6. As a consequence, sapply() returns a list and works just like lapply() does.
sapply(infert, summary)
## Output:
##
## $education
##  0-5yrs 6-11yrs 12+ yrs
##      12     120     116
##
## $age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   21.00   28.00   31.00   31.50   35.25   44.00
##
## $parity
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   2.093   3.000   6.000
##
## $induced
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.5726  1.0000  2.0000
##
## $case
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.3347  1.0000  1.0000
##
## $spontaneous
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0000  0.0000  0.0000  0.5766  1.0000  2.0000
##
## $stratum
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   21.00   42.00   41.87   62.25   83.00
##
## $pooled.stratum
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   19.00   36.00   33.58   48.25   63.00
vapply()
While sapply() is a convenient function to quickly get handy results, one cannot be sure which type of object it will return. This may pose a problem if you have long, complex code. To tackle that issue, the function vapply() lets you specify the type and length of the output. Thus, you can catch errors early and make debugging much easier. It may be viewed as a safer version of sapply().
Consequently, vapply() takes three arguments
vapply(X, FUN, FUN.VALUE, ...)
where FUN.VALUE represents the output type.
vapply(infert, class, character(1))
## Output:
##
##      education            age         parity        induced           case    spontaneous
##       "factor"      "numeric"      "numeric"      "numeric"      "numeric"      "numeric"
##        stratum pooled.stratum
##      "integer"      "numeric"
vapply() accepts the output values above, since each is a character vector of length one, as specified. In the example below, however, it gives an error message, because not all the output values are numeric vectors of length 6.
vapply(infert, summary, numeric(6))
## Output:
##
## Error in vapply(infert, summary, numeric(6)) : values must be length 6,
##  but FUN(X[[1]]) result is length 3
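A case where the unpredictable return type of sapply() becomes visible is empty input. The sketch below compares both functions on an empty list and additionally shows one way of catching the vapply() error from above with tryCatch(); the variable names are chosen for illustration.

```r
empty_input <- list()

# sapply() has nothing to simplify and falls back to an empty list
res_sapply <- sapply(empty_input, length)
class(res_sapply)

# vapply() still returns the promised type: an empty integer vector
res_vapply <- vapply(empty_input, length, integer(1))
class(res_vapply)

# The error from a violated FUN.VALUE can be caught and inspected
checked <- tryCatch(
  vapply(infert, summary, numeric(6)),
  error = function(e) conditionMessage(e)
)
checked
```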
tapply()
Often, you do not only want to look at individual columns or rows but to separate your data further into categories before going into analysis. With tapply(), you can first split your data into groups, and only then apply a function to each of the groups.
tapply() takes at least three arguments
tapply(X, INDEX, FUN, ...)
where INDEX represents the grouping variable, which splits the input X into as many groups as the grouping variable has levels. Naturally, this makes most sense with factor variables.
In the following example, the variable Petal.Length is grouped by the Species variable; subsequently, the function mean() is applied to each group.
tapply(iris$Petal.Length, iris$Species, mean)
## Output:
##
##     setosa versicolor  virginica
##      1.462      4.260      5.552
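To make the SPLIT-APPLY-COMBINE idea behind tapply() explicit, the three steps can also be written out by hand with split(), lapply() and unlist(). This is only a sketch for comparison; the variable names are chosen for illustration.

```r
# SPLIT: one vector of petal lengths per species
pieces <- split(iris$Petal.Length, iris$Species)

# APPLY: compute the mean of each piece
means_list <- lapply(pieces, mean)

# COMBINE: collapse the list into a named vector
means_manual <- unlist(means_list)

# tapply() does all three steps in a single call
means_tapply <- tapply(iris$Petal.Length, iris$Species, mean)

all.equal(as.vector(means_manual), as.vector(means_tapply))
```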
It is possible to group your data by several categorical variables, if you provide a list of variables to the INDEX-argument. In the example below, using the *UCBAdmissions* dataset, tapply() provides a matrix showing the total student applications by gender and department.
admit_df <- data.frame(UCBAdmissions)
tapply(admit_df$Freq, list(admit_df$Gender, admit_df$Dept), sum)
## Output:
##          A   B   C   D   E   F
## Male   825 560 325 417 191 373
## Female 108  25 593 375 393 341
Again, you can write your own function for tapply(), for example if you want to check the admission rate for each gender and department. Within each group, the first value x[1] is the number of admitted students and sum(x) is the total number of applicants:
tapply(admit_df$Freq, list(admit_df$Gender, admit_df$Dept), function(x) {x[1]/sum(x)})
## Output:
##                A         B         C         D         E          F
## Male   0.6206061 0.6303571 0.3692308 0.3309353 0.2774869 0.05898123
## Female 0.8240741 0.6800000 0.3406408 0.3493333 0.2391858 0.07038123
You can easily combine several apply()-functions to get the information you are interested in. In the example below, the admission rates are first computed per group with tapply(), and then summarized by apply() to do a statistical analysis of male and female admission rates.
ratio <- tapply(admit_df$Freq, list(admit_df$Gender, admit_df$Dept), function(x) {x[1]/sum(x)})
apply(ratio, 1, summary)
## Output:
##
##               Male     Female
## Min.    0.05898123 0.07038123
## 1st Qu. 0.29084900 0.26454952
## Median  0.35008301 0.34498707
## Mean    0.38126623 0.41726920
## 3rd Qu. 0.55776224 0.59733333
## Max.    0.63035714 0.82407407
When is it better to use actual loops?
The functions of the apply()-family are built to loop over static data. They process each data point only once and independently of all others. Thus, the apply()-functions do not work for tasks in which one step builds on the result of the previous one. In such cases, where the output depends on the previous output as well as on the input, loops are the way to go.
Similarly, apply()-functions cannot replace while-loops when it is not known in advance at which point the looping has to stop.
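A minimal sketch of such a dependent computation is given below. The deposits and the interest rate are hypothetical numbers; the point is that each balance needs the balance of the previous step, so the elements cannot be processed independently.

```r
# Hypothetical data: yearly deposits and a fixed interest rate
deposits <- c(100, 50, 75, 20)
rate     <- 0.05

balance    <- numeric(length(deposits))
balance[1] <- deposits[1]

# Each step depends on the result of the previous step
for (i in seq_along(deposits)[-1]) {
  balance[i] <- balance[i - 1] * (1 + rate) + deposits[i]
}

balance
```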
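As an illustration of an open-ended stopping rule, the sketch below approximates the square root of 2 with Newton's method. The tolerance and the starting value are assumptions made for this example; how many iterations are needed is not known before the loop runs.

```r
target     <- 2       # number whose square root is approximated
tolerance  <- 1e-10   # assumed stopping threshold
x          <- 1       # assumed starting value
iterations <- 0

# Keep refining x until the squared estimate is close enough to the target
while (abs(x^2 - target) > tolerance) {
  x <- (x + target / x) / 2
  iterations <- iterations + 1
}

x
iterations
```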
Tidyverse vs. Base R
The map_()-functions of the purrr-package present an alternative for iteration tasks on data, while the dplyr-package of the Tidyverse offers a plethora of functions for data manipulation such as select(), group_by() or mutate(). Some users prefer the consistency and speed of the Tidyverse functions, while others prefer the stability and sometimes more concise code of Base R. In the end, it is a matter of taste.
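For a rough side-by-side impression, the sketch below computes the column means of the built-in faithful dataframe once with sapply() and once with map_dbl() from purrr. It assumes that the purrr-package may not be installed, so the Tidyverse part only runs if the package is available.

```r
# Base R: sapply() simplifies the result to a named numeric vector
base_result <- sapply(faithful, mean)
base_result

# Tidyverse: map_dbl() always returns a double vector
# (guarded, because purrr is an optional add-on package)
if (requireNamespace("purrr", quietly = TRUE)) {
  purrr_result <- purrr::map_dbl(faithful, mean)
  print(all.equal(base_result, purrr_result))
}
```

In spirit, map_dbl() is closer to vapply(x, f, numeric(1)) than to sapply(), since the output type is fixed by the function name.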