The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find them suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.
Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:
library(dplyr)
starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):
starwars_mean <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean(homeworld)
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
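As a small aside, one way to convince yourself of what those two additions do is to look at them in isolation. The sketch below assumes that rlang (a dependency of dplyr) provides as_label(), as recent versions do; captured_name() and by_var() are hypothetical helpers made up for illustration, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: returns the expression the caller typed, as text.
captured_name <- function(var) {
  var <- enquo(var)       # capture the expression instead of evaluating it
  rlang::as_label(var)    # convert the captured expression to a string
}

captured_name(homeworld)

# Hypothetical helper: groups by the captured name and reports the
# resulting grouping variables.
by_var <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%   # unquote: behaves as if the name had been typed here
    group_vars()
}

by_var(homeworld)
```

Both calls should report homeworld, not var, which is the whole point of the quote-then-unquote pattern.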
The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we just have to (and I quote):
- Change dplyr verbs to their matching seplyr “*_se()” adapters.
- Add quote marks around names and expressions.
- Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
- Replace “=” in expressions with “:=”.
This is the result:
library(seplyr)
starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to replace enquo() with as.name() to convert the string to a name:
starwars_mean <- function(var) {
  var <- as.name(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
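The same idea stretches to more than one grouping column. What follows is only a sketch, assuming rlang’s syms() and the splice operator (!!!) are available, as they are in the rlang versions that ship alongside dplyr; starwars_mean_by() is a hypothetical name chosen for this illustration.

```r
library(dplyr)

# Hypothetical variant: `vars` is a character vector of column names.
starwars_mean_by <- function(vars) {
  vars <- rlang::syms(vars)    # turn each string into a name
  starwars %>%
    group_by(!!!vars) %>%      # splice the names in as separate arguments
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean_by(c("homeworld", "species"))
```

The pipeline itself is untouched once more; only the line that prepares the grouping variables changes.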
But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:
starwars_mean <- function(var) {
  starwars %>%
    group_by_at(var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}
starwars_mean("homeworld")
## # A tibble: 49 x 4
## homeworld mean_height mean_mass count
## <chr> <dbl> <dbl> <int>
## 1 Alderaan 176.3333 64.0 3
## 2 Aleen Minor 79.0000 15.0 1
## 3 Bespin 175.0000 79.0 1
## 4 Bestine IV 180.0000 110.0 1
## 5 Cato Neimoidia 191.0000 90.0 1
## 6 Cerea 198.0000 82.0 1
## 7 Champala 196.0000 NaN 1
## 8 Chandrila 150.0000 NaN 1
## 9 Concord Dawn 183.0000 79.0 1
## 10 Corellia 175.0000 78.5 2
## # ... with 39 more rows
That’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:
grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(result_names := expressions,
                   "count" := "n()"))
}
starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
## eye_color mean_mass mean_birth_year count
## <chr> <dbl> <dbl> <int>
## 1 black 76.28571 33.00000 10
## 2 blue 86.51667 67.06923 19
## 3 blue-gray 77.00000 57.00000 1
## 4 brown 66.09231 108.96429 21
## 5 dark NaN NaN 1
## 6 gold NaN NaN 1
## 7 green, yellow 159.00000 NaN 1
## 8 hazel 66.00000 34.50000 3
## 9 orange 282.33333 231.00000 8
## 10 pink NaN NaN 1
## 11 red 81.40000 33.66667 5
## 12 red, blue NaN NaN 1
## 13 unknown 31.50000 NaN 3
## 14 white 48.00000 NaN 1
## 15 yellow 81.11111 76.38000 11
And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get the same column names as before, but it is not really necessary; count also comes back as a double here, because it goes through mean()):
grouped_mean <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by_at(grouping_variables) %>%
    # count is constant within each group, so its mean is the group size
    mutate(count = n()) %>%
    summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
    # funs() is the dplyr 0.7.x idiom; it was deprecated in dplyr 0.8.0
    rename_at(value_variables, funs(paste0("mean_", .)))
}
starwars %>%
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
## eye_color mean_mass mean_birth_year count
## <chr> <dbl> <dbl> <dbl>
## 1 black 76.28571 33.00000 10
## 2 blue 86.51667 67.06923 19
## 3 blue-gray 77.00000 57.00000 1
## 4 brown 66.09231 108.96429 21
## 5 dark NaN NaN 1
## 6 gold NaN NaN 1
## 7 green, yellow 159.00000 NaN 1
## 8 hazel 66.00000 34.50000 3
## 9 orange 282.33333 231.00000 8
## 10 pink NaN NaN 1
## 11 red 81.40000 33.66667 5
## 12 red, blue NaN NaN 1
## 13 unknown 31.50000 NaN 3
## 14 white 48.00000 NaN 1
## 15 yellow 81.11111 76.38000 11
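For readers on a much newer dplyr than the 0.7.x discussed here, the same function can be sketched with across(), which assumes dplyr 1.0.0 or later and is therefore not something the code above could have used. The name grouped_mean_across() is hypothetical; the arguments mirror those of grouped_mean().

```r
library(dplyr)

# Hypothetical rewrite of grouped_mean(); assumes dplyr >= 1.0.0.
grouped_mean_across <- function(data, grouping_variables, value_variables) {
  data %>%
    # select the grouping columns from a character vector
    group_by(across(all_of(grouping_variables))) %>%
    # average each value column and prefix the result names with "mean_"
    summarise(across(all_of(value_variables),
                     ~ mean(.x, na.rm = TRUE),
                     .names = "mean_{.col}"),
              count = n())
}

starwars %>%
  grouped_mean_across("eye_color", c("mass", "birth_year"))
```

Under those assumptions, the renaming step is folded into the summary and count stays an integer, since it no longer passes through mean().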
Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.