Programming with dplyr by using dplyr

The title may seem tautological, but since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has raised passions, for and against it, for some time already. There are excellent alternatives out there, and I myself use them when I find it suitable. But when I choose to use dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.

Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:

library(dplyr)

starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):

starwars_mean <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean(homeworld)
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
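A quick way to see what enquo() hands over to the pipeline is to return the captured expression instead of using it. This is only a sketch: the helper name show_captured() is hypothetical, and rlang::quo_get_expr() is used here purely for inspection (rlang is a dependency of dplyr, so it should already be installed).

```r
library(dplyr)

# Hypothetical helper, for illustration only: it captures its argument
# and returns the expression without ever evaluating it.
show_captured <- function(var) {
  var <- enquo(var)           # capture the expression typed by the caller
  rlang::quo_get_expr(var)    # extract the bare expression from the quosure
}

captured <- show_captured(homeworld)

is.symbol(captured)      # the argument arrives as a name, not as a value
as.character(captured)   # the text of that name
```

No object called homeworld needs to exist in the calling environment, because the argument is never evaluated; the name is only looked up later, inside the data, when !! places it in group_by().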

The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes and uses that name instead of var. That’s it. What about seplyr? With seplyr, we just have to (and I quote)

  • Change dplyr verbs to their matching seplyr “*_se()” adapters.
  • Add quote marks around names and expressions.
  • Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
  • Replace “=” in expressions with “:=”.

This is the result:

library(seplyr)

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But, wait, we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough, we just need to replace enquo() with as.name() to convert the string to a name:

starwars_mean <- function(var) {
  var <- as.name(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
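The same idea extends to several strings at once. As a sketch that goes slightly beyond the post, rlang::syms() turns a character vector into a list of names and !!! splices that list into group_by(). The function name starwars_mean_multi() is made up for this example.

```r
library(dplyr)

# Hypothetical variant of starwars_mean() that accepts a character vector
# with one or more grouping columns.
starwars_mean_multi <- function(vars) {
  vars <- rlang::syms(vars)     # one name per string
  starwars %>%
    group_by(!!!vars) %>%       # splice all the names into the call
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

res <- starwars_mean_multi(c("homeworld", "species"))
```

With a single string this should behave like the as.name() version above, since rlang::syms() then returns a list holding one name.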

But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:

starwars_mean <- function(var) {
  starwars %>%
    group_by_at(var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows
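The scoped variants are told apart by their suffix: _at selects columns by name or position, _if selects them with a predicate, and _all takes every non-grouping column. Here is a small sketch combining two of them, assuming these verbs are available in the installed dplyr; the function name starwars_numeric_means() is hypothetical.

```r
library(dplyr)

# Hypothetical example: group by a column given as a string, then
# average every numeric column without naming any of them.
starwars_numeric_means <- function(var) {
  starwars %>%
    group_by_at(var) %>%                            # select by name
    summarise_if(is.numeric, mean, na.rm = TRUE)    # select by predicate
}

res <- starwars_numeric_means("homeworld")
names(res)
```

The grouping column is left out of the summary automatically, so only the numeric columns of starwars are averaged.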

That’s it: no bang-bang, just strings and only one change to the original code. Let’s dwell on the potential of the scoped variants with a final example. We can make a completely generic re-usable “grouped mean” function using seplyr and R’s paste0() function to build up expressions:

grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(result_names := expressions,
                   "count" := "n()"))
}

starwars %>% 
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <int>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11

And the same with dplyr’s scoped verbs (note that I’ve added the last rename_at() on a whim, just to get the same column names as before, but it is not really necessary; the only remaining difference is that count is now a double, because it also goes through mean()):

grouped_mean <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by_at(grouping_variables) %>%
    mutate(count = n()) %>%
    summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
    rename_at(value_variables, funs(paste0("mean_", .)))
}

starwars %>% 
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <dbl>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11
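For readers on a newer dplyr: since version 1.0.0 the scoped verbs are superseded by across(), although they remain available. The sketch below is not part of the original post and assumes dplyr 1.1 or later; grouped_mean_across() is a hypothetical name. The .names argument builds the result names, so neither the mutate() step nor rename_at() is needed.

```r
library(dplyr)

# Hypothetical across()-based counterpart of grouped_mean().
grouped_mean_across <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by(across(all_of(grouping_variables))) %>%
    summarise(
      # one mean per value column, named mean_<column>
      across(all_of(value_variables),
             ~ mean(.x, na.rm = TRUE),
             .names = "mean_{.col}"),
      count = n(),        # computed directly, so it stays an integer
      .groups = "drop"    # return an ungrouped tibble
    )
}

res <- starwars %>%
  grouped_mean_across("eye_color", c("mass", "birth_year"))

# Several grouping variables work the same way.
res2 <- starwars %>%
  grouped_mean_across(c("hair_color", "eye_color"), c("mass", "birth_year"))
```

Dropping the groups explicitly with .groups = "drop" is a design choice of this sketch: the result is a plain tibble regardless of how many grouping variables were passed.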

Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.

Posted in R

18 comments on “Programming with dplyr by using dplyr”

  1. This is the first I’m reading about these *_at functions. The dplyr programming vignette makes no mention of them. The newer tidyeval method is definitely less confusing than the old lazyeval method.

  2. Some of these scoped variants have been there for quite a while, and they have been completed and improved a lot with dplyr 0.7.x. But you are right, they are not mentioned in that vignette, and there is only one mention in the others (in this one). Completely agree also about tidyeval vs. lazyeval.

  3. Completely agree with you! It is kind of absurd to use dplyr by not using dplyr….great post!

  4. Iñaki –
    Totally agree with you.

    Great post,
    with super easy examples
    and very clear explanations.

    thank you so much,,,

  5. Iñaki,
    A quick question.
    In the very last function example:
    starwars %>%
    grouped_mean("eye_color", c("mass", "birth_year"))

    Q:
    Can you have more than 1 grouping variable?
    ie:
    "eye_color" AND "hair_color" vars for grouping

    I tried:
    starwars %>%
    grouped_mean(c("hair_color", "eye_color"), c("mass", "birth_year"))
    but got this R message:
    Error: All arguments must be named

  6. Yes, you can. Apparently, the error is raised by “rename_at”, and the problem seems to be the remaining grouping variable (summaries always drop one grouping variable). But I don’t understand why (and the error is not so informative), so it could be a bug. You can solve this by placing “ungroup” right before “rename_at” in “grouped_mean”.

  7. What if we really need to provide the grouping variable as a string? Easy enough, we just need to change enquo() with as.name() to convert the string to a name:

    Then I ask you: so, if I really need to provide a variable with a string sometimes, but I also want the first version of the function (let’s say, the interactive one, i.e. provide the variable itself as a name, instead of using strings)?

    Do I need to write two different functions with the same code?

    Cause I’ve tried and I could not write a if-clause to verify if the variable was provided as a string or as a name.

  8. That’s simply not possible. Suppose that you are trying to pass “homeworld” as a name, but there is a “homeworld” variable with a string stored in it… there’s no way to resolve this.

  9. Hi Iñaki,

    as you suggested,
    I did place “ungroup”
    right before “rename_at” in the “grouped_mean” function:

    grouped_mean %
    group_by_at(grouping_variables) %>%
    mutate(count = n()) %>%
    summarise_at( c(value_variables, "count"), mean, na.rm = TRUE) %>% ungroup() %>%
    rename_at(value_variables, funs(paste0("mean_", .)))
    }

    But, then….
    starwars %>%
    grouped_mean( c(“hair_color”,”eye_color”), c(“mass”, “birth_year”) )

    gives this R message…
    Error: unexpected input in:
    "starwars %>%
    grouped_mean( c(“"

    Maybe I misunderstood your instructions,
    (still working my way up in R coding…).

    Can you please include
    the exact R code
    for grouped_mean with ungroup() ,
    that works OK for you?.

    Thanks Iñaki!! :-)

  10. sorry, the copy and paste
    of the grouped_mean() function
    with ungroup()
    (in my comment above),
    missed the <- symbol in the first line.
    It should read:

    grouped_mean %
    mutate(count = n()) %>%
    summarise_at( c(value_variables, “count”), mean, na.rm = TRUE) %>%
    ungroup() %>%
    rename_at(value_variables, funs(paste0(“mean_”, .)))
    }

  11. OK,
    the blog comment system
    will always substitute a single percentage sign %,
    for any R assignment symbol I use…. < –

  12. The best way of writing code is using the “pre” HTML tag:

    grouped_mean <- function(data, grouping_variables, value_variables) {
      data %>%
        group_by_at(grouping_variables) %>%
        mutate(count = n()) %>%
        summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
        ungroup() %>%
        rename_at(value_variables, funs(paste0("mean_", .)))
    }
    
    starwars %>% 
      grouped_mean("eye_color", c("mass", "birth_year"))
    

    Your code is ok. The problem was that you were probably copying from here and pasting directly into R, so that the quotes were not these ", but these “. Try again replacing the quotes.

  13. Thanks 10^6 Iñaki,
    all works now!

    Your 2 points to me
    make perfect sense.
    (Comments code inside PRE tags
    and check for straight double-quotes…).

    Great Blog and posts!.
    (+following you on Twitter now).

    Have a super day!
    SFer

Comments are closed.