Programming with dplyr by using dplyr

The title may seem tautological but, since the arrival of dplyr 0.7.x, there have been some efforts at using dplyr without actually using it that I can’t quite understand. The tidyverse has stirred passions, both for and against it, for quite some time. There are excellent alternatives out there, and I use them myself when I find them suitable. But when I choose dplyr, I find it most versatile, and I see no advantage in adding yet another layer that complicates things and makes problems even harder to debug.

Take the example of seplyr. It stands for standard evaluation dplyr, and enables us to program over dplyr without having “to bring in (or study) any deep-theory or heavy-weight tools such as rlang/tidyeval”. Let’s consider the following interactive pipeline:

library(dplyr)

starwars %>%
  group_by(homeworld) %>%
  summarise(mean_height = mean(height, na.rm = TRUE),
            mean_mass = mean(mass, na.rm = TRUE),
            count = n())
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

Let’s say we want to parametrise the grouping variable and wrap the code above into a re-usable function. Apparently, this is difficult with dplyr. But is it? Not at all: we just need to add one line and a bang-bang (!!):

starwars_mean <- function(var) {
  var <- enquo(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean(homeworld)
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

The enquo() function quotes the name we put in our function (homeworld), and the bang-bang unquotes it and uses that name instead of var. That’s it.
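
The same mechanism can be tried outside a function with quo(), the counterpart of enquo() for interactive use. Here is a minimal sketch (assuming dplyr 0.7.x, which re-exports quo() from rlang):

var <- quo(homeworld)     # quote: capture the bare name as a quosure
starwars %>%
  group_by(!!var) %>%     # unquote: splice the captured name into the call
  summarise(count = n())

What about seplyr? With seplyr, we just have to (and I quote)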

  • Change dplyr verbs to their matching seplyr “*_se()” adapters.
  • Add quote marks around names and expressions.
  • Convert sequences of expressions (such as in the summarize()) to explicit vectors by adding the “c()” notation.
  • Replace “=” in expressions with “:=”.

This is the result:

library(seplyr)

starwars_mean <- function(my_var) {
  starwars %>%
    group_by_se(my_var) %>%
    summarize_se(c("mean_height" := "mean(height, na.rm = TRUE)",
                   "mean_mass" := "mean(mass, na.rm = TRUE)",
                   "count" := "n()"))
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

Basically, we had to change the entire pipeline. If re-usability was the goal, I think we lost some of it here. But wait: we are still using non-standard evaluation in the first example. What if we really need to provide the grouping variable as a string? Easy enough: we just need to replace enquo() with as.name() to convert the string into a name:

starwars_mean <- function(var) {
  var <- as.name(var)
  starwars %>%
    group_by(!!var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

But we can do even better if we remember that dplyr provides scoped variants (see ?dplyr::scoped) for most of the verbs. In this case, group_by_at() comes in handy:

starwars_mean <- function(var) {
  starwars %>%
    group_by_at(var) %>%
    summarise(mean_height = mean(height, na.rm = TRUE),
              mean_mass = mean(mass, na.rm = TRUE),
              count = n())
}

starwars_mean("homeworld")
## # A tibble: 49 x 4
##         homeworld mean_height mean_mass count
##             <chr>       <dbl>     <dbl> <int>
##  1       Alderaan    176.3333      64.0     3
##  2    Aleen Minor     79.0000      15.0     1
##  3         Bespin    175.0000      79.0     1
##  4     Bestine IV    180.0000     110.0     1
##  5 Cato Neimoidia    191.0000      90.0     1
##  6          Cerea    198.0000      82.0     1
##  7       Champala    196.0000       NaN     1
##  8      Chandrila    150.0000       NaN     1
##  9   Concord Dawn    183.0000      79.0     1
## 10       Corellia    175.0000      78.5     2
## # ... with 39 more rows

That’s it: no bang-bang, just strings and only one change to the original code. Let’s explore the potential of the scoped variants with a final example: a completely generic, re-usable “grouped mean” function. First with seplyr, using R’s paste0() function to build up the expressions:

grouped_mean <- function(data, grouping_variables, value_variables) {
  result_names <- paste0("mean_", value_variables)
  expressions <- paste0("mean(", value_variables, ", na.rm = TRUE)")
  data %>%
    group_by_se(grouping_variables) %>%
    summarize_se(c(result_names := expressions,
                   "count" := "n()"))
}

starwars %>% 
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <int>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11

And the same with dplyr’s scoped verbs. Note the mutate(count = n()) trick: after grouping, count is constant within each group, so taking its mean in summarise_at() simply recovers it (now as a double instead of an integer, as the output shows). I’ve added the last rename_at() on a whim, just to get exactly the same output as before, but it is not really necessary:

grouped_mean <- function(data, grouping_variables, value_variables) {
  data %>%
    group_by_at(grouping_variables) %>%
    mutate(count = n()) %>%
    summarise_at(c(value_variables, "count"), mean, na.rm = TRUE) %>%
    rename_at(value_variables, funs(paste0("mean_", .)))
}

starwars %>% 
  grouped_mean("eye_color", c("mass", "birth_year"))
## # A tibble: 15 x 4
##        eye_color mean_mass mean_birth_year count
##            <chr>     <dbl>           <dbl> <dbl>
##  1         black  76.28571        33.00000    10
##  2          blue  86.51667        67.06923    19
##  3     blue-gray  77.00000        57.00000     1
##  4         brown  66.09231       108.96429    21
##  5          dark       NaN             NaN     1
##  6          gold       NaN             NaN     1
##  7 green, yellow 159.00000             NaN     1
##  8         hazel  66.00000        34.50000     3
##  9        orange 282.33333       231.00000     8
## 10          pink       NaN             NaN     1
## 11           red  81.40000        33.66667     5
## 12     red, blue       NaN             NaN     1
## 13       unknown  31.50000             NaN     3
## 14         white  48.00000             NaN     1
## 15        yellow  81.11111        76.38000    11

Wrapping up, the tidyeval paradigm may seem difficult at first glance, but don’t miss the wood for the trees: the new version of dplyr is full of tools that will make your life easier, not harder.

constants 0.0.1

The new constants package is available on CRAN. This small package provides the CODATA 2014 internationally recommended values of the fundamental physical constants (universal, electromagnetic, physicochemical, atomic…) as symbols for direct use within the R language. Optionally, values with errors and/or values with units are also provided if the errors and/or the units packages are installed.

But, what is CODATA? The Committee on Data for Science and Technology (CODATA) is an interdisciplinary committee of the International Council for Science. The Task Group on Fundamental Constants periodically provides the internationally accepted set of values of the fundamental physical constants. The version currently in force is the “2014 CODATA”, published on 25 June 2015.

This package wraps the codata dataset, defines unique symbols for each one of the 237 constants, and provides them enclosed in three sets of symbols: syms, syms_with_errors and syms_with_units.

library(constants)

# the speed of light
with(syms, c0)
## [1] 299792458
# explore which constants are available
lookup("planck constant", ignore.case=TRUE)
##                  quantity  symbol            value      unit
## 7         Planck constant       h  6.626070040e-34       J s
## 8         Planck constant    h_eV  4.135667662e-15      eV s
## 9         Planck constant    hbar         h/(2*pi)       J s
## 10        Planck constant hbar_eV      h_eV/(2*pi)      eV s
## 11        Planck constant hbar.c0      197.3269788    MeV fm
## 212 molar Planck constant    Na.h 3.9903127110e-10 J s mol-1
## 213 molar Planck constant Na.h.c0   0.119626565582 J m mol-1
##     rel_uncertainty            type
## 7           1.2e-08       universal
## 8           6.1e-09       universal
## 9           1.2e-08       universal
## 10          6.1e-09       universal
## 11          6.1e-09       universal
## 212         4.5e-10 physicochemical
## 213         4.5e-10 physicochemical
# symbols can also be attached to the search path
attach(syms)
# the reduced Planck constant
hbar
## [1] 1.054572e-34

If the errors and/or units packages are installed on your system, constants with errors and/or units are available:

attach(syms_with_errors)
# the reduced Planck constant with error
hbar
## 1.05457180(1)e-34
attach(syms_with_units)
# the reduced Planck constant with units
hbar
## 1.054572e-34 J*s
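
Note that each attach() pushes a new environment onto the search path, so the set attached last masks the homonymous symbols attached before (which is why hbar now prints with units). When done, the sets can be detached in reverse order:

detach(syms_with_units)
detach(syms_with_errors)
detach(syms)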

The dataset is available for lazy loading:

data(codata)
head(codata)
##                             quantity    symbol        value        unit
## 1           speed of light in vacuum        c0    299792458       m s-1
## 2                  magnetic constant       mu0    4*pi*1e-7       N A-2
## 3                  electric constant  epsilon0 1/(mu0*c0^2)       F m-1
## 4 characteristic impedance of vacuum        Z0       mu0*c0           Ω
## 5  Newtonian constant of gravitation         G  6.67408e-11 m3 kg-1 s-2
## 6  Newtonian constant of gravitation G_hbar.c0  6.70861e-39    GeV-2 c4
##   rel_uncertainty      type
## 1         0.0e+00 universal
## 2         0.0e+00 universal
## 3         0.0e+00 universal
## 4         0.0e+00 universal
## 5         4.7e-05 universal
## 6         4.7e-05 universal
dplyr::count(codata, type, sort=TRUE)
## # A tibble: 15 x 2
##                          type     n
##                         <chr> <int>
##  1    atomic-nuclear-electron    31
##  2      atomic-nuclear-proton    26
##  3     atomic-nuclear-neutron    24
##  4            physicochemical    24
##  5      atomic-nuclear-helion    18
##  6        atomic-nuclear-muon    17
##  7            electromagnetic    17
##  8                  universal    16
##  9    atomic-nuclear-deuteron    15
## 10     atomic-nuclear-general    11
## 11         atomic-nuclear-tau    11
## 12      atomic-nuclear-triton    11
## 13                    adopted     7
## 14       atomic-nuclear-alpha     7
## 15 atomic-nuclear-electroweak     2

simmer 3.6.2

The second update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN, thus inaugurating a bi-monthly release cycle. I must thank Duncan Garmonsway (@nacnudus) for creating, and now maintaining, “The Bank Tutorial: Part I” vignette; Franz Fuchs for finding an important and weird memory bug (here) that prevented simmer from freeing the allocated memory (all 3.x.x versions are affected up to this release); and the Rcpp people for bearing with me while I was helplessly searching for a solution. :)

My special thanks to Kevin Ushey (@kevinushey), who finally found the bug. As it happens, the bug was neither in simmer nor in Rcpp but in magrittr: the pipe operator, in its inscrutable magic, creates a new environment for unnamed functions (instead of using the current one, as it should), and there it stores a reference to the first object in the pipe. More or less. Further details here.

Anyway, if somebody faces the same problem, know that there is a workaround: you just need to delete that hidden reference, as simmer does in this release to get rid of the memory issues.
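
The gist of that workaround can be sketched as follows (a hypothetical helper of mine, not simmer’s actual code; the real fix is referenced in the changelog below):

# hypothetical sketch: remove a binding 'nm' unexpectedly retained in the
# enclosing environment of function 'f', so that the referenced object
# becomes eligible for garbage collection
drop_hidden_ref <- function(f, nm) {
  env <- environment(f)
  if (is.environment(env) && exists(nm, envir = env, inherits = FALSE))
    rm(list = nm, envir = env)
  invisible(f)
}

Happy simmering!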

Minor changes and fixes:

  • Update “The Bank Tutorial: Part I” vignette (@nacnudus in #90).
  • Fix trap()’s handler cloning and associated test (#91).
  • Apply select()’s policy also when resources is a function (#92).
  • Accept dynamic timeouts in batches (#93).
  • Change rollback()’s default behaviour to times=Inf, i.e., infinite loop (#95).
  • Stop and throw an error when timeout() returns a missing value (#96 and #97).
  • Fix memory management: resetting the environment was clearing but not deallocating memory (#98, fixed in #99).
  • Fix object destruction: workaround for tidyverse/magrittr#146 (#98, fixed in effcb6b).

x + x is not 2x

A few days ago, Joel Courtheyn posted the following issue in the errors package repository on GitHub:

Experimenting with the new package I detected a difference in calculation of the error depending on the way a formula was written. Originally I tried to calculate the error for z1 <- (x^3 - 2y)/x^0.5 but this gave me a value which was different from the manual calculated error. When I transformed this formula to z <- x^2.5 - 2y*x^(-1/2), then I came to the right results.
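
For the record, the difference is easy to reproduce (the values below are illustrative choices of mine; what matters is that the two expressions are algebraically equivalent, and yet the propagated uncertainties differ):

library(errors)

x <- set_errors(2, 0.1)
y <- set_errors(3, 0.1)
z1 <- (x^3 - 2*y) / x^0.5     # x enters several subexpressions
z2 <- x^2.5 - 2*y * x^(-1/2)  # algebraically the same quantity
errors(z1)                    # with the errors version used in this post,
errors(z2)                    # the two propagated uncertainties differ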

As I wrote there, the TL;DR version is that both calculations are correct, but the first formula is an abuse of notation. What do I mean by that? Let us consider a shorter but more intuitive version of the issue: x + x vs. 2*x. First, we define a quantity with a relative uncertainty of 5 %:

library(errors)
options(errors.notation = "plus-minus")

x <- 30
errors(x) <- x * 0.05
x
## 30 +/- 2

Now, let us see what happens:

x + x
## 60 +/- 2
2*x
## 60 +/- 3

First of all, we need to keep in mind that measurements with errors are no longer mathematical variables: they are physical (in a broad sense) quantities. Imagine that we want to measure the width of a table, but we have a ruler that is only about half as long. So we manage to put a mark, by some means (using a string, for instance), at approximately the middle of the table. Then, we have two options: 1) measure the first half and multiply it by two, or 2) measure both halves and sum them.

Intuitively, option 1), which corresponds to the 2*x case, carries a larger uncertainty, because we are not measuring the second half of the table (and note that this is exactly what we obtained before!). But in option 2), even if the result of the second measurement matches the first one, x + x is an abuse of notation: they are different measurements, so we should write x + y instead, and the derived uncertainty is smaller.
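
We can state option 2) in code by defining a second, independent measurement, y, that happens to coincide with x (y is a new quantity introduced here for illustration):

y <- 30
errors(y) <- y * 0.05
x + y
## 60 +/- 2

This is exactly the result we obtained for x + x above: the uncertainty was propagated as if the operands were independent measurements, which is precisely the right thing to do for x + y.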

Therefore, we can scale a certain measurement, apply any function to it… but summing, multiplying or dividing a measurement by itself has no physical meaning. x + x = 2x is mathematically true, but x + x makes no physical sense: we should say x + y (even if x has the same value as y), and x + y != 2*x when it comes to propagation of the uncertainty. The errors package helps us in the arduous task of uncertainty propagation, but checking the physical correctness of the expressions of derived measurements cannot be automated, and it remains our responsibility.

Load a Python/pandas data frame from an HDF5 file into R

The title is self-descriptive, so I will not dwell on the issue at length before showing the code. Just a small note: to my knowledge, there is only one public snippet out there that addresses this particular problem. It uses the Bioc package rhdf5, and you can find it here. The main problem is that it only works when the HDF5 file contains a single data frame, which is not very useful. This gist overcomes that limitation and uses the CRAN package h5 instead: