Of course, I’m paraphrasing Dirk’s fifteenth post in the rarely rational R rambling series: #15: Tidyverse and data.table, sitting side by side … (Part 1). I very much liked it, because, although I’m a happy tidyverse user, I’m always trying not to be tied into that verse too much by replicating certain tasks with other tools (and languages) as an exercise. In this article, I’m going to repeat Dirk’s exercise in base R.
First of all, I would like to clean up the tidyverse version a little, because the original was distributed in chunks and was a little bit too verbose. We can also avoid using lubridate, because readr already parses the end_date column as a date (and that is why it is significantly slower, among other reasons). This is how I would do it:
## Getting the polls
library(tidyverse)
library(zoo)
polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))
## Wrangling the polls
polls_2016 <- polls_2016 %>%
  filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
  right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")),
             by="end_date")
## Average the polls
rolling_average <- polls_2016 %>%
  group_by(end_date) %>%
  summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
  mutate(Clinton.Margin = Clinton - Trump,
         Clinton.Avg = rollapply(Clinton.Margin, width=14,
                                 FUN=function(x){mean(x, na.rm=TRUE)},
                                 by=1, partial=TRUE, fill=NA, align="right"))
ggplot(rolling_average) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))
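A quick aside on the two less obvious steps above. The right_join against a complete daily sequence pads days without polls as NA rows, and rollapply then computes a 14-day, right-aligned rolling mean in which partial=TRUE allows shorter windows at the start of the series. A minimal sketch of both behaviours (the toy objects below are made up for illustration):
## Day 2016-01-02 has no poll; the join adds it with Clinton = NA
dates <- data.frame(end_date = as.Date("2016-01-01") + 0:2)
polls <- data.frame(end_date = as.Date("2016-01-01") + c(0, 2), Clinton = c(45, 47))
dplyr::right_join(polls, dates, by="end_date")
## A right-aligned, partial rolling mean over a width-3 window
zoo::rollapply(1:5, width=3, FUN=mean, partial=TRUE, align="right")
## [1] 1.0 1.5 2.0 3.0 4.0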
This tidyverse version, by the way, has exactly the same number of lines of code as the data.table version:
## Getting the polls
library(data.table)
library(zoo)
library(ggplot2)
pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")
## Wrangling the polls
pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[data.table(end_date = seq(min(pollsDT[,end_date]),
                                             max(pollsDT[,end_date]), by="days")), on="end_date"]
## Average the polls
pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]
ggplot(pollsDT) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))
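One data.table idiom above deserves a remark: the join X[Y, on=...] looks up each row of Y in X and keeps all of Y's rows, so it plays the role of the right_join from the tidyverse version. A toy sketch (objects made up for illustration):
X <- data.table(id = c(1, 3), v = c("a", "b"))
Y <- data.table(id = 1:3)
X[Y, on="id"]  ## three rows; v is NA for id = 2, which is missing from X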
Let’s translate this into base R. It is easier to start from the data.table version, mainly because filtering and assigning have a similar look and feel. Unsurprisingly, we have base::merge for the merge operation and stats::aggregate for the aggregation phase. base::as.Date works just fine for these dates, and utils::read.csv has the only drawback that you have to specify the separator, whereas utils::read.delim recognises the right separator by default. Without further ado, this is my version in base R:
## Getting the polls
library(zoo)
pollsB <- read.delim(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))
## Wrangling the polls
pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsB$end_date <- base::as.Date(pollsB$end_date)
endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)
## Average the polls
pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                FUN=function(x){mean(x, na.rm=TRUE)},
                                by=1, partial=TRUE, fill=NA, align="right")
plot(pollsB$end_date, pollsB$Clinton.Margin, pch=16)
lines(pollsB$end_date, pollsB$Clinton.Avg, col="blue", lwd=2)
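One subtlety in the aggregate() call deserves a note: the formula interface drops rows containing NAs by default (na.action=na.omit), which would silently discard the empty days we just padded in with merge; na.action=na.pass keeps them as NA groups. A toy sketch of the difference:
toy <- data.frame(g = c(1, 1, 2), x = c(1, 3, NA))
aggregate(x ~ g, data=toy, mean)                     ## group 2 is dropped
aggregate(x ~ g, data=toy, mean, na.action=na.pass)  ## group 2 kept, with x = NA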
And this base R version is the shortest one! Finally, let’s repeat the benchmark too:
library(microbenchmark)
url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
file <- "/tmp/poll-responses-clean.tsv"
download.file(url, destfile=file, quiet=TRUE)
res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
                      dt=data.table::fread(file, showProgress=FALSE),
                      base=read.delim(file))
res
## Unit: milliseconds
##  expr       min        lq      mean    median        uq        max neval
##  tidy 13.877036 15.127885 18.549393 15.861311 17.813541 202.389391   100
##    dt  4.084022  4.505943  5.152799  4.845193  5.652579   7.736563   100
##  base 29.029366 30.437742 32.518009 31.449916 33.600937  45.104599   100
Base R is clearly the slowest option for the reading phase. Or, one might say, both readr and data.table have done a great job in improving things! Let’s take a look at the processing part now:
tvin <- suppressMessages(readr::read_tsv(file))
dtin <- data.table::fread(file, showProgress=FALSE)
bsin <- read.delim(file)
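As a side note, part of readr’s extra reading time goes into type inference: it hands back end_date already parsed as a Date, whereas the other two readers leave it as text, hence the as.Date/as.IDate conversions in the wrangling code. A quick way to check, using the objects just read in:
sapply(list(tidy = tvin, dt = dtin, base = bsin),
       function(df) class(df$end_date)[1])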
library(tidyverse)
library(data.table)
library(zoo)
transformTV <- function(polls_2016) {
  polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
    right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")),
               by="end_date")
  rolling_average <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
    mutate(Clinton.Margin = Clinton - Trump,
           Clinton.Avg = rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right"))
}
transformDT <- function(dtin) {
  pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
  pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsDT[, end_date := as.IDate(end_date)]
  pollsDT <- pollsDT[data.table(end_date = seq(min(pollsDT[,end_date]),
                                               max(pollsDT[,end_date]), by="days")), on="end_date"]
  pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
  pollsDT[, Clinton.Margin := Clinton-Trump]
  pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                     FUN=function(x){mean(x, na.rm=TRUE)},
                                     by=1, partial=TRUE, fill=NA, align="right")]
}
transformBS <- function(pollsB) {
  pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsB$end_date <- base::as.Date(pollsB$end_date)
  endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
  pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)
  pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
  pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
  pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right")
}
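The copy() at the top of transformDT is defensive: data.table’s := operator modifies a table in place, through the reference, so a function can mutate its argument. Without the copy, repeated benchmark iterations could operate on already-modified data. A toy illustration of the reference semantics:
dt <- data.table(x = 1)
f <- function(d) d[, x := x + 1]
f(dt)
dt$x  ## 2: the original object was modified by reference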
res <- microbenchmark(tidy=transformTV(tvin),
                      dt=transformDT(dtin),
                      base=transformBS(bsin))
res
## Unit: milliseconds
##  expr      min       lq     mean   median       uq       max neval
##  tidy 20.68435 22.58603 26.67459 24.56170 27.85844  84.55077   100
##    dt 17.25547 18.88340 21.43256 20.24450 22.26448  41.65252   100
##  base 28.39796 30.93722 34.94262 32.97987 34.98222 109.14005   100
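The summary table hides the shape of the timing distributions; microbenchmark results can also be inspected graphically with the plotting methods that ship with the package:
boxplot(res)            ## base-graphics boxplots of the timings
ggplot2::autoplot(res)  ## the ggplot2 counterpart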
I don’t see as much difference between the tidyverse and data.table as Dirk showed, perhaps because I’ve simplified the script a bit and removed some redundant parts. Again, base R is the slowest option, but don’t set it aside: it is the shortest one, and it is always there, out of the box!
Update (2018-01-25): Use read.delim(file) instead of read.csv(file, sep="\t"), as @JorisMeys suggested here.
Hello Iñaki,
Good work. I think the base version can be simplified, relying on defaults, as:
polls_2016 <- read.delim("http:…"), as the URL is parsed automatically and read.delim assumes tab as the separator.
Probably using subset is clearer:
polls_2016 <- subset(polls_2016,
                     sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
And, given that as.Date is part of base:
pollsB$end_date <- as.Date(pollsB$end_date)
and using:
pollsB <- within(pollsB, {
  Clinton.Margin <- Clinton - Trump
  Clinton.Avg <- rollapply(Clinton.Margin, width=14,
                           FUN=function(x){mean(x, na.rm=TRUE)},
                           by=1, partial=TRUE, fill=NA, align="right")
})
is less wordy.
Overall, whichever way one codes this example makes no real difference to the user in terms of performance. Readability can be vastly different, though, for which I'm partial to the tidyverse.
Regards
Thanks, Luis, nice additions. But note that «as.Date» is masked by package zoo. So you really need to specify the namespace if you want to use the one from the base package. ;-)
Forgot about the masking. Thanks!