simmer 3.7.0

The 3.7.0 release of simmer, the Discrete-Event Simulator for R, is on CRAN. It includes several API improvements and bug fixes. Among the former, the new timeout_from_attribute() activity makes the common task of placing a timeout based on a previously set attribute easier and more efficient. Another common task is to increment or decrement a given attribute. To this end, set_attribute() and the other setters get a new argument, mod, which, if set to "+" or "*", modifies the value accordingly instead of substituting it.
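
A quick sketch of both additions (a hypothetical trajectory, just for illustration):

library(simmer)

# store a service time as an attribute, reuse it as a timeout, and
# increment a counter with the new mod argument
trajectory() %>%
  set_attribute("service_time", 3) %>%
  timeout_from_attribute("service_time") %>%
  set_attribute("n_services", 1, mod="+")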

This release also includes some minor breaking changes. In particular, all deprecations from the v3.6.x series have finally been removed, which should come as no surprise. Besides, get_mon_resources() loses the data argument, which existed for historical reasons and which probably nobody was using.

Finally, there are two additional vignettes:

  • “simmer: Discrete-Event Simulation for R” describes the internal design and the R API, provides some modelling examples and includes a performance evaluation. We are very proud to officially announce that it has been accepted for publication in the Journal of Statistical Software.
  • “Design and Analysis of 5G Scenarios” contains supplementary materials for a paper of the same name that has been accepted for publication in the IEEE Communications Magazine.

See the citation information for further details.

New features:

  • New timeout_from_attribute() activity makes it easier to set a timeout based on an attribute (#129).
  • The activities set_attribute(), set_prioritization(), set_capacity() and set_queue_size() get a new argument, mod, which, if set to "+" or "*", modifies the corresponding value instead of substituting it. This makes it easier to increment, decrement or scale one of these values (#130).
  • New *_selected() versions for the already available resource getters: get_capacity(), get_queue_size(), get_server_count() and get_queue_count() (#134).

Minor changes and fixes:

  • Broadcast signals with higher priority to prevent an arrival from catching its own signal with a trap() after a send() (#135); see the sketch at the end of this list.
  • Generate new arrivals with minimum priority to avoid wrong interactions with simultaneous activities (#136).
  • Remove v3.6.x deprecations: the old attribute retrieval system (see notes for v3.6.3), as well as methods create_trajectory() and onestep() (#117).
  • Remove get_mon_resources()’s data argument. It was there for historical reasons and probably nobody was using it (851d34b).
  • New vignette, “simmer: Discrete-Event Simulation for R”, the paper accepted for publication in the Journal of Statistical Software. Remove the “Terminology” vignette (#127).
  • New vignette, “Design and Analysis of 5G Scenarios”, supplementary materials for a paper accepted for publication in the IEEE Communications Magazine (#137).
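
Regarding the first fix above, here is a minimal, hypothetical sketch of the pattern it addresses:

library(simmer)

# an arrival emits a signal and immediately subscribes to it; signals are
# now broadcast with higher priority, so the arrival does not catch its own
# signal, and wait() blocks until *another* arrival sends "ready"
trajectory() %>%
  send("ready") %>%
  trap("ready") %>%
  wait()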

Documenting R packages: roxygen2 vs. direct Rd input

As the reader may know,

R objects are documented in files written in “R documentation” (Rd) format, a simple markup language much of which closely resembles (La)TeX, which can be processed into a variety of formats, including LaTeX, HTML and plain text.

This LaTeX-like syntax, combined with the fact that the actual R objects live in a separate place, feels burdensome for many developers. As a consequence, there are a handful of tools aimed at improving the documentation process, one of which is roxygen2. We may say that the R community nowadays is divided between those who use roxygen2 and those who don’t.
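
For reference, this is roughly what a minimal Rd file looks like for a hypothetical function add() (a sketch, not taken from any actual package):

\name{add}
\alias{add}
\title{Add Two Numbers}
\usage{
add(x, y)
}
\arguments{
  \item{x, y}{Numeric values to add.}
}
\value{
The sum of \code{x} and \code{y}.
}
\examples{
add(1, 2)
}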

The roxygen2 package allows us to write documentation, as decorated comments, right next to the code being described (a sketch follows the list below). The advantages are the following:

  • Code and documentation are adjacent so when you modify your code, it’s easy to remember that you need to update the documentation.
  • Roxygen2 dynamically inspects the objects that it’s documenting, so it can automatically add data that you’d otherwise have to write by hand.
  • It abstracts over the differences in documenting S3 and S4 methods, generics and classes so you need to learn fewer details.
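
For comparison, the same hypothetical add() above would be documented with roxygen2 right in the .R file:

#' Add Two Numbers
#'
#' @param x,y Numeric values to add.
#' @return The sum of \code{x} and \code{y}.
#' @examples
#' add(1, 2)
#' @export
add <- function(x, y) x + y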

Although both roxygenists and non-roxygenists surely agree that documentation is one of the most important aspects of good code, the alleged benefits of roxygen2 could turn into a disadvantage. In the words of Duncan Murdoch,

This isn’t the fashionable point of view, but I think it is easier to get good documentation [by directly editing Rd files] than using Roxygen. […]

The reason I think this is that good documentation requires work and thought. You need to think about the markup that will get your point across, you need to think about putting together good examples, etc. This is harder in Roxygen than if you are writing Rd files, because Roxygen is a thin front end to produce Rd files from comments in your .R files. To get good stuff in the help page, you need just as much work as in writing the .Rd file directly, but then you need to add another layer on top to put it in a comment. Most people don’t bother.

Basically, roxygen2’s point is that you don’t need to work on the syntax, so you can spend that time writing actual documentation. Duncan’s point, instead, is that, if you don’t put effort into the writing process, there’s a chance that you won’t put in any effort at all. Although I’m a happy roxygen2 user, I can see he has a point, and that there is an interesting analysis to be done here.

In fact, if you happen to have an uncompressed copy of CRAN under, let’s say, ~/cran, you can execute the following script:

## Requires: r-lib/pkgdown, readr
setwd("~/cran")

get_lines <- function(Rd) {
  # render as txt
  txt <- try(capture.output(tools::Rd2txt(Rd)), silent=TRUE)
  if (inherits(txt, "try-error")) # "rcqp" throws an error, why?
    return(c(documentation=NA, examples=NA))
  # remove blank lines
  txt <- txt[!grepl("^[[:space:]]*$", txt)]
  # split documentation and examples
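  # (Rd2txt underlines section titles such as "Examples:" with the
  #  underscore-backspace overstrike, hence the _\b patterns below)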
  examples <- grep("_\bE_\bx_\ba_\bm_\bp_\bl_\be_\bs:", txt)
  if (length(examples)) {
    doc <- txt[1:(examples-1)]
    exm <- txt[(examples+1):length(txt)]
  } else {
    doc <- txt
    exm <- NULL
  }
  # remove titles
  doc <- doc[!grepl("_\b", doc)]
  # output
  c(documentation=length(doc), examples=length(exm))
}

do.call(rbind, parallel::mclapply(Sys.glob("*"), function(pkg) {
  message("Parsing ", pkg, "...")
  rds <- Sys.glob(file.path(pkg, "man", "*.[Rr]d"))
  if (!length(rds))
    df <- data.frame(documentation=0, examples=0, functions=0)
  else {
    # get no. lines for documentation & examples
    df <- as.data.frame(t(rowSums(sapply(rds, get_lines), na.rm=TRUE)))
    # get no. exported functions
    df$functions <- sum(sapply(rds, function(rd) {
      rd <- pkgdown:::rd_file(rd)
      length(pkgdown:::usage_funs(pkgdown:::topic_usage(rd)))
    }))
  }
  # RoxygenNote present?
  desc <- file.path(pkg, "DESCRIPTION")
  df$roxygen <- !is.na(read.dcf(desc, fields="RoxygenNote")[[1]])
  df$pkg <- pkg
  df
}, mc.cores=parallel::detectCores())) -> docLines

readr::write_csv(docLines, "docLines.csv")

to get this data frame. For each package on CRAN, we extract the number of lines of documentation and examples under the man directory, as rendered by tools::Rd2txt. We also count how many functions are documented, and we scan the DESCRIPTION file for the RoxygenNote field, which tells us which packages use roxygen2. This is all we need to see what we were looking for:

library(ggplot2)
library(dplyr)
library(tidyr)

docLines <- read.csv("docLines.csv") %>%
  filter(functions > 0) %>%
  gather("type", "lines", documentation, examples)

ggplot(docLines, aes(lines/functions, color=roxygen, fill=roxygen)) + theme_bw() + 
  geom_density(alpha=.3) + facet_wrap(~type) + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 221 rows containing non-finite values (stat_density).

Limitations:

  • This talks about quantity, but not about quality.
  • The method of extraction of documentation and examples is very coarse. For sure there are better ways.
  • The amount of documentation must be weighted in some way. Just dividing it by the number of exported functions and methods may not be the best way.
  • roxygen2 appeared in 2011, but I think it became more popular in recent years. It may be interesting to restrict the analysis to recent packages.
  • Some developers prioritise vignettes over examples. It may be another interesting factor to analyse.

But all in all, I believe that this simple analysis proves Duncan right to some extent. And as a roxygen2 user that very much cares about documentation, this warns me against my own biases. If you care too, make sure that you really take advantage of the time you save with roxygen2.

Tidyverse and data.table, sitting side by side… and then base R walks in

Of course, I’m paraphrasing Dirk’s fifteenth post in the rarely rational R rambling series: #15: Tidyverse and data.table, sitting side by side … (Part 1). I very much liked it, because, although I’m a happy tidyverse user, I’m always trying not to be tied into that verse too much by replicating certain tasks with other tools (and languages) as an exercise. In this article, I’m going to repeat Dirk’s exercise in base R.

First of all, I would like to clean up the tidyverse version a little, because the original was distributed in chunks and was a little bit too verbose. We can also avoid using lubridate, because readr already parses the end_date column as a date (and that’s why it is significantly slower, among other reasons). This is how I would do it:

## Getting the polls

library(tidyverse)
library(zoo)

polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls

polls_2016 <- polls_2016 %>%
  filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
  right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")

## Average the polls

rolling_average <- polls_2016 %>%
  group_by(end_date) %>%
  summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
  mutate(Clinton.Margin = Clinton-Trump,
         Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right"))

ggplot(rolling_average) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

which, by the way, has exactly the same number of lines of code as the data.table version:

## Getting the polls

library(data.table)
library(zoo)
library(ggplot2)

pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls

pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                              max(pollsDT[,end_date]), by="days")), on="end_date"]

## Average the polls

pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

ggplot(pollsDT) +
  geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
  geom_point(aes(x=end_date, y=Clinton.Margin))

Let’s translate this into base R. It is easier to start from the data.table version, mainly because filtering and assigning have a similar look and feel. Unsurprisingly, we have base::merge for the merge operation and stats::aggregate for the aggregation phase. base::as.Date works just fine for these dates, and utils::read.csv has the only drawback that you have to specify the separator; utils::read.delim recognises the right separator by default. Without further ado, this is my version in base R:

## Getting the polls

library(zoo)

pollsB <- read.delim(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls

pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsB$end_date <- base::as.Date(pollsB$end_date)
endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)

## Average the polls

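# na.action=na.pass keeps the days without polls (all-NA after the merge),
# so that the rolling average below can span the full date sequence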
pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                FUN=function(x){mean(x, na.rm=TRUE)},
                                by=1, partial=TRUE, fill=NA, align="right")

plot(pollsB$end_date, pollsB$Clinton.Margin, pch=16)
lines(pollsB$end_date, pollsB$Clinton.Avg, col="blue", lwd=2)

which is the shortest one! Finally, let’s repeat the benchmark too:

library(microbenchmark)

url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
file <- "/tmp/poll-responses-clean.tsv"
download.file(url, destfile=file, quiet=TRUE)
res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
                      dt=data.table::fread(file, showProgress=FALSE),
                      base=read.delim(file))
res
## Unit: milliseconds
##  expr       min        lq      mean    median        uq        max neval
##  tidy 13.877036 15.127885 18.549393 15.861311 17.813541 202.389391   100
##    dt  4.084022  4.505943  5.152799  4.845193  5.652579   7.736563   100
##  base 29.029366 30.437742 32.518009 31.449916 33.600937  45.104599   100

Base R is clearly the slowest option for the reading phase. Or, one might say, both readr and data.table have done a great job in improving things! Let’s take a look at the processing part now:

tvin <- suppressMessages(readr::read_tsv(file))
dtin <- data.table::fread(file, showProgress=FALSE)
bsin <- read.delim(file)

library(tidyverse)
library(data.table)
library(zoo)

transformTV <- function(polls_2016) {
  polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters")) %>%
    right_join(data.frame(end_date = seq.Date(min(.$end_date), max(.$end_date), by="days")), by="end_date")
  
  rolling_average <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton), Trump = mean(Trump)) %>%
    mutate(Clinton.Margin = Clinton-Trump,
           Clinton.Avg =  rollapply(Clinton.Margin,width=14,
                                    FUN=function(x){mean(x, na.rm=TRUE)},
                                    by=1, partial=TRUE, fill=NA, align="right"))
}

transformDT <- function(dtin) {
  pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
  pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsDT[, end_date := as.IDate(end_date)]
  pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                                max(pollsDT[,end_date]), by="days")), on="end_date"]
  
  pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
  pollsDT[, Clinton.Margin := Clinton-Trump]
  pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                     FUN=function(x){mean(x, na.rm=TRUE)},
                                     by=1, partial=TRUE, fill=NA, align="right")]
}

transformBS <- function(pollsB) {
  pollsB <- pollsB[pollsB$sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
  pollsB$end_date <- base::as.Date(pollsB$end_date)
  endDate <- data.frame(end_date = seq.Date(min(pollsB$end_date), max(pollsB$end_date), by="days"))
  pollsB <- merge(pollsB, endDate, by="end_date", all=TRUE)
  
  pollsB <- aggregate(cbind(Clinton, Trump) ~ end_date, data=pollsB, mean, na.action=na.pass)
  pollsB$Clinton.Margin <- pollsB$Clinton - pollsB$Trump
  pollsB$Clinton.Avg <- rollapply(pollsB$Clinton.Margin, width=14,
                                  FUN=function(x){mean(x, na.rm=TRUE)},
                                  by=1, partial=TRUE, fill=NA, align="right")
}

res <- microbenchmark(tidy=transformTV(tvin),
                      dt=transformDT(dtin),
                      base=transformBS(bsin))
res
## Unit: milliseconds
##  expr      min       lq     mean   median       uq       max neval
##  tidy 20.68435 22.58603 26.67459 24.56170 27.85844  84.55077   100
##    dt 17.25547 18.88340 21.43256 20.24450 22.26448  41.65252   100
##  base 28.39796 30.93722 34.94262 32.97987 34.98222 109.14005   100

I don’t see as much difference between the tidyverse and data.table as Dirk did, perhaps because I’ve simplified the script a bit and removed some redundant parts. Again, base R is the slowest option, but don’t set it aside: it is the shortest one, and it is always there, out of the box!

Update (2018-01-25): Use read.delim(file) instead of read.csv(file, sep="\t") as @JorisMeys suggested here.

simmer.bricks 0.1.0: new add-on for simmer

The new package simmer.bricks has found its way to CRAN. The simmer package provides a rich and flexible API to build discrete-event simulations. However, there are certain recurring patterns that are typed over and over again: higher-level tasks that can be conceptualised as concrete sequences of activities. This new package is intended to capture such patterns as usable bricks, i.e., methods that can be used as simmer activities but return an arrangement of activities implementing a higher-level task.

For instance, consider an entity visiting a resource:

library(simmer)

trajectory("customer") %>%
  seize("clerk") %>%
  timeout(10) %>%
  release("clerk")
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

The simmer.bricks package wraps this pattern into the visit() brick:

library(simmer.bricks)

trajectory("customer") %>%
  visit("clerk", 10)
## trajectory: customer, 3 activities
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Timeout      | delay: 10 }
## { Activity: Release      | resource: clerk, amount: 1 }

This is a very naive example though. As a more compelling use case, consider a resource that becomes inoperative for some time after each release; i.e., the clerk above needs to do some paperwork after each customer leaves. There are several ways of programming this with simmer. The most compact implementation requires a clone() activity to let a clone hold the resource for some more time while the original entity continues its way. This package encapsulates all this logic in a very easy-to-use brick called delayed_release():

env <- simmer()

customer <- trajectory("customer") %>%
  log_("waiting") %>%
  seize("clerk") %>%
  log_("being attended") %>%
  timeout(10) %>%
  # inoperative for 5 units of time
  delayed_release(env, "clerk", 5) %>%
  log_("leaving")

env %>%
  add_resource("clerk") %>%
  add_generator("customer", customer, at(0, 1)) %>%
  run() %>% invisible
## 0: customer0: waiting
## 0: customer0: being attended
## 1: customer1: waiting
## 10: customer0: leaving
## 15: customer1: being attended
## 25: customer1: leaving

The reference index lists all the available bricks included in this initial release. The examples included in the help page for each method show the equivalence in plain activities. This is very important if you want to mix bricks with rollbacks to produce loops, since the rollback() activity works in terms of the number of activities (see the sketch after the listing below). For instance, this is what delayed_release() does behind the scenes:

customer
## trajectory: customer, 11 activities
## { Activity: Log          | message }
## { Activity: Seize        | resource: clerk, amount: 1 }
## { Activity: Log          | message }
## { Activity: Timeout      | delay: 10 }
## { Activity: Clone        | n: 2 }
##   Fork 1, continue,  trajectory: anonymous, 2 activities
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c5b524c0 }
##   { Activity: Release      | resource: clerk, amount: 1 }
##   Fork 2, continue,  trajectory: anonymous, 2 activities
##   { Activity: Timeout      | delay: 5 }
##   { Activity: SetCapacity  | resource: clerk, value: 0x55a7c59ddc18 }
## { Activity: Synchronize  | wait: 0 }
## { Activity: Log          | message }
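
As a minimal, hypothetical sketch of such a loop, consider the visit() brick shown earlier: since it expands to three activities, the rollback must go back three activities, not one:

library(simmer.bricks)

# visit() expands to seize + timeout + release, so looping over the whole
# brick requires amount=3
trajectory("customer") %>%
  visit("clerk", 10) %>%
  rollback(amount=3, times=2)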

As always, we are more than happy to receive feedback and suggestions, either via the mailing list or via GitHub issues and PRs. If you identify any pattern that you frequently use in your simulations and you think it could become a useful simmer brick, please don’t hesitate to share it!

simmer 3.6.5

The fifth update of the 3.6.x release of simmer, the Discrete-Event Simulator for R, is on CRAN. This release extends the attributes API by allowing users to set/get multiple attributes at once (a change as straightforward as it is useful; I don’t know why it didn’t occur to me before…). Vectors as attributes and other data types are not supported yet, but they are on the roadmap.
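
As a quick, hypothetical sketch of the extended API:

library(simmer)

env <- simmer()

patient <- trajectory() %>%
  # set two attributes at once...
  set_attribute(c("weight", "height"), c(80, 1.8)) %>%
  # ...and retrieve both keys in a single call
  timeout(function() {
    attrs <- get_attribute(env, c("weight", "height"))
    attrs[1] / attrs[2]^2  # e.g., a BMI-based delay
  })

env %>%
  add_generator("patient", patient, at(0), mon=2) %>%
  run() %>% invisible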

This version also fixes some minor bugs (many thanks to the users of the simmer-devel mailing list for taking their simulations to edge cases, where these bugs arise), deprecates the onestep() function and provides the new stepn() instead. Since onestep() serves primarily for debugging purposes, the transition to the new one may go unnoticed. Finally, there is a new vignette about the Dining Philosophers Problem.

New features:

  • set_attribute() (and set_global() by extension) can set multiple attributes at once by providing vectors of keys and values (or functions returning such keys and/or values). get_attribute() (and get_global() by extension) can retrieve multiple keys (#122).
  • New stepn() method deprecates onestep() (e452975).

Minor changes and fixes:

  • Restore ostream after formatting (9ff11f8).
  • Fix arrival cloning to copy attributes over to the clone (#118).
  • Fix self-induced preemption through set_capacity() (#125).
  • Update “Queueing Systems” vignette (a0409a0, 8f03f4f).
  • Update “Advanced Trajectory Usage” vignette (4501927).
  • Fix print methods to return the object invisibly (#128).
  • New “Dining Philosophers Problem” vignette (ff6137e).