New Timings for a Grouped In-Place Aggregation Task

John Mount

6 years ago

This article was first published on python – Win-Vector Blog , and kindly contributed to python-bloggers. (You can report issue about the content on this page here)
Want to share your content on python-bloggers? click here.

I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

Roughly, the task was to add in some derived per-group aggregation columns to a few million row data set. In the application the groups tend to be small session logs from many users. So the groups are numerous and small.

We can create an abstract version of such data in R as follows.

set.seed(2020)
n <- 1000000

mk_data <- function(n) {
d <- data.frame(x = rnorm(n))
d$g <- sprintf("level_%09g",
sample.int(n, size = n, replace = TRUE))
return(d)
}

d <- mk_data(n)

The sampling with replacement has an expected number of unique IDs in the ballpark of n/log(n) via the coupon collector’s problem. So we expect lots of small groups in such data.

Our task can be specified in rquery/rqdatatable notation as follows.

library(rqdatatable)

ops_rqdatatable <- local_td(d, name = 'd') %.>%
extend(.,
rn %:=% row_number(),
cs %:=% cumsum(x),
partitionby = 'g',
orderby = 'x') %.>%
order_rows(.,
c('g', 'x'))

The key step is the extend(), which adds the new columns rn and cs in a per-g group manner in a by-x order. We feel the notation is learnable and expressive. (Note: normally we would use := for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used %:=% to stay out of data.table’s way.)

We translated the same task in to several different notations: data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.

Mean task run times in second (smaller is better)
Method	Interface Language	Data Engine	Mean run time in seconds
rqdatatable	R	data.table	3.8
data.table	R	data.table	2.1
dplyr	R	dplyr	35.1
dtplyr	R	data.table	5.1
data_algebra	Python	Pandas	17.1

What is missing is a direct Pandas timing (to confirm if the length of the Python run-time is from data_algebra overhead or from the underlying Pandas engine).

What stands out is how fast data.table, and even the data.table based methods, are compared to all other methods.

Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.

To leave a comment for the author, please follow the link and comment on their blog: python – Win-Vector Blog .

Want to share your content on python-bloggers? click here.