R - Aggregating duplicate rows by taking sum


Following on from questions about:
1. identifying whether a set of variables uniquely identifies each row of the data or not;
2. tagging the rows that are duplicates in terms of a given set of variables,
I would now like to aggregate/consolidate the duplicate rows in terms of a given set of variables, by taking their sum. (A small sketch of the first two checks is included below for reference.)
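For completeness, here is a minimal base-R sketch of those two checks. The function names (is_unique_index, tag_duplicates) and the df/indexvars arguments are my own illustration, not from the original posts.

# check 1: do the columns in indexvars uniquely identify each row?
is_unique_index <- function(df, indexvars) {
  anyDuplicated(df[, indexvars, drop = FALSE]) == 0
}

# check 2: flag every row whose indexvars combination occurs more than once
tag_duplicates <- function(df, indexvars) {
  idx <- df[, indexvars, drop = FALSE]
  duplicated(idx) | duplicated(idx, fromLast = TRUE)
}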

Solution 1:

There is guidance on how to do this here, but when there is a large number of levels of the variables that form the index, the ddply method recommended there is slow, as it was in my case when trying to tag duplicates by a given set of variables.

library(plyr)

# values of (f1, f2, f3, f4) uniquely identify observations
dfunique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))

# sample some rows and rbind them to create duplicates
dfdup = rbind(dfunique, dfunique[sample(1:nrow(dfunique), 100), ])

# dummy data
dfdup$data = rnorm(nrow(dfdup))

# aggregate duplicate rows by taking the sum
dfdupagg = ddply(dfdup, .(f1, f2, f3, f4), summarise, data = sum(data))

Solution 2:

The second solution uses data.table and, following the advice here, does:

# data.table solution
library(data.table)
indexvars = paste0('f', 1:4)
dtdup = data.table(dfdup, key = indexvars)
dtdupagg = dtdup[, list(data = sum(data)), by = key(dtdup)]

I have a couple of questions:
1. Is there a way to make the ddply version faster?
2. Is the data.table code correct? I want to check since I am new to data.table.

Regarding your data.table solution, you don't need to set a key for aggregation operations. You can directly do:

indexvars = paste0('f', 1:4)
dtdup <- as.data.table(dfdup)   ## faster than data.table(.)
dtdupagg = dtdup[, list(data = sum(data)), by = c(indexvars)]
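As a quick sanity check (my own addition, in the spirit of the all.equal comparison at the end of this answer), the keyed and the unkeyed aggregations give the same sums; only the row order of the result differs:

## keyed aggregation, as in the question's solution 2
dtdup_keyed <- data.table(dfdup, key = indexvars)
keyed <- dtdup_keyed[, list(data = sum(data)), by = key(dtdup_keyed)]

## compare against the unkeyed result after ordering it by the index columns
all.equal(as.data.frame(keyed),
          as.data.frame(dtdupagg[order(f1, f2, f3, f4)]))
# should print [1] TRUE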

data.table version 1.9.2+ implements the function setDT, which enables conversion of data.frames to data.tables by reference (which means there is no copy, and therefore it takes almost no time for the conversion, which is especially useful on large data.frames).

So, instead of doing:

dtdup <- as.data.table(dfdup)
dtdup[...]

you can do:

## data.table v1.9.2+
setDT(dfdup)   ## faster than as.data.table(.)
dfdup[...]     ## dfdup is now a data.table, converted by reference
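If you want to convince yourself that no copy of the columns is made, a rough check (my own illustration, assuming data.table's address() helper is available in your version, and that as.data.table(.) copies, as implied above) is to compare a column's memory address before and after conversion:

library(data.table)
df1 <- data.frame(x = rnorm(1e6))
a_before <- address(df1$x)       # address of the original column

dt_copy <- as.data.table(df1)
address(dt_copy$x) == a_before   # expected FALSE: as.data.table(.) copied the column

df2 <- data.frame(x = rnorm(1e6))
a_before <- address(df2$x)
setDT(df2)                       # convert to data.table by reference
address(df2$x) == a_before       # expected TRUE: same column vector, no copy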

On your first question, plyr is not known for its speed. Check Why is plyr so slow? (and the many informative comments there) for more info.

Perhaps you may be interested in dplyr, which is orders of magnitude faster than plyr, but still slower than data.table, IMHO. Here's the equivalent dplyr version:

library(dplyr)   ## dplyr at the time of this answer used the %.% chaining operator
dfdup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))

Here's a benchmark between data.table and dplyr on the data (all timings are the minimum of three consecutive runs):

## data.table v1.9.2+
system.time(ans1 <- dtdup[, list(data = sum(data)), by = c(indexvars)])
#  user  system elapsed
# 0.049   0.009   0.057

## dplyr (commit ~1360 from github)
system.time(ans2 <- dfdup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
#  user  system elapsed
# 0.374   0.013   0.389

I didn't have the patience to run the plyr version (I stopped it after 93 seconds of the first run). As you can see, dplyr is much faster than plyr, but still ~7x slower than data.table here.


Check if the results are equal to be sure:

all.equal(as.data.frame(ans1[order(f1, f2, f3, f4)]),
          as.data.frame(ans2))
# [1] TRUE

HTH

