R - Aggregating duplicate rows by taking sum
I have the following questions:
1. identifying whether a set of variables uniquely identifies each row of the data or not;
2. tagging all rows that are duplicates in terms of a given set of variables;
3. aggregating/consolidating the duplicate rows in terms of a given set of variables, by taking their sum.
(A short base-R sketch of tasks 1 and 2 is shown right after this list.)
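For tasks 1 and 2, a minimal base-R sketch could look like the following (the data.frame df and the is_dup column name are illustrative assumptions, not part of the original question):

keys <- c("f1", "f2", "f3", "f4")

# task 1: TRUE if the combination of key columns uniquely identifies every row
anyDuplicated(df[, keys]) == 0

# task 2: flag every row whose key combination occurs more than once
df$is_dup <- duplicated(df[, keys]) | duplicated(df[, keys], fromLast = TRUE)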
Solution 1:
There is guidance on how to do this here, but when there is a large number of levels of the variables that form the index, the ddply method recommended there is slow, as it was in my case of trying to tag duplicates by a given set of variables.
library(plyr)

# values of (f1, f2, f3, f4) uniquely identify the observations
dfunique = expand.grid(f1 = factor(1:16),
                       f2 = factor(1:41),
                       f3 = factor(1:2),
                       f4 = factor(1:104))

# sample some rows and rbind them to create duplicates
dfdup = rbind(dfunique, dfunique[sample(1:nrow(dfunique), 100), ])

# add a dummy data column
dfdup$data = rnorm(nrow(dfdup))

# aggregate the duplicate rows by taking the sum
dfdupagg = ddply(dfdup, .(f1, f2, f3, f4), summarise, data = sum(data))
Solution 2:
The second solution uses data.table. Following the advice here, I do:
# data.table solution
library(data.table)

indexvars = paste0('f', 1:4, sep = '')
dtdup = data.table(dfdup, key = indexvars)
dtdupagg = dtdup[, list(data = sum(data)), by = key(dtdup)]
I have a couple of questions:
1. Is there a way to make the ddply version faster?
2. Is the data.table code correct? I want to check since I am new to data.table.
Answer:

Regarding the data.table solution, you don't need to set the key for aggregation operations. You can directly do:
indexvars = paste0('f', 1:4, sep = '')
dtdup <- as.data.table(dfdup)   ## faster than data.table(.)
dtdupagg = dtdup[, list(data = sum(data)), by = c(indexvars)]
data.table version 1.9.2+ implements the function setDT, which enables conversion of data.frames to data.tables by reference (meaning there is no copy, so the conversion takes essentially no time, which is useful on large data.frames).
So, instead of doing:

dtdup <- as.data.table(dfdup)
dtdup[...]

you can do:

## data.table v1.9.2+
setDT(dfdup)   ## faster than as.data.table(.)
dfdup[...]     ## dfdup is now a data.table, converted by reference
On your first question, plyr is not known for speed. Check Why is plyr so slow? (and the many informative comments there) for more info.
Perhaps you may be interested in dplyr, which is orders of magnitude faster than plyr, though still slower than data.table, IMHO. Here's the equivalent dplyr version:
dfdup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data))
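Note that in later dplyr releases the %.% operator was deprecated in favour of %>%, so the same pipeline would nowadays look roughly like this (assuming dplyr is attached):

library(dplyr)

dfdup %>%
  group_by(f1, f2, f3, f4) %>%
  summarise(data = sum(data))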
Here's a benchmark between data.table and dplyr on your data (all timings are the minimum of three consecutive runs):
## data.table v1.9.2+
system.time(ans1 <- dtdup[, list(data = sum(data)), by = c(indexvars)])
#    user  system elapsed
#   0.049   0.009   0.057

## dplyr (commit ~1360 from github)
system.time(ans2 <- dfdup %.% group_by(f1, f2, f3, f4) %.% summarise(data = sum(data)))
#    user  system elapsed
#   0.374   0.013   0.389
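The "minimum of 3 consecutive runs" can be reproduced with a small helper along these lines (min_time is an illustrative name, not from the original answer):

min_time <- function(f, reps = 3L) {
  # run f() reps times and keep the smallest elapsed time
  min(replicate(reps, system.time(f())[["elapsed"]]))
}

min_time(function() dtdup[, list(data = sum(data)), by = c(indexvars)])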
I didn't have the patience to run the plyr version (I stopped it after 93 seconds of the first run). As you can see, dplyr is much faster than plyr, but still ~7x slower than data.table here.
Check that the results are equal to be sure:

all.equal(as.data.frame(ans1[order(f1, f2, f3, f4)]), as.data.frame(ans2))
# [1] TRUE
hth