Collapse
-collapse- is not very fast. The author no doubt surmised that even if it would be
used with large datasets, it wouldn't be inside a loop. But sometimes it is, and it
can become the rate-limiting step in a seriously long-running program. It is easily
replaced with faster code, but the total benefit isn't as great as one would hope.
Suppose we have monthly income for 5 million individuals, and wish to aggregate
to annual income. We could write:
collapse (sum) month_inc,by(personid)
and find that -collapse- takes about 0.45 seconds per million observations.
-collapse- is multi-threaded, and three cores were in use for a brief period.
Alternatively, we could do the work "by hand":
by personid: gen annual_inc = sum(month_inc)
by personid: keep if _n==_N
Of course, this assumes the data are already sorted by personid, but that is
commonly the case. The -sum- function in a -generate- statement computes a
running sum, starting over at zero for each new by group, so the last
observation of each group holds the group total. (The -egen- function is
different: it places the group total in every observation.) This takes about
0.016 seconds/million for the first line, which would seem like a big win,
but the -keep- statement takes another 0.22 seconds, so the overall result is
only about twice as fast as -collapse-. -keep- may be doing a lot of data
movement. If -save- ever allowed an -if- qualifier, there could be a
substantial saving.
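The comparison above can be tried on simulated data. The following is only a sketch: the panel size (100,000 persons by 12 months), the seed, and the variable month are assumptions chosen for illustration, and timings will differ by machine and Stata flavor. It times both approaches with -timer- and uses -cf- to confirm that they give the same totals.

```stata
* Simulated panel: 100,000 hypothetical persons x 12 months (sizes are illustrative)
clear
set seed 12345
set obs 100000
gen long personid = _n
expand 12
bysort personid: gen byte month = _n
gen double month_inc = round(5000*runiform())
sort personid month

* Reference result from -collapse-, saved to a temporary file
timer clear
preserve
timer on 1
collapse (sum) annual_inc = month_inc, by(personid)
timer off 1
tempfile ref
save `ref'
restore

* The same aggregation by hand (data are already sorted by personid)
timer on 2
by personid: gen double annual_inc = sum(month_inc)   // running sum within person
by personid: keep if _n == _N                         // last row holds the annual total
timer off 2
timer list

* -cf- exits with an error if the two results differ
keep personid annual_inc
cf _all using `ref'
```

The running sum is stored as a double here because the default float type can lose precision on large totals.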
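Maximum, minimum etc. are easily done with a running value, again keeping the
last observation of each group:
by personid: gen max_inc = month_inc
by personid: replace max_inc = max(month_inc, max_inc[_n-1])   // running maximum
by personid: gen min_inc = month_inc
by personid: replace min_inc = min(month_inc, min_inc[_n-1])   // running minimum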
Quartiles, means, and standard deviations require just a bit more code.
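As a minimal sketch of the "bit more code" for means and standard deviations, the running-sum idea extends to the sum of squares and a count of nonmissing values. The toy data and the helper names s1, s2 and n are illustrative assumptions. The one-pass variance formula used here can lose accuracy when the mean is large relative to the spread, so treat it as an illustration rather than a replacement for -collapse (sd)- in all cases.

```stata
* Toy data (hypothetical values), sorted by personid
clear
input long personid double month_inc
1 10
1 20
1 30
2 5
2 5
end
sort personid

by personid: gen double s1 = sum(month_inc)             // running sum of x
by personid: gen double s2 = sum(month_inc^2)           // running sum of x^2
by personid: gen long   n  = sum(!missing(month_inc))   // running count of nonmissing x
by personid: keep if _n == _N                           // last row holds the group totals

gen double mean_inc = s1/n
gen double sd_inc   = sqrt((s2 - n*mean_inc^2)/(n - 1)) // sample sd; missing when n == 1
drop s1 s2 n
list
```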
Sergio Correia's -fcollapse- command in the -ftools- SSC package apparently
achieves a similar degree of improvement, and keeps the -collapse- syntax.
Mauricio Caceres Bravo has written a C-language (partial) replacement for
-collapse-, which is part of the -gtools- package and may be much faster than
-ftools- but is much slower than Stata collapse.
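For anyone wanting to try the two packages, a sketch of the drop-in usage follows. It assumes both packages install cleanly from SSC on your platform (-gtools- relies on a compiled plugin), and the toy data are hypothetical. Only the command name changes relative to the -collapse- call above.

```stata
* One-time installation from SSC (requires internet access)
ssc install ftools
ssc install gtools

* Toy data (hypothetical values)
clear
input long personid double month_inc
1 10
1 20
1 30
2 5
2 5
end

* Same syntax as -collapse-; each command replaces the data in memory,
* so run one or the other
fcollapse (sum) month_inc, by(personid)
* gcollapse (sum) month_inc, by(personid)
list
```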
Last update 28 October 2019 by drf