Spreading values across a by group without -egen-
Since I wrote the following Sergio Corriea has posted his -fegen- package to
SSC, which promises a similar improvement in speed but retains the egen syntax.
-egen- is convenient for spreading values across a by group, but can be slow for
very large files, or if repeated often. In this demonstration there are 10 million
records in 2 million groups and we wish to place the maximum of variable x for each
group in all the records for that group. The variable is random and sometimes
missing. We can substitute 2 statements for one -egen- command and speed processing
by a factor of 20. The method is quite general, it works for minimum, sum, etc with
slight and obvious modification.
We use
by id: replace x = max(x[_n-1],x)
to get the maximum within the group into the last member of the group. Then we use
by id: gen groupmax = x[_N]
to copy the last (and greatest) value to all the other records as the new variable
"groupmax".
The speed increase in the test run is a factor of twenty over the equivalent -egen-
command. The final -assert- statement shows that the results are equivalent. Here is
the full .do file for the demonstration:
log using spread,text replace
set obs 10000000
set rmsg on
gen id = 5*int(_n/5-.1)
sort id
gen x = runiform()
replace x = . if runiform()>.95
by id: replace x = max(x[_n-1],x)
by id: gen groupmax = x[_N]
by id: egen egenmax = max(x)
assert egenmax == groupmax
last modified October 15, 2016 by drf