Regressions on many subsets
Non-overlapping subsets
If a dataset is sorted by a by variable, and you wish to run a separate
and independent regression for each value of the by variable, you want
a "grouped regression". Stata's -statsby- prefix can supply one.
Michael Droste has written -regressby- for fast execution of grouped
regressions, collecting the results in a .dta file.
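A grouped regression with -statsby- might look like the sketch below. The by variable g and the output file coefs.dta are hypothetical names chosen for illustration; y and x are assumed to be in memory already.

```stata
* hypothetical names: g is the by variable, coefs.dta the output file
sort g
* one regression of y on x per value of g; coefficients and standard
* errors are written to coefs.dta, one row per group
statsby _b _se, by(g) saving(coefs, replace): regress y x
* look at the collected results (this replaces the data in memory)
use coefs, clear
describe
```

The saved file should hold one observation per value of g, with variables such as _b_x and _b_cons.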
Overlapping subsets
Neither -statsby- nor -regressby- does a rolling regression, in which a
separate regression is run for each observation, based on a fixed number
of prior observations. You might try something like:
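generate byte smpl = 0
forvalues i = 1/`=_N' {
    * flag observation `i' and the 9 observations before it
    quietly replace smpl = inrange(_n, `i'-9, `i')
    * capture: the very first window has too few observations to fit
    capture regress y x if smpl
}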
This does about 60 regressions/second on a dataset with 100,000 observations. There is also a -rolling-
command that does rolling regressions in one line. For example, the following one-liner will run a
separate regression of y on x for each observation in the dataset and save the estimated
coefficients as a replacement for the original data. The data for each regression will include that
observation and the previous 9:
tsset n
rolling _b[_cons] _b[x], window(10) clear : regress y x
With Stata/MP on 8 cores it runs about 40 regressions per second with 1 independent variable. While
running on our system it kept 6 or 7 cores busy for the entire run. Note that the time per
regression is nearly proportional to the number of observations in the dataset, so the typical
rolling regression over (say) postwar quarterly data would be much faster. This page uses a rolling
regression only because that is a familiar example. The typical user concerned about speed would
likely have a different procedure in mind.
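Throughput figures like those quoted here can be measured with Stata's -timer- command. The following is only a sketch of one way to do it, assuming y, x and the index n are in memory; the local macro nobs is a name introduced for illustration.

```stata
* assumes y, x and the time index n are already in memory
tsset n
local nobs = _N        // save the count before -rolling, clear- replaces the data
timer clear 1
timer on 1
quietly rolling _b[_cons] _b[x], window(10) clear : regress y x
timer off 1
quietly timer list 1   // leaves the elapsed seconds in r(t1)
* a window of 10 over nobs observations means nobs - 9 regressions
display "regressions per second: " %9.1f (`nobs' - 9)/r(t1)
```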
An alternative command from SSC is -rangerun- which is more flexible than rolling (it gives you
more freedom to specify the exact nature of the subset) and somewhat faster. The command sequence:
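program myprog
    quietly {
        regress y x
        * store the coefficients as variables for rangerun to collect
        gen b_cons = _b[_cons]
        gen b_x = _b[x]
    }
end
* the current observation plus the 9 before it, as with window(10)
rangerun myprog, interval(n -9 0)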
is the near equivalent of the -rolling- command above, but runs 4 times as
fast, even though it uses only one CPU. The difference is that it reports
limited results for the first 9 observations while -rolling- only reports results
where it has a full window available.
The primary advantage of -rangerun-, however, is the ability to specify
subsets other than a rolling window.
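As one illustration of a subset that is not a trailing window, the interval can be centered on each observation. This is a sketch only: the program name mycentered and the variable bc_x are hypothetical, and it assumes -rangerun- is installed from SSC with y, x and n in memory.

```stata
* hypothetical program: slope from a window centered on each observation
program mycentered
    quietly {
        regress y x
        gen bc_x = _b[x]
    }
end
* 5 observations before through 5 observations after the current one
rangerun mycentered, interval(n -5 5)
```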
All of these times are 100-1,000 times longer than a single regression on the full dataset. Most
of the additional time is spent selecting the observations to be included in each regression. In a
general purpose language that selection would not require examining all 100,000 observations, and a
great deal of time could be saved in the creation of X'X for each regression, as 90% of the work of
creating each X'X is repeated for the next regression.
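The reuse of work between adjacent windows can be illustrated within Stata itself for the one-regressor case. Each window's sums of x, y, x*x and x*y (the elements of X'X and X'y) are the difference between two running sums, so no window is summed from scratch. This is a sketch under stated assumptions, not a general replacement for -rolling-: it assumes a single regressor, data sorted by n with no gaps, and no missing values in y or x. All of the new variable names are hypothetical.

```stata
* sketch only: one regressor, sorted by n, no gaps, no missing y or x
sort n
local w = 10
* running sums of the elements of X'X and X'y
gen double cx  = sum(x)
gen double cy  = sum(y)
gen double cxx = sum(x*x)
gen double cxy = sum(x*y)
* window sum = running sum now minus running sum w observations ago
foreach v in x y xx xy {
    gen double s`v' = c`v' - cond(_n > `w', c`v'[_n - `w'], 0)
}
* closed-form OLS slope and intercept, only where the window is full
gen double fast_b_x    = (sxy - sx*sy/`w') / (sxx - sx*sx/`w') if _n >= `w'
gen double fast_b_cons = (sy - fast_b_x*sx) / `w'              if _n >= `w'
```

Differences of long running sums can lose precision, so it would be prudent to compare the results against -rolling- on a sample of windows before relying on them.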
last modified 14 November 2019 by feenberg@nber.org