xtreg, tsls and their ilk are good for one fixed effect, but what if you have more than one? Possibly you can take out means for the largest dimensionality effect and use factor variables for the others. That works until you reach the 11,000 variable limit for a Stata regression. An attractive alternative is -reghdfe- on SSC, which uses an iterative process that can deal with multiple high dimensional fixed effects. A regression with 60,000 and 25,000 categories in two separate fixed effects took 4,900 seconds on a test dataset with 100 million observations (limited to 2 cores), so it is quite practical. See Abowd, Creecy and Kramarz for more information about the statistical properties.
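For concreteness, here is a minimal sketch of the kind of call involved, with made-up variable names (a wage regression absorbing worker and firm effects):

    ssc install reghdfe
    * absorb two high-dimensional fixed effects; cluster by firm
    reghdfe wage tenure age, absorb(worker_id firm_id) vce(cluster firm_id)

The absorbed fixed effects are swept out iteratively rather than estimated as explicit dummy coefficients, which is how the variable limit mentioned above is avoided.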
I recently received a message from Sergio Correia with some information about a recent revision to the -reghdfe- command.
Hi Daniel,

Just wanted to share with you some results about the latest version of the -reghdfe- command. I ran some benchmarks on the last version (v3.0), available at github, and found that it's actually 3-4 times as fast as areg and xtreg even for one fixed effect. As seen in the benchmark do-file (run with Stata 13 on a laptop), on a dataset of 100,000 obs., areg takes 2s, xtreg_fe takes 2.5s, and the new version of reghdfe takes 0.4s. Without clusters, the only difference is that -areg- takes 0.25s, which makes it faster but still in the same ballpark as -reghdfe-. All results are robust to changing the size of the dataset and the number of allowed cores, although for datasets below 1,000 observations the cost of compiling the mata files makes reghdfe slower than the built-in alternatives. Also, since I'm using a different estimation algorithm, the speed improvements when using 2 or more FEs are even larger.

Betas will be consistent except in the most extreme cases, such as having one fixed effect for each observation. Sure, the estimate for the intercept of a particular individual may not be consistent, but most people care about the betas. In fact, -xtreg- doesn't even give you point estimates for the FEs; only -areg- does.

Best,
Sergio

PS: By the way, several of the other pain points mentioned in your Stata guide, such as separate slope coefficients, fast IV with fixed effects, multiway clustering, etc., can also be solved efficiently with reghdfe.
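Sergio's do-file is not reproduced here, but a rough sketch of that sort of timing comparison, on simulated data with a single fixed effect and clustered standard errors, would look something like this:

    clear all
    set obs 100000
    gen id = ceil(_n/100)              // 1,000 groups
    gen x  = rnormal()
    gen y  = 2*x + id/1000 + rnormal()

    timer clear
    timer on 1
    areg y x, absorb(id) vce(cluster id)
    timer off 1

    timer on 2
    xtset id
    xtreg y x, fe vce(cluster id)
    timer off 2

    timer on 3
    reghdfe y x, absorb(id) vce(cluster id)
    timer off 3

    timer list

The exact timings will of course depend on the machine, the Stata version, and the number of cores.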
From sergio.correia@gmail.com Wed Dec 12 18:24:23 2018
Date: Wed, 12 Dec 2018 18:24:03 -0500
From: Sergio Correia
To: Ezra Karger, Daniel Feenberg
Subject: Re: question about installing ftools for reghdfe

Hi Ezra and Daniel,

You are right on your intuition, but the memory usage can be reduced drastically at only a small speed cost. First, more details on what's going on: most of the memory cost is independent of whether you are demeaning wrt. 1, 2, or 10 sets of fixed effects, and mostly has to do with the number of covariates.

First, I'll try to explain what's the point of highest memory usage. What reghdfe does is the following:

1. Before running, you have a dataset in memory that will remain there (so if the dataset is 10gb, that's 10gb used from the get-go).
2. reghdfe loads the data into Mata, which can potentially double the memory usage if all covariates are double. If the covariates are instead stored as byte (which saves memory), the memory increase will be larger b/c Mata only supports double types.
3. reghdfe does "y = partial_out(y)". This operation will, for one instant, have two copies of the data, potentially doubling again the dataset.

Thus, in this worst case scenario you will triple the amount of memory just due to the covariates being stored in Mata. On top of that you have the set of fixed effects, which adds a bit more memory. In your case, the key problem is the covariates, because the main use of reghdfe does not involve having 1000 covariates; if you don't really care about those 1000 covariates you usually end up absorbing them (especially if they are dummies).

Now, there are two options that I use to reduce memory usage drastically: pool(#) and compact. The pool(1) option will partial out the covariates one by one, so the memory usage is 2x instead of 3x. As you increase the number to e.g. pool(5) you increase memory usage a bit but speed increases. Further, the compact option will first -preserve- the data, and then drop observations as they are loaded into Mata. Thus, the 2x memory usage is now closer to 1x memory usage.

Hope this helps!

Best,
Sergio
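In practice, the memory-saving options from that message are simply added to the reghdfe call. A minimal sketch, with hypothetical variable names and a deliberately large set of covariates:

    * partial out covariates one at a time (pool(1)) and preserve/drop
    * the Stata dataset while the data live in Mata (compact)
    reghdfe y x1-x1000, absorb(worker_id firm_id) pool(1) compact

Raising the pool() number trades a little extra memory for speed, and, as Sergio notes, covariates you don't actually care about are better placed in absorb().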