xtreg, tsls and their ilk are good for one fixed effect, but what if you have more than one? Possibly you can take out means for the largest dimensionality effect and use factor variables for the others. That works until you reach the 11,000 variable limit for a Stata regression. An attractive alternative is -reghdfe- on SSC, which uses an iterative process that can deal with multiple high dimensional fixed effects. A regression with 60,000 and 25,000 categories in two separate fixed effects took 4,900 seconds on a test dataset with 100 million observations (limited to 2 cores), so it is quite practical. See Abowd, Creecy and Kramarz for more information about the statistical properties.
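For concreteness, here is a minimal sketch of the kind of call involved, with made-up variable names (a wage regression absorbing worker and firm effects):

    ssc install reghdfe
    * absorb two high-dimensional fixed effects; cluster by firm
    reghdfe wage tenure age, absorb(worker_id firm_id) vce(cluster firm_id)

The absorbed fixed effects are swept out iteratively rather than estimated as explicit dummy coefficients, which is how the variable limit mentioned above is avoided.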
I recently received a message from Sergio Correia with some information about a recent revision to the -reghdfe- command.
Hi Daniel,

Just wanted to share with you some results about the latest version of the -reghdfe- command. I ran some benchmarks on the last version (v3.0), available at github, and found that it's actually 3-4 times as fast as areg and xtreg even for one fixed effect. As seen in the benchmark do-file (run with Stata 13 on a laptop), on a dataset of 100,000 obs., areg takes 2s, xtreg_fe takes 2.5s, and the new version of reghdfe takes 0.4s. Without clusters, the only difference is that -areg- takes 0.25s, which makes it faster but still in the same ballpark as -reghdfe-. All results are robust to changing the size of the dataset and the number of allowed cores, although for datasets below 1,000 observations the cost of compiling the mata files makes reghdfe slower than the built-in alternatives. Also, since I'm using a different estimation algorithm, the speed improvements when using 2 or more FEs are even larger.

Betas will be consistent except in the most extreme cases, such as having one fixed effect for each observation. Sure, the estimate for the intercept of a particular individual may not be consistent, but most people care about the betas. In fact, -xtreg- doesn't even give you point estimates for the FEs; only -areg- does.

Best,
Sergio

PS: By the way, several of the other pain points mentioned in your Stata guide, such as separate slope coefficients, fast IV with fixed effects, multiway clustering, etc., can also be solved efficiently with reghdfe.
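Sergio's do-file is not reproduced here, but a rough sketch of that sort of timing comparison, on simulated data with a single fixed effect and clustered standard errors, would look something like this:

    clear all
    set obs 100000
    gen id = ceil(_n/100)              // 1,000 groups
    gen x  = rnormal()
    gen y  = 2*x + id/1000 + rnormal()

    timer clear
    timer on 1
    areg y x, absorb(id) vce(cluster id)
    timer off 1

    timer on 2
    xtset id
    xtreg y x, fe vce(cluster id)
    timer off 2

    timer on 3
    reghdfe y x, absorb(id) vce(cluster id)
    timer off 3

    timer list

The exact timings will of course depend on the machine, the Stata version, and the number of cores.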
From sergio.correia@gmail.com Wed Dec 12 18:24:23 2018
Date: Wed, 12 Dec 2018 18:24:03 -0500
From: Sergio Correia
To: Ezra Karger, Daniel Feenberg
Subject: Re: question about installing ftools for reghdfe

Hi Ezra and Daniel,

You are right on your intuition, but the memory usage can be reduced drastically at only a small speed cost. First, more details on what's going on: most of the memory cost is independent of whether you are demeaning wrt. 1, 2, or 10 sets of fixed effects, and mostly has to do with the number of covariates.

First, I'll try to explain what's the point of highest memory usage. What reghdfe does is the following:

1. Before running, you have a dataset in memory that will remain there (so if the dataset is 10gb, that's 10gb used from the get-go).
2. reghdfe loads the data into Mata, which can potentially double the memory usage if all covariates are double. If the covariates are instead stored as byte (which saves memory), the memory increase will be larger b/c Mata only supports double types.
3. reghdfe does "y = partial_out(y)". This operation will, for one instant, have two copies of the data, potentially doubling again the dataset.

Thus, in this worst case scenario you will triple the amount of memory just due to the covariates being stored in Mata. On top of that you have the set of fixed effects, which adds a bit more memory. In your case, the key problem is the covariates, because the main use of reghdfe does not involve having 1000 covariates; if you don't really care about those 1000 covariates you usually end up absorbing them (especially if they are dummies).

Now, there are two options that I use to reduce memory usage drastically: pool(#) and compact. The pool(1) option will partial out the covariates one by one, so the memory usage is 2x instead of 3x. As you increase the number to e.g. pool(5) you increase memory usage a bit but speed increases. Further, the compact option will first -preserve- the data, and then drop observations as they are loaded into Mata. Thus, the 2x memory usage is now closer to 1x memory usage.

Hope this helps!

Best,
Sergio
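In practice, the memory-saving options from that message are simply added to the reghdfe call. A minimal sketch, with hypothetical variable names and a deliberately large set of covariates:

    * partial out covariates one at a time (pool(1)) and preserve/drop
    * the Stata dataset while the data live in Mata (compact)
    reghdfe y x1-x1000, absorb(worker_id firm_id) pool(1) compact

Raising the pool() number trades a little extra memory for speed, and, as Sergio notes, covariates you don't actually care about are better placed in absorb().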