Stata I/O with very large files
While Stata/MP can make estimation very fast, Amdahl's law makes
the single-threaded I/O commands an increasingly large percentage of
runtime. In our work datasets are often tens of gigabytes, and sometimes
hundreds of gigabytes when multiple years of Medicare data are combined.
Use
First, the good news: the -use- statement takes variable
lists and -if- qualifiers, which can dramatically speed up input if only a
fraction of the data is needed. They also reduce memory usage. The varlists
and -if- qualifiers provide an order of magnitude improvement in speed in
typical applications here.
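For example, a -use- restricted in both ways might look like the following sketch (the file and variable names are hypothetical):

* Read only two variables, and only the rows satisfying the -if-
* qualifier; far less data is read and held in memory.
use id diagnosis if diagnosis=="ami" using med2009, clear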
Append
The -append- statement doesn't take the -if- qualifier, though it
does take a varlist. So if the appended files are too big for available
memory, you can't write:
forvalues year=2001/2010 {
    append using med`year' if diagnosis=="ami", keep(varlist)
}
Note that you can -append- to an empty dataset (no need to -use-
the first year and -append- the rest). It is the -if- qualifier that is
the problem. A plausible but I/O intensive solution is:
forvalues year=2001/2010 {
    clear
    use id diagnosis using med`year' if diagnosis=="ami"
    save ami`year', replace
}
clear
forvalues year=2001/2010 {
    append using ami`year', keep(id diagnosis)
}
Merge
In a Statalist posting David
Kantor explained that -merge- reserves memory for the largest
possible set of kept results, even if only a fraction match and are
actually kept. In our case, this is a huge difference. The workaround is
to split the -using- file into pieces, merge them individually, save and
append the results.
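A sketch of that workaround, assuming the large using file has already been split into pieces big1.dta through big10.dta (all file and variable names here are hypothetical):

forvalues piece=1/10 {
    clear
    use master
    merge 1:1 id using big`piece', keep(match) nogenerate
    save matched`piece', replace
}
clear
forvalues piece=1/10 {
    append using matched`piece'
}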
-merge- statements are quite slow compared to -use-. Our fairly
ordinary Linux boxes can read 3.4 million rows of 10 floats per second.
Merging that with a single variable in the workspace runs at only a tenth
of that speed. If only one variable is kept (via -keepusing()-), or only a
tiny percentage of the using rows are kept (-keep(match)-), the speed can
be partially restored to about 1.2 million rows/second. It is possible that
something about the way data is stored internally makes this impossible to
improve, but it is unfortunate.
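For instance, a merge restricted in both ways might be sketched as follows (file and variable names are hypothetical):

* keepusing() limits the variables brought in from the using file;
* keep(match) keeps only matching rows; nogenerate skips _merge.
merge 1:1 id using med2009, keepusing(diagnosis) keep(match) nogenerate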
Suppose there is in core a list of patients with an AMI, and you wish
to merge in the doctor visits of those patients from the annual op
(outpatient) files. You might hope to do something like this:
forvalues year=2002/2010 {
    merge 1:m id using op`year'
}
In addition to the ambiguity about which rows to keep, that can't work
because after the first merge there are duplicate ids in core (from the
multiple doctor visits in the first year). The best workaround I can come
up with is:
forvalues year=2002/2010 {
    clear
    use ami
    merge 1:m id using op`year', keep(match)
    save ami`year', replace
}
clear
forvalues year=2002/2010 {
    append using ami`year'
}
-append- allows multiple files to be concatenated, but -merge- doesn't
allow them to be joined.
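Modern Stata even lets a single -append- statement concatenate several files at once, so the final loop above is not strictly needed (filenames hypothetical):

* One -append- can take a list of filenames.
clear
append using ami2002 ami2003 ami2004 ami2005 ami2006 ami2007 ami2008 ami2009 ami2010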
Save
The -save- command is much more restricted than other I/O commands - no
varlist, -if- or -in- support. So dividing a file into subsets requires
rereading the file for each subset. For example instead of:
forvalues i=1/50 {
    save state`i' if state==`i'
}
prior to Stata 16 we have:
forvalues i=1/50 {
    clear
    use file if state==`i'
    save state`i'
}
In version 16 Frames were introduced, which partially mitigate the
inefficiency:
forvalues i=1/50 {
    frame put _all if state==`i', into(f)
    frame f: save state`i'
    frame drop f
}
but this doesn't avoid scanning the full dataset 50 times, so it is still
slow. The -runby- command may be useful - see this Statalist message.
Other commands
Surprisingly, the commands -infile-, -infix-, -xmlsave-, -export sasxport-,
-fdause-, -fdasave- and the new -export delimited- allow -if-, -in- and
a variable list, while -insheet-, -outsheet-, -xmluse-, -import sasxport-
and -save- don't allow any of those (the -varlist- option in -infile- does
something different). The new -import delimited- replacement for -insheet-
has -rowrange- (which acts like a more limited -in-) and -colrange- (which
acts like a varlist), which can improve throughput if only a fraction of
the data are read.
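For example, to read only the first five columns of the first million rows of a hypothetical CSV file:

* rowrange() and colrange() restrict what -import delimited- reads.
import delimited using big.csv, rowrange(1:1000000) colrange(1:5) clear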
I should note that the -in- qualifier on the -use- statement can be a
bit of a disappointment in Stata 15. That is:
use med2009 in 1/100
doesn't stop reading at record 100. Instead it reads all 143 million
records, but doesn't add any past 100 to the in-core dataset. Apparently
this is a consequence of value labels and long strings being stored at the
end of the data file. See this example and this result. The -saveold-
command is a way to avoid this retrogression if you control the creation
of the file.
In version 16 this is fixed, so it is feasible to read a large file in chunks.
See
for suggestions.
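With that fix, chunked reading can be sketched as follows (the filename, chunk size, and processing step are hypothetical):

describe using med2009          // stores the observation count in r(N)
local N = r(N)
local chunk = 10000000
local first = 1
while `first' <= `N' {
    local last = min(`first' + `chunk' - 1, `N')
    clear
    use med2009 in `first'/`last'
    * ... process this chunk ...
    save part`first', replace
    local first = `last' + 1
}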
Daniel Feenberg
NBER
Last update 26 December 2021