Efficient sample selection
With some thought, and some lesser-known options to the -merge- command,
it should be possible to dispense with the SAS step for nearly all work, as
long as the final analysis dataset will fit in core. With memory
allocations of up to 40 GB now readily available, that would cover 10
million observations on 1,000 float (4-byte) variables in the analysis
subsample, with no limit on the size of the full file from which the
sample is drawn.
In particular, there is no need for the sample selection and analysis
variables to be in core at the same time.
Row and column selection with -merge-
In order to minimize the amount of memory used, we need to separate the
sample selection decision from the variable inclusion decision. We assume
that the initial ("full") .dta file will not fit in core, but has been
created by Stat/Transfer or on a larger computer, or that the data are
held in a fixed-format text file.
We also assume a record id is present and that the data are sorted by that
value. To get such a record id we can record the value of _n when the full
file is created. The procedure is as follows:
- List only the variables needed for sample selection in the -use- command,
with the -using- keyword. Use the -if- qualifier to subset records, to the
extent the selection does not involve cross-observation restrictions.
use varlist1... if in_sample_condition using filename
- If the data are read via a Stata dictionary, list only the variables
necessary for sample selection in the dictionary, and use the -if-
qualifier on the -infile- command.
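A minimal sketch of that dictionary approach; the file names and variables here are hypothetical, and only the record id and the selection variables are declared:

```stata
* mydict.dct -- declare only recno and the selection variables
infile dictionary using rawdata.txt {
    double recno
    float  x1
    float  x2
}
```

The raw file would then be read with something like -infile using mydict.dct if mod(x1,20)==0-, so non-qualifying records never occupy memory.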
- Apply any cross-observation qualifications to identify the subsample required
for analysis.
- Drop observations not required for analysis.
keep if condition
- -keep- only the record id.
keep recno
- Do a 1:1 merge back to the full dataset, with the -keep- and -keepusing-
options to subset by record and variable at the same time.
sort recno
merge 1:1 recno using filename, nogenerate keep(master match) ///
    sorted keepusing(varlist2...)
- Save the analysis dataset.
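The cross-observation step above is the only one with no syntax shown. A hypothetical sketch, assuming the records group into patients by a patid variable and we keep every record for any patient with at least one qualifying record:

```stata
* Flag qualifying records (condition and variables hypothetical),
* then keep all records for any patient with at least one flag.
gen byte flag = (diag == "410")
bysort patid: egen byte anyflag = max(flag)
keep if anyflag
drop flag anyflag
```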
The full dataset needs to be in recno order for the -merge- to
work. See the benchmarking script below for a complete example. Note that
if we use 10% of the variables for sample selection and select 5% of the
records, then the first step holds only 0.5% of the original file in
memory (unless cross-observation restrictions force more records in). If
we then need 10% of the variables for the analysis, the merge step holds a
similar amount.
We have done some initial experiments, and found that with 100 variables
in the full dataset, 10 variables in sample selection, and 10 different
variables in analysis subsample our Linux PC can convert the full file to
the analysis file at a rate of about 1 million full sample records per
second. Doubling the number of variables used in both steps adds about 10%
to that time. Nearly all the time is spent in the merge, which appears to
be inefficient compared with the other Stata steps. There is no noticeable
improvement with four-core Stata/MP. Increasing -segmentsize- to 1g helps
a little.
Merging files
If datafiles are provided in sort order by record id on the disk, it
should not be necessary to sort them before merging if the -sorted- option is
specified on the merge statement. Actually, the master (in-core) dataset
still needs to be sorted before the merge, but -sort- is very fast on data
that are already in order.
Limitations
Unhappily, the -append- and -save- statements do not allow -if- or -in-
qualifiers, or variable selection. So while you might hope to combine
multiple files with:
forvalues year = 1991/2010 {
    append varlist... using "/`year'/medpar" if ...
}
the -append- statement doesn't allow varlists, -if-, or -in-. It does
allow a -keep(varlist)- option. If that isn't sufficient, you would use
something like this:
forvalues year = 1991/2010 {
    clear
    use varlist... using "/`year'/medpar" if ...
    save "/tmp/`year'"
}
clear
forvalues year = 1991/2010 {
    append using "/tmp/`year'"
}
This is a nuisance, but it isn't too inefficient if the extracted sample is
only a small percentage of the total.
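If column selection alone is enough, the -keep()- option lets the one-pass loop work after all; a sketch with hypothetical paths and variable names:

```stata
* keep() restricts which variables are appended from each file
clear
forvalues year = 1991/2010 {
    append using "/data/`year'/medpar", keep(recno x1 x2)
}
```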
Nick Cox on sample selection
Date: Tue, 16 Oct 2012 01:55:54 +0100
From: Nick Cox
Reply-To: statalist@hsphsun2.harvard.edu
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Unable to use "replace" on specific values (but others work ok)
A totally orthogonal comment is that (e.g.)
... if _n == 1
is a very inefficient alternative to (e.g.)
... in 1
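That is, -in- addresses observations by position, while -if- must evaluate its expression on every observation in the dataset. For example (variable name hypothetical):

```stata
* Tests the condition on all _N observations:
replace x1 = 0 if _n == 1
* Goes straight to the first observation:
replace x1 = 0 in 1
```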
Here is the benchmark test program for the subsampling merge process.
log using medibench, replace text
set more off
* First we create the "full" dataset (not timed, but slow)
cd /tmp
set memory 1000m
set obs 1000000
gen double recno = _n
forvalues i = 1(1)100 {
    quietly gen x`i' = mod(_n, 1000) + `i'
}
sort recno   /* See note below */
save full, replace
clear
timer on 1
* Read in the sample selection variable(s)
* Here we take a 5% sample of observations
use recno x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 using full if mod(x1,20)==0
* Save the sample
keep recno
save sample, replace
clear
* Merge sample ids and analysis variables.
* Here we have 10 analysis variables.
use sample
merge 1:1 recno using full, nogenerate keep(master match) sorted ///
    keepusing(x91 x92 x93 x94 x95 x96 x97 x98 x99 x100)
* Save analysis file - now 5% of records and 10% of variables.
save analysis, replace
* Reload analysis file
clear
use analysis
timer off 1
timer list
log close
exit, clear