Efficient sample selection
With some thought, and some lesser-known options to the -merge- command,
it should be possible to dispense with the SAS step for nearly all work, as
long as the final analysis dataset will fit in core. With memory
allocations of up to 40 GB now readily available, that would cover 10
million observations on 1,000 float (4-byte) variables in the analysis
subsample, with no limit on the size of the full file from which the
sample is drawn.
In particular, there is no need for the sample selection and analysis
variables to be in core at the same time.
Row and column selection with -merge-
In order to minimize the amount of memory used, we need to separate the
sample selection decision from the variable inclusion decision. We assume
that the initial ("full") .dta file will not fit in core, but has been
created by Stat/Transfer or on a larger computer, or that the data are
held in a fixed-format text file.
We also assume a record id is present and that the data are sorted by that
value. To get such a record id we can record the value of _n when the full
file is created. The procedure is as follows:
- List only the variables needed for sample selection in the -use- command,
with the -using- keyword. Use the -if- qualifier to subset records, to the
extent the selection does not involve cross-observation restrictions.
use varlist1... if in_sample_condition using filename
- If the data are read via a Stata dictionary, list only the variables
necessary for sample selection in the dictionary, and use the -if-
qualifier on the -infile- command.
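A minimal sketch of that dictionary approach; the file names and variables here are hypothetical, and only the record id and the selection variables are declared:

```stata
* mydict.dct -- declare only recno and the selection variables
infile dictionary using rawdata.txt {
    double recno
    float  x1
    float  x2
}
```

The raw file would then be read with something like -infile using mydict.dct if mod(x1,20)==0-, so non-qualifying records never occupy memory.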
- Apply any cross-observation qualifications to identify the subsample required
for analysis.
- Drop observations not required for analysis.
keep if condition
- -keep- only the record id.
keep recno
- Do a 1:1 merge back to the full dataset, with the -keep- and -keepusing-
options to subset by record and variable at the same time.
sort recno
merge 1:1 recno using filename, nogenerate keep(master match) ///
    sorted keepusing(varlist2...)
- Save the analysis dataset.
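The cross-observation step above is the only one with no syntax shown. A hypothetical sketch, assuming the records group into patients by a patid variable and we keep every record for any patient with at least one qualifying record:

```stata
* Flag qualifying records (condition and variables hypothetical),
* then keep all records for any patient with at least one flag.
gen byte flag = (diag == "410")
bysort patid: egen byte anyflag = max(flag)
keep if anyflag
drop flag anyflag
```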
The full dataset needs to be in recno order for the -merge- to
work. See the benchmarking script below for a complete example. Note that
if we use 10% of the variables for sample selection and select 5% of the
records, then the first step holds only 0.5% of the original file in
memory (unless cross-observation restrictions force more records in). If
we then need 10% of the variables for the analysis, the merge step holds a
similar amount.
We have done some initial experiments, and found that with 100 variables
in the full dataset, 10 variables in sample selection, and 10 different
variables in analysis subsample our Linux PC can convert the full file to
the analysis file at a rate of about 1 million full sample records per
second. Doubling the number of variables used in both steps adds about 10%
to that time. Nearly all the time is spent in the merge, which appears to
be inefficient compared with the other Stata steps. There is no noticeable
improvement with four-core Stata/MP. Increasing -segmentsize- to 1g helps
a little.
Merging files
If datafiles are provided in sort order by record id on the disk, it
should not be necessary to sort them before merging if the -sorted- option is
specified on the merge statement. Actually, the master (in-core) dataset
still needs to be sorted before the merge, but -sort- is very fast on data
that are already in order.
Limitations
Unhappily, the -append- and -save- statements do not allow -if- or -in-
qualifiers, or variable selection. So while you might hope to combine
multiple files with:
forvalues year = 1991/2010 {
    append varlist... using "/`year'/medpar" if ...
}
the -append- statement doesn't allow varlists, -if-, or -in-. It does
allow a -keep(varlist)- option. If that isn't sufficient, you would use
something like this:
forvalues year = 1991/2010 {
    clear
    use varlist... using "/`year'/medpar" if ...
    save "/tmp/`year'"
}
clear
forvalues year = 1991/2010 {
    append using "/tmp/`year'"
}
This is a nuisance, but it isn't too inefficient if the extracted sample is
only a small percentage of the total.
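If column selection alone is enough, the -keep()- option lets the one-pass loop work after all; a sketch with hypothetical paths and variable names:

```stata
* keep() restricts which variables are appended from each file
clear
forvalues year = 1991/2010 {
    append using "/data/`year'/medpar", keep(recno x1 x2)
}
```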
Nick Cox on sample selection
Date: Tue, 16 Oct 2012 01:55:54 +0100
From: Nick Cox
Reply-To: statalist@hsphsun2.harvard.edu
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Unable to use "replace" on specific values (but others work ok)
A totally orthogonal comment is that (e.g.)
... if _n == 1
is a very inefficient alternative to (e.g.)
... in 1
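That is, -in- addresses observations by position, while -if- must evaluate its expression on every observation in the dataset. For example (variable name hypothetical):

```stata
* Tests the condition on all _N observations:
replace x1 = 0 if _n == 1
* Goes straight to the first observation:
replace x1 = 0 in 1
```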
Here is the benchmark test program for the subsampling merge process.
log using medibench, replace text
set more off
* First we create the "full" dataset (not timed, but slow)
cd /tmp
set memory 1000m
set obs 1000000
gen double recno = _n
forvalues i = 1(1)100 {
    quietly gen x`i' = mod(_n, 1000) + `i'
}
sort recno   /* See note below */
save full, replace
clear
timer on 1
* Read in the sample selection variable(s)
* Here we take a 5% sample of observations
use recno x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 using full if mod(x1,20)==0
* Save the sample
keep recno
save sample, replace
clear
* Merge sample ids and analysis variables.
* Here we have 10 analysis variables.
use sample
merge 1:1 recno using full, nogenerate keep(master match) sorted ///
    keepusing(x91 x92 x93 x94 x95 x96 x97 x98 x99 x100)
* Save analysis file - now 5% of records and 10% of variables.
save analysis, replace
* Reload analysis file
clear
use analysis
timer off 1
timer list
log close
exit, clear