Memory Management in Stata 14
The -help obs_advice- command in Stata 14 suggests using some memory management
commands to speed up very large Stata jobs. I tried:
set min_memory 32G
set segmentsize 2G
with datasets of 20 million to 5 billion observations but found only a modest effect.
Briefly, the more variables, the more improvement was seen in the -generate-
statements, but the effect was never more than 25% with 250 variables, and the effect
on other statements was not noticeable. A place where substantial improvement might
be expected is the -merge- command, but it continued to be orders of magnitude
slower than -use-. With 8 cores available, I rarely saw CPU load above 150% in the
test job. Still, you might try it on your project.
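Before changing these settings it may help to look at what Stata is currently using. The following is a minimal sketch, assuming Stata 12 or later (where -query memory- and the c(min_memory) and c(segmentsize) values are available) and a machine with enough RAM for the 32G and 2G values tried above; the -clear- is there because, as far as I know, segmentsize cannot be changed while data are in memory.

```stata
* Show all current memory settings
query memory

* The same values are available individually (reported in bytes)
display "min_memory:  " c(min_memory)
display "segmentsize: " c(segmentsize)

* Assumption: segmentsize can only be changed with no data in memory
clear
set min_memory 32G
set segmentsize 2G

* Confirm the new settings before starting the large job
query memory
```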
My test program creates two files and then merges them, then does the same thing
after setting the memory management parameters. Here it is:
set rmsg on
set more off
log using test14,text replace
local numobs = 20000000
local numvars = 250
set obs `numobs'
timer on 1
gen recid = _n
foreach num of numlist 1/`numvars' {
gen x`num' = _n
}
timer off 1
timer on 3
save "/tmp/stuffx",replace
timer off 3
drop x1-x`numvars'
timer on 5
foreach num of numlist 1/`numvars' {
gen y`num' = _n
}
timer off 5
timer on 7
save "/tmp/stuffy",replace
timer off 7
clear
timer on 9
use "/tmp/stuffx"
timer off 9
timer on 11
merge 1:1 _n using "/tmp/stuffy"
timer off 11
memory
timer list
clear
set min_memory 32G
set segmentsize 2G
set obs `numobs'
timer on 2
gen recid = _n
foreach num of numlist 1/`numvars' {
gen x`num' = _n
}
timer off 2
timer on 4
save "/tmp/stuffx",replace
timer off 4
drop x1-x`numvars'
timer on 6
foreach num of numlist 1/`numvars' {
gen y`num' = _n
}
timer off 6
timer on 8
save "/tmp/stuffy",replace
timer off 8
clear
timer on 10
use "/tmp/stuffx"
timer off 10
timer on 12
merge 1:1 _n using "/tmp/stuffy"
timer off 12
memory
timer list
local y=`numobs'/1e6
quietly {
noisily di " Times in seconds/million observations"
noisily di " 1 2"
noisily di %9.2f r(t1)/`y' " " %9.2f r(t2)/`y' " Create " `numvars' " variables with " `numobs' " observations"
noisily di %9.2f r(t3)/`y' " " %9.2f r(t4)/`y' " Save data to local drive"
noisily di %9.2f r(t5)/`y' " " %9.2f r(t6)/`y' " Create another " `numvars' " variables "
noisily di %9.2f r(t7)/`y' " " %9.2f r(t8)/`y' " Save again"
noisily di %9.2f r(t9)/`y' " " %9.2f r(t10)/`y' " Use first dataset"
noisily di %9.2f r(t11)/`y' " " %9.2f r(t12)/`y' " Merge two datasets"
noisily di " "
noisily di "Col 1 with default mem and segmentsize"
noisily di "Col 2 with min_memory 32GB and segmentsize 2GB"
}
exit,clear
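Because the two passes above are the same steps typed twice, it is easy for them to drift apart. One possible alternative is to wrap a pass in a small program and call it before and after changing the settings. This is only a sketch: the program name bench_pass, its arguments, the timer numbering (1-6 and 11-16) and the small test sizes are my own choices for illustration, not part of the original test.

```stata
* Hypothetical helper (not part of the original program): one benchmark pass.
* Arguments: observations, variables, and the first of six timer numbers.
capture program drop bench_pass
program define bench_pass
    args numobs numvars base
    clear
    set obs `numobs'

    * Step 1: create recid and x1..xN
    local t = `base'
    timer on `t'
    gen recid = _n
    forvalues i = 1/`numvars' {
        gen x`i' = _n
    }
    timer off `t'

    * Step 2: save the first file
    local ++t
    timer on `t'
    save "/tmp/stuffx", replace
    timer off `t'

    * Step 3: create y1..yN
    drop x1-x`numvars'
    local ++t
    timer on `t'
    forvalues i = 1/`numvars' {
        gen y`i' = _n
    }
    timer off `t'

    * Step 4: save the second file
    local ++t
    timer on `t'
    save "/tmp/stuffy", replace
    timer off `t'

    * Step 5: reload the first file
    clear
    local ++t
    timer on `t'
    use "/tmp/stuffx"
    timer off `t'

    * Step 6: merge by observation number
    local ++t
    timer on `t'
    merge 1:1 _n using "/tmp/stuffy"
    timer off `t'
end

* Illustrative sizes only; the original test used 20 million observations
* and 250 variables.
local numobs  = 1000000
local numvars = 10

timer clear
* Timers 1-6: default settings
bench_pass `numobs' `numvars' 1
clear
set min_memory 32G
set segmentsize 2G
* Timers 11-16: tuned settings
bench_pass `numobs' `numvars' 11

* Report seconds per million observations, default vs. tuned
timer list
local y = `numobs'/1e6
forvalues k = 1/6 {
    local k2 = `k' + 10
    display "step `k': " %9.2f r(t`k')/`y' " " %9.2f r(t`k2')/`y'
}
```

With this layout the memory settings appear in exactly one place, so the values that are run and the values that are reported cannot disagree.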