Faster Recoding
The typical Stata job involves considerable recoding of variables. There
are many ways to recode a variable, and the choice of method can have a
substantial effect on runtime. Consider the conversion of the FIPS state
code to the IRS state code. Stata includes the -recode- command for just
this task:
recode fips ( 1 = 1) ( 2 = 2) ( 4 = 3) ( 5 = 4) ( 6 = 5)
( 8 = 6) ( 9 = 7) (10 = 8) (11 = 9) (12 = 10) (13 = 11)
(15 = 12) (16 = 13) (17 = 14) (18 = 15) (19 = 16) (20 = 17)
(21 = 18) (22 = 19) (23 = 20) (24 = 21) (25 = 22) (26 = 23)
(27 = 24) (28 = 25) (29 = 26) (30 = 27) (31 = 28) (32 = 29)
(33 = 30) (34 = 31) (35 = 32) (36 = 33) (37 = 34) (38 = 35)
(39 = 36) (40 = 37) (41 = 38) (42 = 39) (44 = 40) (45 = 41)
(46 = 42) (47 = 43) (48 = 44) (49 = 45) (50 = 46) (51 = 47)
(53 = 48) (54 = 49) (55 = 50) (56 = 51), generate(irs);
That command takes about 11 seconds per million observations. Because
the source values are all small positive integers, it is possible to
substitute a matrix lookup, with the FIPS code serving as the column index:
matrix define fips2irs = (1,2,.,3,4,5,.,6,7,8,9,10,11,.,
12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,
33,34,35,36,37,38,39,.,40,41,42,43,44,45,46,47,.,48,49,50,51);
generate irs = fips2irs[1,fips];
which takes only 0.18 seconds per million observations, a speedup of
roughly a factor of 60. A series of 51 replace...if statements
...
generate irs = 1 if fips==1
replace irs = 2 if fips==2
replace irs = 3 if fips==4
...
would fall in between those times, at 1.6 seconds per million observations,
but is maximally flexible.
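One way timings of this kind might be measured is with Stata's -timer- command. The sketch below is not the test behind the figures above: the variable code, the matrix lookup, and the five-value mapping are hypothetical stand-ins chosen to keep the example short, and the times it reports will depend on the machine and the Stata version.

* Hypothetical test data: one million observations with codes 1 to 5
clear
set obs 1000000
set seed 12345
generate byte code = ceil(5*runiform())
* Hypothetical mapping; code 3 has no target value
matrix define lookup = (10, 20, ., 30, 40)
timer clear
* Method 1: matrix lookup, with the code as the column index
timer on 1
generate new1 = lookup[1, code]
timer off 1
* Method 2: a chain of replace...if statements
timer on 2
generate new2 = 10 if code == 1
replace new2 = 20 if code == 2
replace new2 = 30 if code == 4
replace new2 = 40 if code == 5
timer off 2
* Report elapsed seconds for each timer, then confirm both methods agree
timer list
assert new1 == new2

The final -assert- is a cheap safeguard when swapping one recoding method for another: it stops the do-file if the two results differ in any observation.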
Still another way to recode is to sort and merge with a translation
dataset. That took .65 seconds on my test dataset, but the timing is
likely to vary considerably from one dataset to another.
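A minimal sketch of the merge approach follows. It assumes Stata 11 or later for the -merge m:1- syntax; the crosswalk here holds only the first five state codes, and the tempfile name xwalk and the four-observation master dataset are hypothetical.

* Build a small translation (crosswalk) dataset and save it to a tempfile
clear
input byte fips byte irs
1 1
2 2
4 3
5 4
6 5
end
tempfile xwalk
save `xwalk'
* Hypothetical master data holding only the source code
clear
input byte fips
6
1
4
4
end
* Attach irs from the crosswalk; many master rows match one crosswalk row
merge m:1 fips using `xwalk', keep(master match) keepusing(irs) nogenerate
list

The keep(master match) option retains master observations whose code has no entry in the crosswalk, leaving irs missing for them, and drops crosswalk rows that match nothing in the master data.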
In some cases the -group()- function of the -egen- command will be practical and is
also quite fast. It doesn't let you choose the new values (they are consecutive
integers), but it can be used to rapidly and easily compress long strings into single
bytes. See also -encode- or -multencode- (SSC). In my tests -encode- is nearly
instantaneous: less than a second to convert a million strings with a thousand
possible values.
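As a small illustration of both commands, the sketch below converts a string variable to integer codes two ways. The variable names county, county_id, and county_lbl and the sample values are hypothetical.

* Hypothetical string data with repeated values
clear
input str20 county
"Middlesex"
"Suffolk"
"Middlesex"
"Essex"
end
* egen group(): consecutive integers, one per distinct string, stored as a byte
egen byte county_id = group(county)
* encode: integer codes plus a value label that preserves the original text
encode county, generate(county_lbl)
* Show the underlying numeric codes rather than the value labels
list, nolabel

The practical difference is that -encode- attaches a value label, so the original strings remain visible in listings and tables, while group() keeps only the integers unless a label is requested.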
Note that if you want regular quantiles such as deciles or percentiles, there are
fast one-line commands.
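One built-in command that fits this description is -xtile-. Whether it is the command the note above has in mind is an assumption; the variable names income and decile and the simulated data are hypothetical.

* Hypothetical data: 1,000 draws from a uniform distribution
clear
set obs 1000
set seed 12345
generate income = runiform()
* Assign each observation to one of 10 quantile groups
xtile decile = income, nquantiles(10)
tabulate decile

Changing nquantiles(10) to nquantiles(100) would give percentile groups instead of deciles.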
At least one of the techniques mentioned above should suit any
recoding task, and even the most flexible methods are likely to be close
to an order of magnitude faster than -recode-.