These files accompany the paper: " A Note on Longitudinally Matching Current Population Survey Respondents" NBER Technical Working Paper No. 247 Brigitte C. Madrian, University of Chicago and NBER Lars John Lefgren, University of Chicago The following files (all STATA do files) are used in the alorithm to match CPS respondents in this paper: matchcps.do educate.do educate2.do married.do married2.do multobs.do race.do year.do matchcps.do is the master program. The other files are all called within this program. This program is written to match consecutive March CPS surveys, but can be easily modified to match other CPS surveys where matching is possible. educate.do and educate2.do redefine the education variables in the CPS which are not consistently coded over time for the time t and time t+1 data to be used in the CPS merge. married.do and married2.do redefine the marital status variables in the CPS for the time t and time t+1 data to be used in the CPS merge. race.do recodes the race variable for the CPS merge. year.do defines the year for the time t data. multobs.do deals with the fact that some individuals do not actually have a unique identifier at a point in time in the CPS. The working paper does not have much detail on this problem. This is discussed further below. The steps involved in matching CPS surveys are as follows: 1) Make two data extracts, one for time t and one for time t+1, both of which contain the variables necessary to merge and any additional variables to be used in a statistical analysis. For the analysis in this paper, this was done on a PC using the CPS Utilities data extraction program. The variables that we extracted for this paper are listed in Appendix Table B1. For a March-to-March merge, respondents with a MIS of 1-4 should be included in the time t extract, while respondents with a MIS of 5-8 should be included in the time t+1 extract. Respondents with a MIS of 5-8 in time t or with a MIS of 1-4 in time t+1 can be excluded since these respondents are not included in the sampling frame of both surveys. For a month-to-month merge (e.g. March-to-April), respondents with a MIS of 1-3 or 5-7 should be included in the time t extract, while respondents with a MIS of 2-4 or 6-8 should be included in the time t+1 extract. The program matchcps.do is written to perform a March-to-March merge of the CPS. It includes code to restrict the sample to the appropriate MIS ranges in time t and t+1 if extracts not subsetted on the basis of MIS are being used. matchcps.do assumes that the CPS extracts are in a directory called e:\cpswin\cpsdata\match\marchXX.dta This will clearly need to be changed to reflect the actual location of the data being used. 2) Run the program matchcps.do This program does the following: A) Recodes MIS in the time t+1 data to correspond the appropriate value that respondents in time t would have if they were in both surveys. For March-to-March merges, this subtract 4 from the t+1 MIS. For a month-to-month merge, subtract 1 from the t+1 MIS. B) Renames other variables that will be used to determine the validity of matches (e.g. sex, race, age, etc.) in the two extracts so that both the time t and time t+1 values are preserved. C) Sorts the time t and t+1 data by MIS, HHID, HHNUM and LINENO. For a March 1994-to-March 1995 merge, the data must be sorted by MIS, STATE, HHID, HHNUM, and LINENO. This would be true for some of the month-to-month merges in the 1994-1995 time period as well and results from the fact that the CPS only assigns unique household identifiers (HHID) within state over part of this time period. D) Match merges the sorted t and t+1 data extracts on the basis of the variables used to sort the data above. E) Deals with the problem of potential multiple observations on merged individuals (multobs.do). This is discussed further below. F) Applies the criteria discussed in the paper to flag those merged observations that do not appear to represent the same individuals. The program creates flags that correspond to the following merge criteria discussed in the paper: s|r|a|e, S|R|A|E, s|r|a, S|R|A, s|2, S|2, any2 and ANY2. Any of the merge criteria listed in Table 3 or Figure 4 can easily be coded from the variables sexdif, racedif, nragedif, ragedif, nredudif, and redudif that are defined in the program. Similarly, other merge criteria relying on these or other variables not included in our matchcps.do program could also be defined at this point. 3) Determine which merged observations to keep. The program matchcps.do does not apply any of the merge criteria discussed in the paper--all naively merged observations are retained and it is up to the user to determine any further observations to be deleted. More on the issue of multiple merged observations on the same individual. One problem that may arise (depending on which CPSs are being merged), is the presence of multiple post-merge observations with the same identifying variables (HHID, HHNUM, LINENO). This occurs because even though HHID, HHNUM and LINENO are meant to uniquely identify individuals, in some CPS surveys there are multiple respondents who have the same HHID, HHNUM and LINENO. If, for example, there are two individuals with the same HHID, HHNUM and LINEO in both of the CPS surveys being matched, we will end of with four merged observations. Two of the merged observations will be (potentially) correct, and two of them will be incorrect. We deal with this issue in a way designed to preserve as many potentially correct matches as possible (see the program multobs.do). First, we create a unique identifier for all respondents in both t (obsno) and t+1 (obsno2) (these identifiers are not unique across t and t+1, only within t and t+1-- we do not merge on the basis of these identifiers). After merging the t and t+1 data extracts as described above, we flag the post-merge observations that do not have a unique value of the t and/or t+1 identifiers that we create. Among these flagged observations, we then deleted those that do not have the same sex in t and t+1. We then flag the remaining post-merge observations that still did not have a unique value of the t and/or t+1 identifiers that we created. Among these flagged observations, we then delete those that do not have the same race in t and t+1. We repeat this process, deleting those observations with different values of education according to the less-restrictive age criteria and then according to the more-restrictive age criteria, different values of education according to the less-restrictive education criteria and then according to the more-restrictive education criteria, and finally, we delete those flagged observations with differences in their relationship to household head in time t and t+1. We then go through and flag any remaining observations with non-unique identifiers and delete all of these observations. The table below shows how many post-merge observations there on time t and time t+1 respondents with non-unique individual identifiers for each of the 1980-1998 March-to-March merges that are possible. It also notes how many observations with non-unique identifiers remain after we apply each of the deletion criteria just described. Note that for many of the March-to-March merges, non-unique identifiers are not a problem. For the March-to-March merges in which there are individuals with non-unique identifiers, these individuals constitute only a small fraction of the total sample. Details on Observations with Non-Unique Individual Identifiers When Longitudinally Merging the CPS Observations with non-unique identifiers remaining after deletion on the basis of differences in: Number of post-merge observations with non-unique IDs Sex Race Age Educ. HH Re. 1980-1981 1980 respondents 210 40 40 2 2 2 1981 respondents 261 64 64 8 6 4 1981-1982 1981 respondents 208 62 60 6 6 2 1982 respondents 228 34 34 12 10 6 1982-1983 1982 respondents 298 86 86 10 10 6 1983 respondents 234 62 60 6 6 2 1983-1984 1983 respondents 266 76 76 16 14 6 1984 respondents 343 104 100 22 116 8 1984-1985 1984 respondents 280 62 60 10 8 2 1985 respondents 285 66 64 18 18 6 1986-1987 1986 respondents 264 86 86 16 14 10 1987 respondents 357 102 92 8 8 4 1987-1988 1987 respondents 0 0 0 0 0 0 1988 respondents 318 92 82 14 8 0 1988-1989 1988 respondents 0 0 0 0 0 0 1989 respondents 0 0 0 0 0 0 1989-1990 1989 respondents 0 0 0 0 0 0 1990 respondents 0 0 0 0 0 0 1990-1991 1990 respondents 0 0 0 0 0 0 1991 respondents 0 0 0 0 0 0 1991-1992 1991 respondents 0 0 0 0 0 0 1992 respondents 0 0 0 0 0 0 1992-1993 1992 respondents 0 0 0 0 0 0 1993 respondents 0 0 0 0 0 0 1993-1994 1993 respondents 24 10 10 0 0 0 1994 respondents 0 0 0 0 0 0 1994-1995 1994 respondents 0 0 0 0 0 0 1995 respondents 243 149 132 6 6 6 1996-1997 1996 respondents 0 0 0 0 0 0 1997 respondents 2 0 0 0 0 0 1997-1998 1997 respondents 0 0 0 0 0 0 1998 respondents 0 0 0 0 0 0