Linking Individuals Across Historical Sources: a Fully Automated Approach

Ran Abramitzky; Roy Mill; Santiago Pérez

doi:10.3386/w24324

Working Paper 24324

Issue Date February 2018

Revision Date March 2019

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we suggest a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I errors (false positives) and type II errors (false negatives). The first step guides researchers in choosing which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates using our method are remarkably similar to those obtained with the IPUMS linking method, which relies on hand linking to create a training sample. We provide R code and a Stata command that implement this method.
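
To make the second and third steps concrete, the following is a minimal sketch of EM-based match probabilities, not the authors' R or Stata implementation. It assumes a simplified two-class mixture in which each candidate pair is summarized by binary agreement indicators (for example, whether name, age, and birthplace agree) that are independent within each class. The function names, the simulated data, and the 0.9 threshold are all hypothetical choices made for illustration.

```python
import random


def em_match_probabilities(patterns, n_iter=200, p=0.1, m_init=0.9, u_init=0.1):
    """Fit a two-class mixture (match / non-match) by EM.

    patterns: list of tuples of 0/1 agreement indicators, one per candidate pair.
    Returns (posteriors, p, m, u), where p is the estimated share of matches,
    m[j] = P(field j agrees | match) and u[j] = P(field j agrees | non-match).
    """
    eps = 1e-6  # keep probabilities away from 0 and 1
    clamp = lambda x: min(max(x, eps), 1 - eps)
    k, n = len(patterns[0]), len(patterns)
    m, u = [m_init] * k, [u_init] * k
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match.
        post = []
        for g in patterns:
            like_m, like_u = p, 1 - p
            for j, agree in enumerate(g):
                like_m *= m[j] if agree else 1 - m[j]
                like_u *= u[j] if agree else 1 - u[j]
            post.append(like_m / (like_m + like_u))
        # M-step: re-estimate parameters as posterior-weighted averages.
        total = sum(post)
        for j in range(k):
            m[j] = clamp(sum(w * g[j] for w, g in zip(post, patterns)) / total)
            u[j] = clamp(sum((1 - w) * g[j] for w, g in zip(post, patterns)) / (n - total))
        p = clamp(total / n)
    return post, p, m, u


def select_links(post, threshold=0.9):
    """Third step (simplified): keep pairs whose match probability is high enough."""
    return [i for i, w in enumerate(post) if w >= threshold]


def simulate_pairs(n=2000, seed=0):
    """Hypothetical data: 20% true matches; fields agree w.p. 0.9 (match) or 0.1."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        agree_prob = 0.9 if rng.random() < 0.2 else 0.1
        pairs.append(tuple(int(rng.random() < agree_prob) for _ in range(3)))
    return pairs


pairs = simulate_pairs()
post, p_hat, m_hat, u_hat = em_match_probabilities(pairs)
links = select_links(post)
print(f"estimated match share: {p_hat:.3f}, pairs kept: {len(links)}")
```

Raising the threshold in `select_links` trades false positives for false negatives, which is the sense in which a researcher can move along the frontier described above.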

An early version of this paper, entitled “Linking Records across Historical Sources”, was Chapter 3 of Roy Mill's dissertation completed at Stanford in June 2013. We have benefited from conversations with Jaime Arellano-Bover, Leah Boustan, Alvaro Calderón, Raj Chetty, Jacob Conway, Katherine Eriksson, James Feigenbaum, Helen Kissel, Randall Walsh, Tom Zohar and participants in the UC Berkeley complete count census workshop. The views expressed herein are those of the authors and do not necessarily reflect the views of the National Bureau of Economic Research.

Ran Abramitzky, Roy Mill, and Santiago Pérez, "Linking Individuals Across Historical Sources: a Fully Automated Approach," NBER Working Paper 24324 (2018), https://doi.org/10.3386/w24324.

- implementation codes
- February 13, 2018
- October 31, 2018
