Ancestry.com and IPUMS Complete Count Restricted File
Confidentiality Considerations
The name fields (namefrst and namelast) are available to affiliated users at the NBER by special arrangement through IPUMS. NBER affiliates wishing to use the IPUMS-RESTRICTED (Ancestry.com) census files for a new project should sign the application and NDA agreement forms here and send them to Carla Tokman for our submission to IPUMS. Once approved and assigned a project number by IPUMS, the project will be forwarded to the NBER IRB for review. The IRB may follow up with additional questions.
Once all approvals are in place, you can be added to the Linux groups with permission to read the data. These files (or extracts from them) may be processed on our servers but should not be downloaded from them.
To add an investigator or RA to an existing project, send a signed agreement form marked with the project title and IPUMS project number. The approval process is the same, and you will be notified when the new researcher can access the data.
Linking between the restricted and public versions is fine, but the data must be maintained/analyzed on the NBER server.
Please ensure that your extracts are not world-readable. Respecting the agreement is important to ensure continued access to this valuable resource for you and your colleagues. We can create a shared directory for you and others working on the same project.
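As an example, here is a minimal sketch of removing world access from a saved extract, issued from within Stata using the shell escape (the file name is hypothetical):

    * remove read/write/execute permission for users outside the owner and group
    ! chmod o-rwx my_extract.dta

The same chmod command can of course be run directly from the shell.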
Citation
Publications and research reports based on the IPUMS USA database must cite it appropriately. The citation should include the following:
Steven Ruggles, Catherine A. Fitch, Ronald Goeken, Josiah Grover, J. David Hacker, Matt Nelson, Jose Pacas, Evan Roberts, and Matthew Sobek. IPUMS Restricted Complete Count Data: Version 2.0 [dataset]. Minneapolis: University of Minnesota, 2020.
Exporting data
It is sometimes possible to export data from the NBER server for external processing, such as address geocoding or name classification. Here is what you will need to include in your application to do that. This is not a procedure for releasing data for public access.
- Brief description of why you need to remove the data from the server.
- Who will have access to the data removed from the server?
- A statement that the approved data will not be shared with non-approved persons and will not be distributed publicly (these data are for internal use only; we have a separate protocol for requesting permission to distribute data publicly)
- How many observations are in the data? (Please state the unit of observation, such as persons, households, or cities.)
- What is the file size? (MB, GB, or TB)
- What variables are you removing from the server?
- What are the formats of the variables? (We just need to know string vs. numeric.)
- Is each variable you are removing an IPUMS variable, or was it created by you or drawn from another source?
- Are the data for each variable aggregated or individual-level records?
- If the data are aggregated, at what level (e.g., household, city, state)?
Once IPUMS approves the request, IT staff will verify that the final dataset conforms to what was approved. After that confirmation, you can remove the data from the server.
File locations
This page documents the NBER 1880-1940 collection. The 1790-1840 files are available in /home/data/cens1930. An earlier version of 1940 is in /home/data/cens1940.
Earlier versions of the data are kept available; there is no need to keep a personal copy.
Starting with the June 2019 distributions, we keep our copies of the files in /home/data/census-ipums/ and its subdirectories. We have 1850-1940 except 1890 (1890 was lost in a fire). These files contain all the named fields, not just the restricted fields. Unedited (numbered) fields are not included; if you need unedited fields, contact Carla Tokman and we can add what you need.
Each IPUMS revision set is kept in a separate directory, starting with /home/data/census-ipums/v2019 for the version obtained by NBER from June through October 2019. Within that directory are directories ("do", "sps", etc.) with programs that can read the data files in various packages. These programs are provided for reference, as we have already modified the Stata code to convert the raw ASCII files (in ./dat) to dta format in ./dta. Comma-delimited files are in the ./csv directory and Parquet files are in ./parquet. Other formats can be added if useful and requested.
Please do not make private copies of the datasets. The locations of earlier versions of the files will not change, and those files will not be deleted or updated; /home/data/census-ipums/current will always point to the latest available revision.
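Rather than copying files, you can point your code at the current revision. A minimal Stata sketch (the variable list here is only illustrative):

    * read from the latest revision rather than making a private copy
    global IPUMS /home/data/census-ipums/current
    use histid age sex bpl using $IPUMS/dta/1920, clear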
Documentation
The IPUMS website covers all the publicly available variables. The additional variables in the restricted-use file include:
- namefrst: 16 character first name (and possibly middle initial)
- namelast: 16 character last name
- histid: 36 character person id for matching across IPUMS versions (but not census decades)
- street: street address
Here is a compact concordance of variables and descriptions.
File Structure
The original files are hierarchical, but we have created the dta (and other format) files as rectangular person datasets; that is, the household record is appended to each person record. We also apply the scaling factors in the IPUMS-supplied code, so the data should conform to the documentation. Value labels that are merely the ASCII expression of the numeric value are dropped. There are no other changes.
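As a rough Stata sketch of what "rectangular" means here (the file names are hypothetical; the actual conversion is done by the modified IPUMS code described above):

    * hypothetical inputs: hh.dta with one record per household (keyed on serial)
    * and persons.dta with one record per person
    use persons, clear
    merge m:1 serial using hh, keep(match) nogenerate
    * result: one row per person, with the household fields repeated on each row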
Resource considerations
These are very large files, as the record counts and file sizes (up to 100GB) attest, but with some thought it is practical to work with them using only traditional econometric software. The .dta files are somewhat more manageable. There is advice for Stata users on dealing with very large files here, but see especially this, which is highly relevant.
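One cheap habit is to check a file's contents and size before loading it; for example:

    * list the variables and the number of observations without loading the data
    describe using /home/data/census-ipums/current/dta/1920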
A new, fast and compact format is Parquet. This is column-oriented, so if you load just a few variables only a fraction of the file need be read. Please see here for details.
Matching
The Census Linking Project at Princeton has created a set of linked datasets between every pair of historical censuses using a variety of automated methods. There are considerable savings in time and resources in using a pre-made match. The code and documentation are also available on our system, while access to the data on our system is restricted to members of the "cens1930" group. For internal use the files are at:
- /home/data/census-ipums/linking_project
Publications using data from the matches should cite the Census Linking Project as:
Ran Abramitzky, Leah Boustan and Myera Rashid. Census Linking Project: Version 1.0 [dataset]. 2020. https://censuslinkingproject.org
To facilitate users making their own matches, we have prepared two resources. First, for each census year and sex, a file containing the variables useful for matching: bpl, sex, namefrst, namelast, age, datanum, serial, pernum and histid. These files are quite reasonable in size and allow matches to be made with a modest memory footprint. They are located in:
- /home/data/census-ipums/current/mx/csv/mxMMMMMM.csv
- /home/data/census-ipums/current/mx/dta/mxMMMMMM.dta
A second resource (under construction) is a set of potential matches based on Jaro-Winkler scores better than .75 for first and last names and an age difference within plus or minus 5 years of the expected difference. These are:
- /home/data/census-ipums/mx/csv/mpBBBBS.csv
- /home/data/census-ipums/mx/dta/mpBBBBS.dta
- /home/data/census-ipums/mx/csv/mpNNNNMMMMBBBBS.csv
- /home/data/census-ipums/mx/dta/mpNNNNMMMMBBBBS.dta
These may include multiple records from the later census (MMMM) that are potential matches for a record from the earlier census (NNNN). There are separate files (BBBBS) for each birthplace and sex. It is up to the user to select the best match after merging with wider records.
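For users constructing their own candidate pairs from the mx extracts, a rough Stata sketch follows. The extract file names are placeholders, the renaming scheme is just one way to keep the two censuses' variables distinct, and the 10-year gap assumes adjacent census decades; this is not the procedure used to build the mp files.

    * pair an earlier-census extract with a later-census extract (placeholder names)
    use histid namefrst namelast age bpl sex using MX_EARLIER, clear
    rename (histid namefrst namelast age) (histid_a namefrst_a namelast_a age_a)
    tempfile earlier
    save `earlier'
    use histid namefrst namelast age bpl sex using MX_LATER, clear
    rename (histid namefrst namelast age) (histid_b namefrst_b namelast_b age_b)
    * form all pairs of records that share birthplace and sex
    joinby bpl sex using `earlier'
    * keep pairs whose age gap is within 5 years of the expected 10-year gap
    keep if abs((age_b - age_a) - 10) <= 5

In practice you would work within a single birthplace (or a finer block) at a time to keep the number of candidate pairs, and the memory footprint, manageable.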
Please respect other users by being reasonably efficient with computational resources, especially memory. In particular, when reading even one of these files into Stata you will want to subset on variables or rows, or both. There is a directory /home/data/census-ipums/tiny with Arkansas records only. This is a good way to get a small sample for testing that can be followed through time. In Stata, a qualifier such as "use /home/data/census-ipums/current/dta/1920 in 1/10000" will give you a small file, but with no ability to test linking through time, and some Stata versions will read the entire file anyway and discard the records after 10000, which is time-consuming. Using "use /home/data/census-ipums/tiny/dta/1920" is much more satisfactory.
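As a hedged example of subsetting on both variables and rows when loading a full year (the variable list and the age restriction are purely illustrative):

    * read only the variables needed and only the observations that satisfy the
    * condition; the file is still scanned, but memory holds only what is kept
    use histid namefrst namelast age sex bpl if age >= 18 & age <= 30 using /home/data/census-ipums/current/dta/1920, clear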
Showload will show you the available memory on all the machines, and "top -o RES" will show how your job is doing on the current machine. Computer time is cheap, but waiting for the computer is not. You may run multiple jobs but resist the urge to use more than half the available CPU or memory on any one machine. If you ask for more memory than the computer has, your job will run so slowly that it may never finish. It is always a good idea to keep track of the progress of large, long jobs.
Note that all our disk storage is compressed in the filesystem. Zip or Z compression will not reduce the actual resources used and will add complexity and time to your analysis.
Notes and questions for discussion
- Are any more of the uscenNNNN_NNNN (numbered) variables useful? Including all of them would multiply the load times. At this time only rawhnum has been added (House number on street).
- Online documentation from IPUMS suggests using datanum, serial and pernum for identifying individual records, but common practice among NBER users is to use histid. Also, datanum is absent in 1860-70. An advantage of histid is that it is maintained across IPUMS versions, but do you ever merge across different IPUMS versions? Why? The histid is a rather long identifier, and I feel like I need to carry along the others anyway.
- The name fields sometimes contain what appear to be stray nonsense characters such as quote marks, dollar signs, and brackets. The quote marks especially can trip up Stata. Would it be better to drop these? (A sketch of one way to strip them follows this list.)
- The programs from IPUMS would create hierarchical files. I didn't think most users would like that. Is there a preference for hierarchical files?
- I did not continue the practice of dividing the files into 100 pieces. The dta files can load in a couple of minutes. I do make extracts with the variables required for matching divided by birthplace and sex. Those are never very large and are available in ./mx. Is that ok?
- It should be possible to greatly reduce the resource load for matching across decades and I would like to talk to users doing that. Some preliminary work is outlined here.
- Jaro-Winkler distance programs don't seem to have uniform outcomes. The Feigenbaum -jarowinkler.ado- program applies the Winkler correction to all scores, while Winkler himself applies it only when the Jaro score is greater than .7. The Winkler adjustment parameter is .1; other authors use other values. The distance for null and one-character strings also varies across implementations. I can follow Feigenbaum, but seek input from all users. (A worked example of the adjustment follows this list.)
- Winkler also has an adjustment for often confused characters (such as "X" and "K") which is not often used. Other authors have nickname lists. I would like to collect such lists and offer them for more general use.
- Some users have supplemented the data with additional variables. If you would like to allow others to use these new variables, I can add them to the common use files.
- If you have a crosswalk across decades, I would be happy to post it here for other users. Records can be identified by histid or the combination of datanum, serial and pernum. I will standardize the variable names by adding a 4 digit year to each.
- I have dropped 50,000 value labels which are simply ASCII presentations of numeric values, such as: label define value x 7 `7'
- There are 196 "P" records in 1940 with no corresponding "H" record. This does not happen in any other year, and those records are omitted from the .dta file.
- There are a number of records with blank histid.
- What is the appropriate sort order? Matching by histid requires records sorted by histid, which is not the native order.
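On the stray characters in the name fields noted above, a hedged Stata sketch of one possible cleanup; the character class kept here (letters, spaces, and hyphens) is an assumption about what should survive:

    * strip anything other than letters, spaces, and hyphens from the name fields
    * (ustrregexra requires Stata 14 or later)
    foreach v in namefrst namelast {
        replace `v' = ustrregexra(`v', "[^A-Za-z \-]", "")
    }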
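On the Jaro-Winkler question above: under the standard formulation the adjusted score is JW = J + L*p*(1 - J), where J is the Jaro score, L is the length of the common prefix (capped at 4), and p is the adjustment parameter. A standard worked example: for "MARTHA" and "MARHTA" the Jaro score is about 0.944; with a 3-character common prefix and p = .1, the Jaro-Winkler score is 0.944 + 3*0.1*(1 - 0.944), or about 0.961. Under Winkler's own rule the adjustment applies here because the Jaro score exceeds .7; Feigenbaum's program would apply it in any case.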