PDF file location: http://www.murraylax.org/rtutorials/ipumsdata.pdf

HTML file location: http://www.murraylax.org/rtutorials/ipumsdata.html


Note on required packages: The following code requires the SAScii package. This package contains functions that run SAS import code in R, so that a data set designed to be opened in SAS can be loaded into R.

install.packages("SAScii") # This only needs to be run once on your machine

library("SAScii") # This needs to be run every time you start a new R session
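If you would rather not decide by hand whether the install line still needs to be run, a conditional install is one option. The sketch below is only an illustration: the helper name ensure_package is hypothetical and not part of SAScii. It calls install.packages() only when requireNamespace() reports that the package is missing.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package can now be loaded.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing.
ok <- ensure_package("stats")

# For this tutorial one would call ensure_package("SAScii"),
# followed by library("SAScii") as usual.
```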


1. Downloading the Data

After you have submitted an extract, you will be brought to a screen like this one:

[Screenshot: IPUMS extract screen]

The data link will not appear immediately. IPUMS will send you an email when your extract is ready. How long this takes depends on the number and size of the samples requested: an extract with a single sample of several thousand observations may take only a minute, while a very large extract may take an hour or more.

When the data link appears, download the file and note the folder in which it is saved. This is a compressed text file that must be processed before statistical software can open it. The file I downloaded for this tutorial is named cps_00013.dat.gz.

Also download the SAS command file from its link. This file contains code for SAS (another statistical software program) that describes how to process and open the data file. Save it in the same folder as the data file. The file I downloaded for this tutorial is named cps_00013.sas.
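To illustrate why the data file needs processing, the sketch below builds a tiny gzip-compressed text file in which the values are packed together with no separators, and reads it with base R's read.fwf(). It assumes the extract is a fixed-width file of this general kind. The layout used here (a 4-character YEAR followed by a 3-character EDUC) is made up for illustration and is not the layout of the real extract; the real column positions come from the SAS command file.

```r
# Toy example: each line packs YEAR (4 characters) and EDUC (3 characters)
# with no separator. This layout is hypothetical.
toy_file <- tempfile(fileext = ".dat.gz")
con <- gzfile(toy_file, "wt")
writeLines(c("2017073", "2017001", "2017125"), con)
close(con)

# The column widths have to be supplied by hand here; for a real extract
# the SAS command file carries this information.
toy <- read.fwf(gzfile(toy_file), widths = c(4, 3),
                col.names = c("YEAR", "EDUC"))
print(toy)
```

Without the widths, each line would be read as a single number, which is why the data file and the SAS command file are always used together.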

2. Opening the Data

In the file viewer in RStudio, navigate to the folder where you saved the data and SAS script files. You can navigate to a folder by clicking the ... button in the upper-right corner of the file viewer.

Once you have navigated to the correct folder, click the More button above the file viewer, and then click Set As Working Directory, as shown in the screenshot below.

[Screenshot: the More menu above the RStudio file viewer, with Set As Working Directory]

The following code processes the data using the SAS script and stores the data as a data.frame object called df.

df <- read.SAScii("cps_00013.dat.gz", "cps_00013.sas")
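The call above assumes that both files are in the working directory. As an alternative sketch, the two paths can be built explicitly and checked before reading. The helper extract_paths and the temporary demonstration folder are hypothetical; only file.path() and file.exists() come from base R.

```r
# Hypothetical helper: build the two file paths for an extract and
# stop early if either file is missing.
extract_paths <- function(dir, stem) {
  paths <- c(dat = file.path(dir, paste0(stem, ".dat.gz")),
             sas = file.path(dir, paste0(stem, ".sas")))
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("File(s) not found: ", paste(not_found, collapse = ", "))
  }
  paths
}

# Demonstration with empty placeholder files in a temporary folder.
demo_dir <- file.path(tempdir(), "ipums_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("cps_00013.dat.gz", "cps_00013.sas")))

p <- extract_paths(demo_dir, "cps_00013")

# With the real files in place, one could then run:
# df <- read.SAScii(p[["dat"]], p[["sas"]])
```

Checking for the files first turns a confusing read error into a message that names the missing file.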

We can get a quick view of the variable names, the data type of each variable, and the first few observations by passing the data frame df to the function str().

str(df)
## 'data.frame':    185914 obs. of  5 variables:
##  $ YEAR     : num  2017 2017 2017 2017 2017 ...
##  $ EDUC     : num  73 1 10 125 73 81 73 60 111 81 ...
##  $ CLASSWKR : num  0 0 0 0 21 21 21 0 21 21 ...
##  $ UHRSWORKT: num  999 999 999 999 40 40 40 999 40 40 ...
##  $ EARNWEEK : num  10000 10000 10000 10000 10000 ...

The data frame includes 5 variables and more than 185,000 observations.
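These counts can also be read off directly with dim(), nrow(), and names(). So that the sketch runs on its own, it uses a small stand-in data frame built from the first ten values printed by str() above. YEAR and EARNWEEK are left out because the printed output truncates them before ten values. With the real df the calls are the same.

```r
# Stand-in for df: the first ten observations as printed by str() above.
toy_df <- data.frame(
  EDUC      = c(73, 1, 10, 125, 73, 81, 73, 60, 111, 81),
  CLASSWKR  = c(0, 0, 0, 0, 21, 21, 21, 0, 21, 21),
  UHRSWORKT = c(999, 999, 999, 999, 40, 40, 40, 999, 40, 40)
)

dim(toy_df)      # number of rows, then number of columns
nrow(toy_df)     # number of observations
names(toy_df)    # variable names
head(toy_df, 3)  # first three observations
```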

3. Re-coding Variables

A number of the variables include codes for categories and missing observations. In this section, we take a closer look at these variables and discuss how to re-code them to make them usable.
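As a minimal sketch of the general idea, a numeric code that stands for a missing value can be replaced with NA so that R treats it as missing. The vector below is made up for illustration, and treating 999 as the missing code is an assumption that should be checked against the IPUMS documentation for each variable.

```r
# Made-up values; 999 is assumed here to be the code for a missing value.
hours <- c(999, 40, 35, 999, 40)

# Replace the code with NA so it can be excluded from calculations.
hours_clean <- replace(hours, hours == 999, NA)

mean(hours)                      # distorted by the 999 codes
mean(hours_clean, na.rm = TRUE)  # average of the observed values only
```

Leaving the codes in place would silently inflate any average or sum, which is why re-coding comes before analysis.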

3.1 Re-coding to an Ordered Factor: Education

The variable EDUC holds a code between 000 and 125 for the level of education, with a value of 999 indicating a missing or unknown value. The following screenshots from https://cps.ipums.org/cps-action/variables/EDUC#codes_section show the meaning of each code.

EDUC Page Codes