This post, the third in a sequence of four, addresses the challenge of moving data embedded in section headers to a variable, such that the section header information appears on the same row of data as the observations to which it applies. In this example, the section headers contain parish information, and the observations are zip codes within those parishes. To keep things simple, we'll ignore the data transformation issues discussed in the last post; the next post will bring everything together.
Accordingly, let's start with the original, raw data and drop the rows that don't contain data, as we did in the last post.
# data from http://www.datacenterresearch.org/data-resources/active-addresses-by-zip-code/ # Import file library(XLConnect) data <- readWorksheet(loadWorkbook("TheDataCenter_ActiveResidentialAddresses.xls"),sheet=1) # Drop the rows from the dataset that do not contain data data <- data[-c(1:3, 95:98),] data <- data[rowSums(is.na(data))<ncol(data),]# Drop the intersperced blank rows. rownames(data) <- NULL # For convenience, we're also going to drop the date rows, which we don't need for this step. data <- data[-c(1,2),] # Rename the first column to reflect the fact that it mostly contains zip code data data <- rename(data, c("Col1"="zipcode")) # We need to move the information in the section headers to a variable and assign the correct parish to each zipcode. # Before doing this, we need to make sure that zipcode is a unique identifier and that no zipcodes appear in more than one parish. zipcodes <- data$zipcode duplicates <- zipcodes[duplicated(zipcodes)] dup.zips <- subset(data, zipcode %in% duplicates) # This tells us that 70068 appears in two different parishes, although we don't know which ones from the data, # which illustrates why parish needs to be a variable table(dup.zips$zipcode) remove(zipcodes, duplicates, dup.zips) # 70068 appears twice, so we need to create a unique identifier for each parish-zip code combination in the dataset. data$unique.ID <- rownames(data) data$unique.ID <- as.integer(data$unique.ID) rownames(data) <- NULL # Now, determine those rows that are section headers, with parish names data$parish.row <- ifelse(grepl("Parish", data$zipcode), data$zipcode, NA) # Set up a variable for parish, and # impute parish data by "carrying forward" the parish data for the non-section header rows beneath each parish row library(zoo) # na.locf function data$parish <- na.locf(data$parish.row) # Check that the parish information was filled in delete <- subset(data, select=c("zipcode","parish.row","parish")) remove(delete)
# Delete parish rows, which are not unique observations but summaries of the zip code data below data <- subset(data, is.na(data$parish.row)) data$parish.row <- NULL