Reformat Time-Series Excel Table: Transpose and Handle Dates

This post, the second in a sequence of four, works with the New Orleans active addresses dataset introduced in the last post and addresses the challenge of transposing the data long while preserving the date information. You can download the raw data here.

The challenge is that the date information is spread out over two rows (one for year and one for month), and we want to make sure that when we flip the data long, the date information is connected to the correct values. Additionally, some of the year information is missing, so we'll also need to fill this in. 

To do this, we'll subset each row with date information, transpose it, and add a variable with the former column in which that piece of date information was found. Then we'll reshape the active address data, name the timevar in this process based upon column, and merge in the date information using the former column name as the merge key.

# data from http://www.datacenterresearch.org/data-resources/active-addresses-by-zip-code/

# Import file
library(XLConnect)
data <- readWorksheet(loadWorkbook("TheDataCenter_ActiveResidentialAddresses.xls"),sheet=1)

# Drop the rows from the dataset that do not contain data
data <- data[-c(1:3, 95:98),]
data <- data[rowSums(is.na(data)) < ncol(data), ]# Drop the intersperced blank rows.
rownames(data) <- NULL

# R assigned nonsensical column names during the import process; 
# set them to placeholders, which we'll use as identifiers when we transpose the data and merge in the year and month information.
columns <- dim(data)[2]
column.names <- character()

for (i in 1:columns) {

col.name <- paste("col", i, sep="")
column.names <- c(column.names, col.name)

}

remove(columns, i, col.name)
colnames(data) <- column.names
# Create a vertical vector to represent years
years <- data[1,-1] # subset the row with the year data, minus the first column
years.t <- t(years) # transpose to obtain a vertical year vector
remove(years)
years.t <- as.data.frame(years.t)
library(plyr) # rename function
years.t <- rename(years.t, c("1"="year")

# Carry the last observation forward for rows with missing year data
library(zoo) # na.locf function
years.t$year <- na.locf(years.t$year) 

 

years.t with var variable

years.t with var variable

# This assigns the rownames (formerly the column names) to a variable, which we'll use as an identifier for merge, below.
years.t$var <- rownames(years.t)
rownames(years.t) <- NULL
# Create a vertical vector of months
months <- data[2,-1] # subset the row with the month data, minus the first column
months.t <- t(months)
remove(months)
months.t <- as.data.frame(months.t)
library(plyr) # rename function
months.t <- rename(months.t, c("2"="month"))

# This assigns the rownames (formerly the column names) to a variable, which we'll use as an identifier for merge, below.
months.t$var <- rownames(months.t) 
rownames(months.t) <- NULL

# Now drop the date rows, as R thinks they're observations
data <- data[-c(1,2),]

# Rename the first column to reflect the fact that it mostly contains zip code data
data <- rename(data, c("col1"="zipcode"))
 
################
# Reshape long #
################

# Remove first column from column.names
column.names <- column.names[!is.element(column.names, "col1")]

# Expected number of rows in the new dataset
num.rows <- nrow(data) * length(column.names)

# We don't need to worry about the order of the column names here - we'll be merging the date information in by column,
# so the correct date information will still map to the correct active address values.

library(reshape2)
data.long <- reshape(data, 
 varying = c(column.names), 
 v.names = "active_addresses",
 timevar = "var", 
 idvar = "zipcode",
 times = c(column.names),
 new.row.names = 1:num.rows,
 direction = "long")
rownames(data.long) <- NULL
remove(num.rows, column.names)

data.long <- merge(data.long, years.t, by="var")
data.long <- merge(data.long, months.t, by="var")
remove(months.t, years.t)

# Create a date variable that R recognizes as a date
data.long$date <- paste(data.long$month, data.long$year, sep="-")
data.long$date <- as.Date(paste('01', data.long$date), format='%d %b-%Y')

data.long <- subset(data.long, select=c("zipcode", "date", "year", "month", "active_addresses"))
data.long <- data.long[order(data.long$zipcode, data.long$date),]
rownames(data.long) <- NULL

There are still some lingering issues, but we'll fix them in the next post.