Big Data Wrangling: Reshaping from Long to Wide

Note: I've found a better way to reshape large datasets from long to wide than the approach I outline in this blog post. Please see the new approach at: http://www.kimberlycoffey.com/blog/2016/9/updated-big-data-wrangling

Reshaping datasets from wide to long in R tends to work smoothly regardless of the size of the dataset, but reshaping from long to wide can break (or take so long you wonder if it's stopped working) with large datasets. The threshold at which this problem arises will vary depending upon your system and memory allocation. I find that it occurs with datasets of ~25,000 rows or more with the default heap size and with datasets of ~1 million rows or more with maximum heap allocation.

This post shares an alternative approach that resolves these size-related limitations when reshaping large datasets from long to wide. The essence of the solution is this: subset the data by the levels of the repeated assessment, rename the measured variable to something unique to each level, and then merge the per-level subsets back together. Although the reshape() and dcast() code for this task is more concise, the subsetting approach doesn't stall on very large datasets.
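In miniature, the pattern looks like this. (This toy data frame and its column names are illustrative only; the full HSB-style example appears below.)

```r
# Toy long dataset: two measurements per id
long <- data.frame(id      = rep(1:3, each = 2),
                   subject = rep(c("math", "read"), times = 3),
                   score   = c(52, 48, 55, 60, 41, 47))

# One subset-rename-merge pass per level of the repeating variable
wide <- data.frame(id = unique(long$id))
for (lev in unique(long$subject)) {
  piece <- long[long$subject == lev, c("id", "score")]
  names(piece)[names(piece) == "score"] <- paste(lev, "score", sep = ".")
  wide <- merge(wide, piece, by = "id", all = TRUE)
}
# wide now has one row per id, with math.score and read.score columns
```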

If you've ever found yourself waiting minutes or hours while R chews on a reshape() or dcast() command, hoping the program hasn't silently stalled, there's hope!

To present the alternative approach, I create simulated data modeled after the oft-referenced High School and Beyond (HSB) dataset. There's a nice example of how to use the reshape() command to reshape this dataset from long to wide at http://www.ats.ucla.edu/stat/r/faq/reshape.htm. The dataset in that example is only 200 observations long, however, and it's only with much larger datasets that this command and similar ones fail.

Below, I generate a larger, fake dataset modeled on the HSB data to demonstrate the scalability of the solution, and compare reshape(), dcast(), and my alternative approach. I pull the different levels of the repeating variable from the long dataset using the unique() command (in this example, the levels are school subjects) and loop over them. In each iteration, I subset the long dataset by one school subject, rename the "score" variable to something specific to that subject, and merge the result into the growing wide dataset.

n <- 100000 # Set the sample size.
# You can tinker with this if you want to test the sample size that stresses your system. 
###########################
# Generate simulated data #
###########################

id <- seq(1,n, by=1)
female <- sample(0:1, n, replace=TRUE)
race <- sample(1:4, n, replace=TRUE)
prog <- sample(1:3, n, replace=TRUE)
schtyp <- sample(1:2, n, replace=TRUE)
ses <- sample(1:3, n, replace=TRUE)

master <- data.frame(id, female, race, prog, schtyp, ses)
remove(female, race, prog, schtyp, ses)

school.subject <- c("math","read","science","socst","write")
data.long <- merge(as.data.frame(id), as.data.frame(school.subject), all=TRUE)
remove(school.subject)
data.long <- merge(master, data.long, by=c("id"), all=TRUE)
data.long <- data.long[order(data.long$id, data.long$school.subject),]
rownames(data.long) <- NULL

s <- n*5 # five subject rows per student
data.long$score <- sample(40:60, size=s, replace=TRUE)


##################################################
# Reshape dataset using conventional approaches: #
##################################################

# reshape()
print("Start time:")
Sys.time()
wide.reshape <- reshape(data.long,
                        timevar = "school.subject",
                        idvar = c("id","female","prog","race","schtyp","ses"),
                        direction = "wide")
print("End time:")
Sys.time()

# dcast()
library(reshape2)
print("Start time:")
Sys.time()
wide.dcast <- dcast(data.long, id + female + prog + race + schtyp + ses ~ school.subject, value.var="score")
print("End time:")
Sys.time()
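As a side note, base R's system.time() captures user, system, and elapsed seconds in one call, which is a bit tidier than printing Sys.time() before and after. A minimal sketch, where the timed expression is just a stand-in for whichever reshaping call you're benchmarking:

```r
# Wrap the expression to be timed; a reshape() or dcast() call would go here
timing <- system.time({
  stand.in <- replicate(5, sum(rnorm(1e4)))
})
print(timing["elapsed"]) # elapsed wall-clock seconds
```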


#########################
# Alternative approach: #
#########################

print("Start time:")
Sys.time()

# Set up empty (wide) data frame for merge
wide.new <- subset(master, select=c("id"))
row.names(wide.new) <- NULL

repeating <- unique(data.long$school.subject)

# Subset by school subject, rename, and recombine (wide)
library(plyr) # provides rename(); load once, outside the loop

for (i in seq_along(repeating)) {

  level <- repeating[i]
  data <- subset(data.long, school.subject==level, select=c(id, score))

  var.name <- paste(level, "score", sep=".")
  data <- rename(data, c("score" = var.name))

  wide.new <- merge(wide.new, data, by=c("id"), sort=TRUE, all.x=TRUE, all.y=TRUE)

  print(i) # lets you know where R is in the loop - useful for large loops

}

wide.new <- merge(master, wide.new, by=c("id"), sort=TRUE, all.x=TRUE, all.y=TRUE)

remove(i, level, var.name, repeating, data)

print("End time:")
Sys.time()
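To build confidence that the loop and the conventional commands agree, here's a hedged sanity check on a toy dataset, separate from the full run above. (The loop here uses reshape()'s score.&lt;subject&gt; naming convention so the columns line up; in the main example I used &lt;subject&gt;.score instead.)

```r
# Toy input with one score per id-subject pair
toy <- data.frame(id = rep(1:2, each = 2),
                  school.subject = rep(c("math", "read"), times = 2),
                  score = c(50, 44, 58, 61))

# Base-R reshape() result
via.reshape <- reshape(toy, timevar = "school.subject", idvar = "id",
                       direction = "wide")

# Subset-rename-merge result, matching reshape()'s column naming
via.loop <- data.frame(id = unique(toy$id))
for (lev in unique(toy$school.subject)) {
  piece <- toy[toy$school.subject == lev, c("id", "score")]
  names(piece)[2] <- paste("score", lev, sep = ".")
  via.loop <- merge(via.loop, piece, by = "id", all = TRUE)
}

# Should be TRUE: same values once rows and columns are aligned
same <- all.equal(via.reshape[order(via.reshape$id), names(via.loop)],
                  via.loop, check.attributes = FALSE)
```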

# Clean up workspace
remove(data.long, master, wide.reshape, wide.dcast, wide.new, id, n, s)