This post is an update to my *Big Data Wrangling: Reshaping from Long to Wide* post from May 2015. The original *Big Data Wrangling* post is among the most frequently viewed on this blog, so I suspect that lots of people are looking for efficient ways to reshape datasets from long to wide. This post presents a faster way to do that than the original post proposed, and uses a benchmarking package that helps quantify the time associated with different approaches.

### Generate simulated data

Just as I did in the original post, I'm creating simulated data here modeled after the High School and Beyond dataset.

```r
options(scipen = 999)

n <- 100000  # Set the sample size.
             # You can tinker with this if you want to test performance
             # with different sample sizes.

id     <- seq(1, n, by = 1)
female <- sample(0:1, n, replace = TRUE)
race   <- sample(1:4, n, replace = TRUE)
prog   <- sample(1:3, n, replace = TRUE)
schtyp <- sample(1:2, n, replace = TRUE)
ses    <- sample(1:3, n, replace = TRUE)

master <- cbind(id, female, race, prog, schtyp, ses)
master <- as.data.frame(master)
remove(female, race, prog, schtyp, ses)

school.subject <- c("math", "read", "science", "socst", "write")
data.long <- merge(as.data.frame(id), as.data.frame(school.subject), all = TRUE)
remove(school.subject)

data.long <- merge(master, data.long, by = c("id"), all = TRUE)
data.long <- data.long[order(data.long$id, data.long$school.subject), ]
rownames(data.long) <- NULL

s <- n * 5
data.long$score <- sample(40:60, size = s, replace = TRUE)
remove(id, n, s)
```

### Create a function from my previous subset-and-merge approach

In the original post, I introduced a subset-and-merge approach that I'd found worked faster and more reliably on large datasets than either reshape() or dcast(). To test this approach with the microbenchmark() function below, I quickly wrap the subset-and-merge steps in a function. *Please note that I do not intend this to be a standalone function for reshaping a dataset from long to wide. The function I create here uses variable names and syntax that are unique to this example.* My objective is simply to encapsulate all of the steps from the subset-and-merge approach into one function call.

```r
# Create the function
library(plyr)

subset.and.merge <- function(data.long) {
  wide.new <- subset(master, select = c("id"))
  row.names(wide.new) <- NULL
  repeating <- unique(data.long$school.subject)
  # Subset by school subject, rename, and recombine (wide)
  for (i in 1:length(repeating)) {
    level <- repeating[i]
    data <- subset(data.long, school.subject == level, select = c(id, score))
    var.name <- paste(level, "score", sep = ".")
    data <- rename(data, c("score" = var.name))
    wide.new <- merge(wide.new, data, by = c("id"), sort = TRUE, all = TRUE)
  }
  wide.new <- merge(master, wide.new, by = c("id"), sort = TRUE, all = TRUE)
  # Save output to the global environment if desired:
  # assign("wide.new", wide.new, envir = .GlobalEnv)
}
```

### Benchmark functions that reshape from long to wide

The code below compares the three approaches from the original post (reshape(), dcast(), and subset.and.merge()) with a new approach: the spread() function from the tidyr package.

The arguments for the spread() function are delightfully straightforward: *key* refers to the variable in the long dataset that you’d like to use to name the columns in the wide dataset and *value* refers to the name of the column in the long dataset whose values will populate the *key* columns in the wide dataset.
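To make *key* and *value* concrete before running the full benchmark, here is a tiny illustration on a toy dataset of my own invention (the toy.long and toy.wide names are not part of the benchmark code):

```r
library(tidyr)

# A tiny long dataset: two students, two subjects each
toy.long <- data.frame(id      = c(1, 1, 2, 2),
                       subject = c("math", "read", "math", "read"),
                       score   = c(52, 48, 57, 44))

# key = subject supplies the new column names;
# value = score supplies the values that fill those columns
toy.wide <- spread(toy.long, key = subject, value = score)
toy.wide
#   id math read
# 1  1   52   48
# 2  2   57   44
```

Each distinct value of the *key* column becomes its own column in the wide dataset, with one row per id.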

I'm using the microbenchmark package and function to compare reshape(), dcast(), subset.and.merge(), and spread(). microbenchmark() offers a really lovely way to compare the speed of different code snippets over multiple trials. Please note that the code below isn't actually going to output the reshaped dataset using the optimal approach; it just tells us which method is fastest. We can then use that method moving forward.

On my machine, with a sample size of 100,000, this code takes ~35 minutes to run. If you want to try out the code but have it run faster, just reduce the sample size.

```r
library(microbenchmark)
library(reshape2)
library(tidyr)

results <- microbenchmark(
  reshape = reshape(data.long, timevar = "school.subject",
                    idvar = c("id", "female", "prog", "race", "schtyp", "ses"),
                    direction = "wide"),
  dcast = dcast(data.long, id + female + prog + race + schtyp + ses ~ school.subject,
                value.var = "score"),
  subset.and.merge = subset.and.merge(data.long),
  spread = spread(data.long, key = school.subject, value = score),
  times = 100L,  # number of times to evaluate each expression
  unit = "s"     # print() output in seconds; the time variable itself is still measured in nanoseconds
)
```

We can visualize the results from the microbenchmark analysis in a couple of ways. print(results) will give us the summary statistics for each expression evaluated.
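For readers who haven't used the package before, here is a minimal sketch of what print() reports, benchmarking two toy expressions rather than the full reshaping comparison (the x, toy.results, and s names are mine, not part of the post's main code):

```r
library(microbenchmark)

# Compare a built-in sum against an explicit loop over the same vector
x <- runif(1000)
toy.results <- microbenchmark(
  builtin = sum(x),
  loop    = { s <- 0; for (v in x) s <- s + v; s },
  times   = 50L
)
print(toy.results)  # prints min, lq, mean, median, uq, and max times per expression
```

The summary makes it easy to compare not just typical speed (the median) but also the spread of times across trials.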

autoplot() is a microbenchmark function that presents one way to visualize the time it took each approach to accomplish the task. You can ask to see time log-transformed or in its original units. I find the log-transformed units make it a little easier to detect differences at the low end of the spectrum, so I use that option here.

```r
autoplot(results, log = TRUE)
```

Lastly, a scatterplot of the times associated with each trial is another way to visualize the data.

```r
library(ggplot2)

ggplot(data = results, aes(y = log(time), x = seq_along(time), colour = expr)) +
  geom_point() +
  scale_color_discrete(name = "Expressions\nEvaluated") +
  ylab("Nanoseconds (log)") +
  xlab("Microbenchmarking Trial") +
  ggtitle("Speed of Four Approaches to Reshaping\nA 100,000-Observation Dataset from Long to Wide") +
  theme(axis.title = element_text(face = "bold")) +  # bold axis labels
  theme(plot.title = element_text(face = "bold"))    # bold plot title
```

Overall, it’s clear that dcast() and spread() are faster than the subset-and-merge approach, which is faster than reshape().

When I came up with the subset-and-merge approach, I'd had problems using both reshape() and dcast() with large datasets. dcast() performed pretty well when I was tinkering with it for this post, though, even when I bumped the sample size up to 10,000,000, so I'm not sure if the problems I had at the time were specific to the dataset I was working with, or whether the function has been revised since then to work a little better with large datasets. I should note that when I bumped the sample size up, spread() did edge out dcast() by a hair.

Based upon these findings, tidyr’s spread() function looks to be the best reshaping approach (that I’m aware of) at this time.

### Obtain reshaped data using the tidyr spread() function

```r
library(tidyr)

data.wide <- spread(data.long, key = school.subject, value = score)
```

Many thanks to Lucia Gjeltema and Mine Cetinkaya-Rundel for their excellent presentations on microbenchmarking and tidyr, respectively, which together inspired this post.

Complete code for this blog post can be found on GitHub at: https://github.com/kc001/Blog_code/blob/master/2016.09%20%20REVISED%20Big%20Data%20Wrangling%20-%20Reshaping%20Long%20to%20Wide.R.