Reformat Time-Series Excel Table: Complete Code

This post, the last in a sequence of four, combines the code samples from the previous two posts and resolves the lingering issue that R interprets the column with the counts of active addresses as a character variable. In the code below, I set up a unique identifier for each zip code-parish combination and move the parish information into a variable before reshaping the data long; handling these steps first is more concise than doing them afterwards, and the unique identifier makes the reshape come out properly.
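
As a rough illustration of the combined workflow, here is a minimal sketch that uses a small mock data frame in place of the Excel import; the column names and header-row layout are assumptions for the example, not the post's actual code.

```r
library(dplyr)
library(tidyr)

# Mock import: parish names sit on their own rows above the zip codes they
# contain, and the monthly counts arrive as character strings
raw <- tibble::tribble(
  ~zip,        ~jan_2017, ~feb_2017,
  "Orleans",   NA,        NA,
  "70112",     "1503",    "1489",
  "70113",     "2210",    "2198",
  "Jefferson", NA,        NA,
  "70001",     "3102",    "3110"
)

tidy <- raw %>%
  # Flag the section-header rows (no counts) and carry the parish name down
  mutate(parish = if_else(is.na(jan_2017), zip, NA_character_)) %>%
  fill(parish) %>%
  filter(zip != parish) %>%
  # Unique identifier for each zip code-parish combination
  mutate(id = paste(parish, zip, sep = "_")) %>%
  # Reshape long, then convert the counts from character to numeric
  pivot_longer(c(jan_2017, feb_2017),
               names_to = "month", values_to = "active_addresses") %>%
  mutate(active_addresses = as.numeric(active_addresses))
```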

Read More

Reformat Time-Series Excel Table: Convert Section Headers to a Variable

This post, the third in a sequence of four, addresses the challenge of moving data embedded in section headers to a variable, such that the section header information appears on the same row of data as the observations to which it applies. In this example, the section headers contain parish information, and the observations are zip codes within those parishes. To keep things simple, we'll ignore the data transformation issues discussed in the last post; the next post will bring everything together.
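
As a concrete (if simplified) illustration, here is one way to do this in R; the column names and the five-digit-zip test for spotting header rows are assumptions for the example, not necessarily the post's approach.

```r
library(dplyr)
library(tidyr)

# Mock data: parish names occupy their own rows above the zip codes they contain
df <- data.frame(
  zip       = c("Orleans", "70112", "70113", "Jefferson", "70001"),
  addresses = c(NA, 1503, 2210, NA, 3102),
  stringsAsFactors = FALSE
)

df_clean <- df %>%
  # Treat any row whose 'zip' value is not a five-digit code as a parish header
  mutate(parish = if_else(grepl("^\\d{5}$", zip), NA_character_, zip)) %>%
  # Carry each header down over the zip rows beneath it, then drop the header rows
  fill(parish) %>%
  filter(grepl("^\\d{5}$", zip))
```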

Read More

Reformat Time-Series Excel Table: Transpose and Handle Dates

This post, the second in a sequence of four, works with the New Orleans active addresses dataset introduced in the last post and addresses the challenge of transposing the data long while preserving the date information. 

The challenge is that the date information is spread out over two rows (one for year and one for month), and we want to make sure that when we flip the data long, the date information is connected to the correct values. Additionally, some of the year information is missing, so we'll also need to fill this in. 
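
Here is a minimal sketch of one way to approach this, with a mock layout (years in the first row, months in the second, counts below) and made-up names standing in for the actual data; zoo::na.locf() carries the last observed year forward to fill the gaps.

```r
library(dplyr)
library(tidyr)

# Mock import: row 1 holds years (only where a new year starts), row 2 holds
# months, and the remaining rows hold zip codes and their counts
raw <- data.frame(
  V1 = c(NA,     NA,    "70112", "70113"),
  V2 = c("2016", "Nov", "1480",  "2190"),
  V3 = c(NA,     "Dec", "1490",  "2201"),
  V4 = c("2017", "Jan", "1503",  "2210"),
  stringsAsFactors = FALSE
)

years  <- unlist(raw[1, -1])
months <- unlist(raw[2, -1])

# Fill in the missing years by carrying the last non-missing year forward
years <- zoo::na.locf(years)

# Stitch year and month into column names, then reshape long
dat <- raw[-(1:2), ]
names(dat) <- c("zip", paste(years, months, sep = "_"))

long <- dat %>%
  pivot_longer(-zip, names_to = "date", values_to = "active_addresses") %>%
  mutate(active_addresses = as.numeric(active_addresses))
```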

Read More

Big Data Wrangling: Reshaping from Long to Wide

Reshaping datasets from wide to long in R tends to work smoothly regardless of the size of the dataset, but reshaping from long to wide can break (or take so long you wonder if it's stopped working) with large datasets. The threshold at which this problem arises will vary depending upon your system and memory allocation. I find that it occurs with datasets of ~25,000 rows or more with the default heap size and with datasets of ~1 million rows or more with maximum heap allocation.

This post shares an alternative approach that resolves size-related limitations when reshaping large datasets from long to wide. The essence of the solution is this: subset the data based upon the levels of repeated assessments, rename the measured variable to something unique to that assessment, and then merge the data for the separate assessments back together. Although the reshape() and dcast() code for this task is more concise, the subsetting approach described here doesn't stall on very large datasets.
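
Here is a minimal sketch of that idea in base R, with placeholder variables (id, time, score) standing in for an actual dataset.

```r
# Toy long dataset: one row per person per assessment
long <- data.frame(
  id    = rep(1:4, each = 3),
  time  = rep(c("t1", "t2", "t3"), times = 4),
  score = rnorm(12)
)

# Subset the data by assessment and rename the measure uniquely for each one
pieces <- lapply(unique(long$time), function(tt) {
  piece <- long[long$time == tt, c("id", "score")]
  names(piece)[2] <- paste0("score_", tt)   # e.g., score_t1, score_t2, score_t3
  piece
})

# Merge the per-assessment pieces back together on the identifier
wide <- Reduce(function(x, y) merge(x, y, by = "id"), pieces)
```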

If you've ever found yourself waiting for minutes or hours while R chews on a reshape() or dcast() command, hoping the program hasn't silently stalled out, there's hope!

Read More