This post grapples with a common challenge: combining datasets that should in theory match, but that in practice designate the same entity inconsistently, use slightly different variable names, and differ in which variables they include or how those variables are scaled. If you regularly work with data, you're probably very familiar with these issues. This post includes some tricks I use to identify and resolve the differences in this kind of situation.
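As a minimal sketch of the idea, the snippet below standardizes merge keys in base R before joining. The datasets, column names, and cleaning rules here are hypothetical stand-ins, not the ones from the post:

```r
# Hypothetical example: two datasets that should match but use
# inconsistent names for the same entities.
sales <- data.frame(Company   = c(" Acme Corp.", "beta llc"),
                    revenue   = c(100, 200))
staff <- data.frame(company   = c("ACME CORP", "Beta LLC "),
                    employees = c(10, 20))

# A simple key-standardizing helper: lowercase, trim whitespace,
# and drop punctuation so "Acme Corp." matches "ACME CORP".
clean_key <- function(x) gsub("[[:punct:]]", "", trimws(tolower(x)))

sales$key <- clean_key(sales$Company)
staff$key <- clean_key(staff$company)

# setdiff() quickly surfaces entities that still fail to match;
# character(0) means every key found a partner.
setdiff(sales$key, staff$key)

merged <- merge(sales, staff[, c("key", "employees")], by = "key")
```

The same `setdiff()` check works on column names (`setdiff(names(a), names(b))`) to spot variables that differ between the two datasets.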
This post is an update to my Big Data Wrangling: Reshaping from Long to Wide post from May 2015. The original Big Data Wrangling post is among the most frequently viewed on this blog, so I suspect that lots of people are looking for efficient ways to reshape datasets from long to wide. This post presents a faster way to do that than the original post proposed, and uses a benchmarking package that helps quantify the time associated with different approaches.
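For readers unfamiliar with the task, here is a toy long-to-wide reshape in base R, with a rough timing via `system.time()`. This is a generic illustration with made-up data, not the post's benchmark; the post itself uses a dedicated benchmarking package for more reliable comparisons:

```r
# A toy long dataset: 3 subjects, each measured at 4 assessments.
long <- data.frame(id    = rep(1:3, each = 4),
                   time  = rep(1:4, times = 3),
                   score = rnorm(12))

# Base-R reshape from long to wide: one row per id, one score
# column per assessment (score.1 through score.4).
wide <- reshape(long, idvar = "id", timevar = "time",
                direction = "wide")

# Crude timing with base R; a benchmarking package repeats the
# expression many times and reports the distribution of timings.
system.time(reshape(long, idvar = "id", timevar = "time",
                    direction = "wide"))
```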
The US Department of Labor’s Bureau of Labor Statistics (BLS) website is a treasure trove of economic data. There are datasets on everything from the Consumer Price Index to how Americans spend their time. There’s so much there, in fact, that it can be a bit overwhelming to navigate.
This post provides a little guidance, based upon my experience using the site, and includes a function to pull BLS data directly into R.
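As a rough sketch of what pulling BLS data involves (not the post's function), the BLS Public Data API v2 accepts a JSON POST specifying series IDs and a year range. The helper below just builds that payload in base R; sending it and parsing the response would typically use a package such as httr or curl:

```r
# Hypothetical helper: build the JSON payload for the BLS Public
# Data API v2 (https://api.bls.gov/publicAPI/v2/timeseries/data/).
bls_payload <- function(series_id, start_year, end_year) {
  sprintf('{"seriesid":["%s"],"startyear":"%d","endyear":"%d"}',
          series_id, start_year, end_year)
}

# CUUR0000SA0 is the series ID for the all-items CPI-U.
payload <- bls_payload("CUUR0000SA0", 2014, 2015)
```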
This post, the last in a sequence of four, combines the code samples from the previous two posts and resolves a lingering issue: R interprets the column with the counts of active addresses as a character variable. In the code below, I set up the unique identifier for each zip code-parish combination and move the parish information into a variable before reshaping the data long; doing these steps first is more concise than doing them afterwards, and the unique identifier ensures the data reshape properly.
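The general shape of those steps looks like this sketch, using a made-up two-month slice rather than the real Data Center table; the column names and values are hypothetical:

```r
# Hypothetical wide slice: one row per zip code, parish already in
# its own column, one address-count column per month.
wide <- data.frame(zip         = c("70112", "70114"),
                   parish      = c("Orleans", "Orleans"),
                   `2005-06`   = c(1500, 2200),
                   `2005-07`   = c(1400, 2100),
                   check.names = FALSE)

# A unique identifier for each zip-parish combination keeps the
# reshape unambiguous even if a zip code ever spans parishes.
wide$id <- paste(wide$zip, wide$parish, sep = "_")

long <- reshape(wide,
                idvar     = "id",
                varying   = c("2005-06", "2005-07"),
                v.names   = "active_addresses",
                times     = c("2005-06", "2005-07"),
                direction = "long")

# In the real data the counts import as character (e.g. "1,500");
# stripping commas before converting fixes that:
as.numeric(gsub(",", "", "1,500"))
```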
This post, the third in a sequence of four, addresses the challenge of moving data embedded in section headers to a variable, such that the section header information appears on the same row of data as the observations to which it applies. In this example, the section headers contain parish information, and the observations are zip codes within those parishes. To keep things simple, we'll ignore the data transformation issues discussed in the last post; the next post will bring everything together.
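A base-R sketch of the fill-down idea, on invented data (the real table and the post's code differ in the details): flag the header rows, carry each parish name down to the zip-code rows beneath it, then drop the header rows.

```r
# Hypothetical raw import: parish names sit on their own rows
# above the zip codes they apply to.
raw <- data.frame(col1  = c("Orleans Parish", "70112", "70114",
                            "Jefferson Parish", "70001"),
                  count = c(NA, 1500, 2200, NA, 3100),
                  stringsAsFactors = FALSE)

# Flag header rows, then use cumsum() indexing as a base-R
# "last observation carried forward" to fill the parish down.
is_header  <- grepl("Parish$", raw$col1)
raw$parish <- raw$col1[is_header][cumsum(is_header)]

# Drop the header rows now that their information lives in a column.
clean <- raw[!is_header, ]
```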
This post, the second in a sequence of four, works with the New Orleans active addresses dataset introduced in the last post and addresses the challenge of reshaping the data long while preserving the date information.
The challenge is that the date information is spread out over two rows (one for year and one for month), and we want to make sure that when we flip the data long, the date information is connected to the correct values. Additionally, some of the year information is missing, so we'll also need to fill this in.
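A minimal sketch of the two-row problem, with invented header rows (the real spreadsheet's layout differs): fill the missing years forward, then paste year and month into one label per column.

```r
# Hypothetical header rows as they come out of the spreadsheet:
# the year appears only when it changes; the month sits below it.
year_row  <- c("2005", NA, NA, "2006", NA)
month_row <- c("Jun", "Jul", "Aug", "Jan", "Feb")

# Fill missing years forward: each NA inherits the most recent
# non-missing year via cumsum() indexing.
filled_year <- year_row[!is.na(year_row)][cumsum(!is.na(year_row))]

# Paste year and month into a single label per value column, so the
# date survives the reshape intact.
dates <- paste(filled_year, month_row, sep = "-")
dates  # "2005-Jun" "2005-Jul" "2005-Aug" "2006-Jan" "2006-Feb"
```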
The following couple of posts address a common data science challenge: when sourcing data over the internet or from disparate departments within an organization, it's often necessary to substantially reformat the data before analysis. The Excel table considered in these posts, available online courtesy of The Data Center, features monthly counts of active postal addresses, by zip code, in New Orleans during the decade after Hurricane Katrina.
Reshaping datasets from wide to long in R tends to work smoothly regardless of the size of the dataset, but reshaping from long to wide can break (or take so long you wonder if it's stopped working) with large datasets. The threshold at which this problem arises will vary depending upon your system and memory allocation. I find that it occurs with datasets of ~25,000 rows or more with the default heap size and with datasets of ~1 million rows or more with maximum heap allocation.
This post shares an alternative approach that resolves size-related limitations when reshaping large datasets from long to wide. The essence of the solution is this: subset the data based upon the levels of repeated assessments, rename the measured variable to something unique to that assessment, and then merge the data for the separate assessments back together. Although the reshape() and dcast() code for this task is more concise, the subsetting approach shown here doesn't stall on very large datasets.
If you've ever found yourself waiting minutes or hours while R chews on a reshape() or dcast() command, wondering whether the program had silently stalled out, there's hope!
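The subset-rename-merge idea described above can be sketched in a few lines of base R. This toy example uses three assessments and nine rows; the post's data are orders of magnitude larger, but the logic is the same:

```r
# A toy long dataset: 3 ids measured at 3 assessments.
long <- data.frame(id    = rep(1:3, times = 3),
                   time  = rep(1:3, each = 3),
                   score = 1:9)

# Subset by assessment, rename the measured variable to something
# unique to that assessment, then merge the pieces back together.
pieces <- lapply(unique(long$time), function(t) {
  piece <- long[long$time == t, c("id", "score")]
  names(piece)[2] <- paste0("score.", t)
  piece
})
wide <- Reduce(function(x, y) merge(x, y, by = "id"), pieces)
```

Each subset-and-merge step touches only one assessment's rows, which is why this route avoids the memory blow-up that a single reshape() or dcast() call can hit on very large datasets.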