If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however: as the number of units increases, so does the processing time, and for large data sets a sequential for loop can become so slow as to be unworkable. Parallel processing is an excellent alternative in these situations. It uses your computer's multiple processing cores to run the for loop code simultaneously across your list of units. This post presents code to:
- Perform an analysis using a conventional for loop.
- Modify this code for parallel processing.
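As a minimal sketch of that conversion, the snippet below runs a toy analysis first with a conventional for loop and then in parallel with base R's parallel package; `units` and `analyze_unit()` are hypothetical stand-ins for your own list of units and analysis code, not the post's actual example.

```r
library(parallel)

units <- 1:8
analyze_unit <- function(u) u^2  # placeholder for your per-unit analysis

# Conventional for loop: one unit at a time
results_seq <- vector("list", length(units))
for (i in seq_along(units)) {
  results_seq[[i]] <- analyze_unit(units[i])
}

# Parallel equivalent: the same function, spread across a cluster of cores
cl <- makeCluster(max(1, detectCores() - 1))
results_par <- parLapply(cl, units, analyze_unit)
stopCluster(cl)
```

Both approaches return the same results; the parallel version simply divides the units among workers.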
To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.
The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years? Read More
This post, the last in a sequence of four, combines the code samples from the previous two posts and resolves the lingering issue that R interprets the column with the counts of active addresses as a character variable. In the code below, I set up the unique identifier for each zip code-parish combination and move the parish information into a variable before reshaping the data long, as this is more concise than trying to do these things afterwards, and leverages the unique identifier to reshape the data properly. Read More
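The shape of that workflow, sketched on a tiny made-up table (the column names and values are hypothetical stand-ins for the transformed Data Center data), looks like this:

```r
# Hypothetical wide data: one row per zip, counts stored as character
addresses <- data.frame(
  zip      = c("70112", "70113"),
  parish   = c("Orleans", "Orleans"),
  jun_2015 = c("1500", "2200"),
  jul_2015 = c("1510", "2230"),
  stringsAsFactors = FALSE
)

# Unique zip code-parish identifier, created before reshaping
addresses$id <- paste(addresses$zip, addresses$parish, sep = "-")

# Reshape long; the identifier ties each count to its zip-parish combination
long <- reshape(addresses, direction = "long",
                varying = c("jun_2015", "jul_2015"),
                v.names = "active",
                times   = c("jun_2015", "jul_2015"),
                idvar   = "id")

# Resolve the character-variable issue by converting the counts to numeric
long$active <- as.numeric(long$active)
```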
This post, the third in a sequence of four, addresses the challenge of moving data embedded in section headers to a variable, such that the section header information appears on the same row of data as the observations to which it applies. In this example, the section headers contain parish information, and the observations are zip codes within those parishes. To keep things simple, we'll ignore the data transformation issues discussed in the last post; the next post will bring everything together. Read More
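One common way to accomplish this move, sketched here on a hypothetical stand-in for the raw table (the real post's code and layout may differ): flag the header rows, carry each header value forward onto the zip-code rows beneath it, then drop the header rows.

```r
# Hypothetical raw layout: parish names sit on their own rows above their zips
raw <- data.frame(
  col1 = c("Orleans Parish", "70112", "70113", "Jefferson Parish", "70001"),
  stringsAsFactors = FALSE
)

# Header rows are the ones that aren't five-digit zip codes
is_header <- !grepl("^[0-9]{5}$", raw$col1)

# Carry each header value forward onto the rows beneath it
headers    <- raw$col1[is_header]
raw$parish <- headers[cumsum(is_header)]

# Keep only the observation (zip code) rows
tidy <- raw[!is_header, c("parish", "col1")]
names(tidy)[2] <- "zip"
```

Each zip code now carries its parish on the same row.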
This post, the second in a sequence of four, works with the New Orleans active addresses dataset introduced in the last post and addresses the challenge of transposing the data long while preserving the date information.
The challenge is that the date information is spread out over two rows (one for year and one for month), and we want to make sure that when we flip the data long, the date information is connected to the correct values. Additionally, some of the year information is missing, so we'll also need to fill this in. Read More
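A minimal sketch of the header-repair step, using made-up values (the real table's headers differ): fill the missing years by carrying the last seen year forward, then paste the two rows into one label per column so the reshape can connect each count to its date.

```r
# Hypothetical header rows: only the first month of each year carries a year
years  <- c("2015", NA, NA, "2016", NA)
months <- c("Jun", "Jul", "Aug", "Jan", "Feb")

# Fill in the missing years by carrying the last observed year forward
filled <- years
for (i in seq_along(filled)) {
  if (is.na(filled[i])) filled[i] <- filled[i - 1]
}

# Combine the two rows into a single date label per column
date_labels <- paste(months, filled, sep = "_")
```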
The following couple of posts address a common data science challenge: when sourcing data over the internet or from disparate departments within an organization, it's often necessary to substantially reformat the data before analysis. The Excel table considered in these posts, available online courtesy of The Data Center, features monthly counts of active postal addresses, by zip code, in New Orleans during the decade after Hurricane Katrina. Read More
Reshaping datasets from wide to long in R tends to work smoothly regardless of the size of the dataset, but reshaping from long to wide can break (or take so long you wonder if it's stopped working) with large datasets. The threshold at which this problem arises will vary depending upon your system and memory allocation. I find that it occurs with datasets of ~25,000 rows or more with the default heap size and with datasets of ~1 million rows or more with maximum heap allocation.
This post shares an alternative approach that resolves size-related limitations when reshaping large datasets from long to wide. The essence of the solution is this: subset the data based upon the levels of repeated assessments, rename the measured variable to something unique to that assessment, and then merge the data for the separate assessments back together. Although the reshape() and dcast() code for this task is more concise, the subsetting approach described here doesn’t stall for very large datasets.
In the event that you’ve found yourself waiting for minutes or hours while R chews on a reshape() or dcast() command, hoping that the program hadn’t silently stalled out, there's hope! Read More
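The subset-rename-merge idea can be sketched as follows; the toy long dataset and column names here are hypothetical stand-ins, not the post's actual data.

```r
# Hypothetical long data: three ids, each measured at two assessments
long <- data.frame(
  id    = rep(1:3, each = 2),
  time  = rep(c("t1", "t2"), times = 3),
  score = c(10, 12, 20, 21, 30, 35)
)

# Subset by level of the repeated assessment, renaming the measure uniquely
pieces <- lapply(unique(long$time), function(t) {
  piece <- long[long$time == t, c("id", "score")]
  names(piece)[2] <- paste0("score_", t)
  piece
})

# Merge the separate assessments back together into the wide layout
wide <- Reduce(function(x, y) merge(x, y, by = "id"), pieces)
```

Because each step is a simple subset or merge, memory use stays modest even when the long dataset is very large.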
Suppose you’re working with data that includes dates (e.g., birth dates, start or stop dates for a project or customer account, graduation dates, etc.) and you want to flag those observations whose dates meet some criterion related to today's date. For example, you’re working with customer account data, and you want to identify those customer accounts that were closed in the past year. To flag recently-closed accounts, you need to test the account close date against a date representing one year ago today, but given that time keeps passing, the date that represents one year ago today keeps changing, too. If you’re going to be re-running your code periodically, you’ll want the program to automatically update the test date based upon the current date. (The alternative is manually updating the test date each time you run the program, which is inefficient and also susceptible to error.)
This post presents some clean, simple code that will update a date-related field using today's date as the reference point. Read More
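The core of such an approach, sketched with a hypothetical `accounts` data frame (the post's own variable names may differ): compute the reference date from Sys.Date() each run, so the flag stays current without manual edits.

```r
# Hypothetical account data: one date far in the future, one far in the past,
# so the flags are stable regardless of when the code is run
accounts <- data.frame(
  close_date = as.Date(c("2099-01-01", "1900-01-01"))
)

# One year ago today, recomputed every time the script runs;
# seq() handles month/day arithmetic (e.g., leap days) correctly
one_year_ago <- seq(Sys.Date(), length.out = 2, by = "-1 year")[2]

# Flag accounts closed within the past year
accounts$recently_closed <- accounts$close_date >= one_year_ago
```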
It’s incredibly useful to be able to automate an analysis or set of analyses that you want to perform multiple times in exactly the same way. For example, if you’re working in industry, you might want to perform analyses that allow you to draw separate conclusions about the performance of individual stores, regions, products, customers, or employees. If you’re working in academia, you might want to separately examine multiple, different dependent variables. Frequently, this may entail several distinct steps, such as subsetting the data, performing the analysis or set of analyses, generating well-labeled output, etc.
This post presents one approach for feeding R a list of units to loop through, and then iteratively performing the same set of tasks for each unit. Read More
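A minimal sketch of that structure, with a hypothetical `sales` data frame standing in for your own data: subset on each unit, run the same analysis, and label the output as you go.

```r
# Hypothetical data: revenue observations for two stores
sales <- data.frame(
  store   = rep(c("A", "B"), each = 3),
  revenue = c(100, 120, 90, 200, 210, 190)
)

# Feed R the list of units, then perform the same steps for each one
results <- c()
for (unit in unique(sales$store)) {
  subset_data   <- sales[sales$store == unit, ]                # 1. subset the data
  results[unit] <- mean(subset_data$revenue)                   # 2. perform the analysis
  cat("Store", unit, "mean revenue:", results[unit], "\n")     # 3. well-labeled output
}
```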
This post presents code to give the user a quick overview of a numeric variable with one function call. The code, which can easily be modified for your specific needs, currently includes information about the amount of missing data, mean and standard deviation (applicable when the distribution is normally distributed), median score and deciles, unique values of the variable, and the shape of the distribution. Read More
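A stripped-down sketch of what such a function might look like; `numeric_overview()` is a hypothetical name, and the post's actual version includes more (such as a view of the distribution's shape).

```r
# Hypothetical one-call overview of a numeric variable
numeric_overview <- function(x) {
  list(
    n_missing = sum(is.na(x)),
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    deciles   = quantile(x, probs = seq(0, 1, 0.1), na.rm = TRUE),
    n_unique  = length(unique(x[!is.na(x)]))
  )
}

ov <- numeric_overview(c(1, 2, 2, 3, NA))
```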
This post presents a function designed to give the user a quick overview of a factor variable with one function call, including missing data, the levels of the factor, and the frequency with which each level appears in the data. Read More
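In the same spirit as the numeric version, a minimal sketch might look like this; `factor_overview()` is a hypothetical name, not the post's actual function.

```r
# Hypothetical one-call overview of a factor variable
factor_overview <- function(x) {
  list(
    n_missing = sum(is.na(x)),
    levels    = levels(x),
    freq      = table(x, useNA = "no")
  )
}

f  <- factor(c("a", "b", "a", NA))
ov <- factor_overview(f)
```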