This post is the technical accompaniment to Multilevel Models to Explore the Impact of the Affordable Care Act’s Shared Savings Program, Part I.
This post covers web scraping with R, using a RESTful web service API in conjunction with R’s RCurl and XML packages. It is the first in a two-part series in which I’m looking to answer the question: which technical skills are most in-demand among data scientists?
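To give a flavor of the approach, here is a minimal sketch of querying a RESTful API and parsing the XML response with RCurl and XML. The endpoint URL and XPath expression are hypothetical placeholders, not the actual service used in the post.

```r
library(RCurl)
library(XML)

# Hypothetical endpoint -- substitute the actual RESTful service's URL
url <- "https://api.example.com/jobs?keywords=data+scientist&format=xml"

# Fetch the raw XML response over HTTPS
raw_xml <- getURL(url, .opts = list(ssl.verifypeer = TRUE))

# Parse the response and pull out the nodes of interest with XPath
doc    <- xmlParse(raw_xml, asText = TRUE)
titles <- xpathSApply(doc, "//job/title", xmlValue)  # hypothetical node structure

head(titles)
```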
Customer segmentation is a deceptively simple-sounding concept. Broadly speaking, the goal is to divide customers into groups that share certain characteristics. There are an almost-infinite number of characteristics upon which you could divide customers, however, and the optimal characteristics and analytic approach vary depending upon the business objective. This means that there is no single, correct way to perform customer segmentation.
In this post, I work through a practical example that, in my experience, closely mirrors the challenges of performing this kind of analysis with real data.
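As one illustration of how a segmentation might be run (k-means is just one of many possible approaches, and not necessarily the one used in the post), here is a minimal sketch on hypothetical customer data:

```r
# Hypothetical customer data: spend, visit frequency, and tenure
set.seed(42)
customers <- data.frame(
  spend     = rgamma(500, shape = 2, scale = 50),
  frequency = rpois(500, lambda = 6),
  tenure    = runif(500, min = 0, max = 10)
)

# Scale the features so no single variable dominates the distance metric
scaled <- scale(customers)

# One of many possible segmentations: k-means with four clusters
fit <- kmeans(scaled, centers = 4, nstart = 25)

# Attach segment labels and inspect segment-level means
customers$segment <- factor(fit$cluster)
aggregate(. ~ segment, data = customers, FUN = mean)
```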
Rendering graphics typically takes R some time, so if you’re going to be producing a large number of similar graphics, it makes sense to leverage R's parallel processing capabilities. However, if you’re looking to collect and return the graphics together in a sorted object – as we were in the previous post on animated choropleths – there’s a catch. R has to keep the whole object in random access memory (RAM) during parallel processing. As the number of graphics files increases, you risk exceeding the available RAM, which will cause parallel processing to slow dramatically (or crash). In contrast, a good, old-fashioned sequential for loop can write updates to the object to the global environment after each iteration, clearing RAM for the next iteration. Paradoxically, then, parallel processing can take longer than sequential processing in this situation. In the case of the animated choropleths in the previous post, parallel processing took 21 minutes, whereas sequential processing took 11 minutes.
This post presents code to combine the efficiency and speed of parallel processing with the RAM-clearing benefits of sequential processing when generating graphics.
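One way to get the best of both, sketched below under the assumption that the doParallel and foreach packages are available: have each parallel worker render its plot straight to a file on disk and return only the file name, so the large graphics objects never pile up in RAM. The file-naming scheme and plotting code are placeholders.

```r
library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Each worker writes its plot straight to disk and returns only the file
# name, so the rendered graphics never accumulate in RAM
files <- foreach(i = 1:100, .combine = c) %dopar% {
  f <- sprintf("plot_%03d.png", i)             # placeholder file naming scheme
  png(f, width = 800, height = 600)
  plot(rnorm(1000), main = paste("Plot", i))   # placeholder plotting code
  dev.off()
  f
}

stopCluster(cl)
```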
This post demonstrates how to map change in a variable over time in a geographic area, allowing the user to scroll through time and selectively view dates of interest. It produces an interactive choropleth map, as the last post did, but whereas the last post was interactive in the sense that the user could zoom in on a specific geographic area, this map is interactive in the sense that the user can ‘zoom in’ on a specific point in time.
Some context for this map: in the wake of Hurricane Katrina, multiple New Orleans committees generated plans to rebuild the city; in some cases, these plans involved shifting the city’s footprint to move citizens out of more topographically vulnerable areas. The sequence of maps produced here answers two questions: how quickly did various New Orleans zip codes re-populate after Hurricane Katrina, and how does the city’s current address density relate to pre-Katrina levels?
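As a rough sketch of the general technique (not necessarily the exact packages used in the post), here is how one might build such a map with leaflet, adding one polygon layer per date and radio buttons to switch between them. The `zips` object and its `density_*` columns are hypothetical.

```r
library(leaflet)

# Assumes `zips` is a spatial object (e.g., sf) with zip-code polygons and one
# hypothetical address-density column per date: density_2005, density_2010, ...
dates <- c("2005", "2010", "2015")
pal   <- colorNumeric("YlOrRd", domain = c(0, 4000))  # set to your data's range

m <- leaflet(zips)
for (d in dates) {
  vals <- zips[[paste0("density_", d)]]
  m <- addPolygons(m, fillColor = pal(vals), fillOpacity = 0.8,
                   weight = 1, color = "white", group = d)
}

# Radio buttons let the user 'zoom in' on a single point in time
m <- addLayersControl(m, baseGroups = dates,
                      options = layersControlOptions(collapsed = FALSE))
m
```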
If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however: as the number of units increases, so does the processing time. For large data sets, the processing time associated with a sequential for loop can become prohibitively long. Parallel processing is a really nice alternative in these situations. It makes use of your computer's multiple processing cores to run the for loop code simultaneously across your list of units. This post presents code to:
- Perform an analysis using a conventional for loop.
- Modify this code for parallel processing.
To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.
The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years? A rough sketch of both approaches appears below.
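This is a minimal sketch of the two versions under stated assumptions: a toy per-unit regression stands in for the actual address-growth analysis, and the `units` list is a hypothetical stand-in for the per-area data.

```r
library(doParallel)  # also attaches foreach and parallel

# Toy stand-in for the per-area data: one data frame per unit
units <- split(mtcars, mtcars$cyl)

# 1. Conventional sequential for loop
results_seq <- vector("list", length(units))
for (i in seq_along(units)) {
  results_seq[[i]] <- lm(mpg ~ wt, data = units[[i]])  # placeholder analysis
}

# 2. The same analysis run simultaneously across cores
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
results_par <- foreach(u = units) %dopar% {
  lm(mpg ~ wt, data = u)  # placeholder analysis
}
stopCluster(cl)
```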
It’s incredibly useful to be able to automate an analysis or set of analyses that you want to perform multiple times in exactly the same way. For example, if you’re working in industry, you might want to perform analyses that allow you to draw separate conclusions about the performance of individual stores, regions, products, customers, or employees. If you’re working in academia, you might want to separately examine several different dependent variables. Frequently, this entails several distinct steps, such as subsetting the data, performing the analysis or set of analyses, and generating well-labeled output.
This post presents one approach for feeding R a list of units to loop through, and then iteratively performing the same set of tasks for each unit.
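The skeleton of the pattern, assuming a hypothetical `sales` data frame with `store_id`, `month`, and `revenue` columns, might look like this:

```r
# `sales` is a hypothetical data frame with store_id, month, and revenue columns
units <- unique(sales$store_id)

for (u in units) {
  # 1. Subset the data to the current unit
  unit_data <- subset(sales, store_id == u)

  # 2. Perform the analysis (placeholder model)
  fit <- lm(revenue ~ month, data = unit_data)

  # 3. Generate well-labeled output
  cat("\n==== Store:", u, "====\n")
  print(summary(fit))
}
```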