Multilevel Models to Explore the Impact of the Affordable Care Act’s Shared Savings Program, Part I

The Affordable Care Act encompasses a host of programs and provisions, which are the subject of much discussion and debate right now. This post offers a nonpartisan, genuinely curious exploration of one of the Affordable Care Act’s less frequently debated programs, the Shared Savings Program (SSP). The program’s objective is to reduce Medicare costs and increase the quality of care provided to Medicare patients. In this short sequence of posts, I explore whether the program looks to be meeting these objectives. Given that this topic may appeal to both non-technical and technical audiences, I’ve split the posts into a higher-level description of the findings (part I) and a more technical post with code (part II).

Data on ACOs and the Shared Savings Program is publicly available. My last post includes detailed code to download and prepare the data for analysis.

Read More

Random Forest for Loan Performance Prediction

Random Forests are among the most powerful predictive analytic tools. They leverage the considerable strengths of decision trees, including handling non-linear relationships, being robust to noisy data and outliers, and determining predictor importance for you. Unlike single decision trees, however, they don’t need to be pruned, are less prone to overfitting, and produce aggregated results that tend to be more accurate.

This post presents code to prepare data for a random forest, run the analysis, and examine the output.

The specific question I answer with these analyses is: what is the predicted percentage of loan principal that will have been repaid by the time the loan reaches maturity? I’m using publicly available, 2007-2011 data from the Lending Club for these analyses. You can obtain the data here.
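As a rough sketch of the workflow, here is how fitting a random forest to predict a continuous outcome looks with the randomForest package. The variable names below (pct_principal_repaid, loan_amnt, int_rate, annual_inc) and the toy data are illustrative stand-ins, not the actual Lending Club columns or the analysis from the post:

```r
library(randomForest)

set.seed(42)

# Illustrative stand-in data; the real analysis uses the Lending Club file
loans <- data.frame(
  loan_amnt  = runif(500, 1000, 35000),
  int_rate   = runif(500, 5, 25),
  annual_inc = runif(500, 20000, 150000)
)
# A made-up outcome loosely tied to interest rate, capped at 100%
loans$pct_principal_repaid <- pmin(100, 110 - loans$int_rate * 2 + rnorm(500, 0, 5))

# Fit the forest; importance = TRUE records predictor importance measures
rf <- randomForest(pct_principal_repaid ~ loan_amnt + int_rate + annual_inc,
                   data = loans, ntree = 500, importance = TRUE)

print(rf)        # out-of-bag error and % variance explained
importance(rf)   # predictor importance table
varImpPlot(rf)   # plot predictor importance
```

Note that, unlike a single decision tree, no pruning step appears anywhere: the forest's aggregation across trees handles overfitting for you.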

Read More

Launch a Shiny App on Your Own Server in 4 Steps

Shiny by RStudio accomplishes an extraordinary and disruptive feat: it puts dissemination of interactive, analytic results in the hands of the analysts. In professional settings, Business Intelligence (BI) software has historically provided the interactive, user-friendly interface, separating those performing analyses in a program like R from business users. Shiny makes it possible to design that interactive, user-friendly interface in R itself, obviating the need for an additional tool to make interactive, analytic results accessible to business users.

So, how would one take a Shiny app, such as the one we created in the previous post, and make it accessible to business users who probably don’t have R installed on their machines and may not be particularly technically savvy?
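The guest post below walks through the deployment steps. For context, a Shiny app is just R code; this is a minimal, illustrative single-file app of the kind that gets deployed (a placeholder, not the app from the previous post):

```r
library(shiny)

# The user interface: a slider and a plot area
ui <- fluidPage(
  titlePanel("Minimal demo"),
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("scatter")
)

# The server logic: re-render the plot whenever the slider moves
server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}

shinyApp(ui, server)
```

Running this locally launches the app in a browser; the deployment question is how to serve that same code to users who don't have R at all.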

Guest post by data warehouse, Business Intelligence, and software architecture expert Michael Helms.

Read More

Parallel Processing for Memory-Intensive Maps and Graphics

Rendering graphics typically takes R some time, so if you’re going to be producing a large number of similar graphics, it makes sense to leverage R's parallel processing capabilities. However, if you’re looking to collect and return the graphics together in a sorted object – as we were in the previous post on animated choropleths – there’s a catch. R has to keep the whole object in random access memory (RAM) during parallel processing. As the number of graphics files increases, you risk exceeding the available RAM, which will cause parallel processing to slow dramatically (or crash). In contrast, a good, old-fashioned sequential for loop can write updates to the object to the global environment after each iteration, clearing RAM for the next iteration. Paradoxically, then, parallel processing can take longer than sequential processing in this situation. In the case of the animated choropleths in the previous post, parallel processing took 21 minutes, whereas sequential processing took 11 minutes.

This post presents code to combine the efficiency and speed of parallel processing with the RAM-clearing benefits of sequential processing when generating graphics.
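A sketch of the core idea: have each parallel worker render its plot straight to disk and return nothing, so no large graphics object ever accumulates in RAM. The file names and plotting code here are illustrative, not the choropleth code from the previous post:

```r
library(parallel)

# Render one plot directly to a PNG file and return nothing,
# keeping the object collected by the parallel call tiny
render_one <- function(i) {
  png(sprintf("plot_%03d.png", i), width = 800, height = 600)
  plot(rnorm(1000), main = paste("Plot", i))
  dev.off()
  invisible(NULL)
}

# Run the rendering across all but one of the machine's cores
cl <- makeCluster(detectCores() - 1)
parLapply(cl, 1:50, render_one)
stopCluster(cl)
```

The files on disk can then be assembled (e.g., into an animation) in a cheap sequential step, so you keep the speed of parallel rendering without holding every graphic in memory at once.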

Read More

Parallel Processing

If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however: as the number of units increases, so does the processing time. For large data sets, the processing time associated with a sequential for loop can become so cumbersome and unwieldy as to be unworkable. Parallel processing is a really nice alternative in these situations. It makes use of your computer's multiple processing cores to run the for loop code simultaneously across your list of units. This post presents code to:

  1. Perform an analysis using a conventional for loop.
  2. Modify this code for parallel processing.
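The two steps above can be sketched as follows with the parallel package. The toy data and the mean() "analysis" are illustrative stand-ins; the post applies the same pattern to the New Orleans addresses data set:

```r
library(parallel)

set.seed(1)

# Illustrative stand-in: ten units, each a small data frame
units <- split(data.frame(x = rnorm(1000)), rep(1:10, each = 100))

# 1. Conventional sequential for loop
results_seq <- vector("list", length(units))
for (i in seq_along(units)) {
  results_seq[[i]] <- mean(units[[i]]$x)
}

# 2. The same analysis run in parallel across the available cores
cl <- makeCluster(detectCores() - 1)
results_par <- parLapply(cl, units, function(d) mean(d$x))
stopCluster(cl)
```

The parallel version replaces the loop body with a function applied to each unit; the results come back as a list in the same order as the input, just like the sequential version.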

To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.

The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years?

Read More