This post addresses a common data science task – comparing multiple models – and explores how you might do this when you’re running the models in R's caret package. We’ll work with the same data set and objective as the last post, which involved predicting which customers would respond to a marketing campaign, and build on that post by making one of the models we’re comparing a neural network. The other models I’m adding here are a random forest and a logistic regression.
Neural networks are a great analytic tool for generating predictions from existing data. They can detect complex, non-linear relationships in data (including interactions among predictors), can handle large datasets with many predictors, and often produce more accurate predictions than linear or logistic regression. Like random forests, they can be used for either regression or classification.
For this post, I take on a classic classification challenge and seek to answer the question: which customers are most likely to respond to a marketing campaign?
Rendering graphics typically takes R some time, so if you’re going to be producing a large number of similar graphics, it makes sense to leverage R's parallel processing capabilities. However, if you’re looking to collect and return the graphics together in a sorted object – as we were in the previous post on animated choropleths – there’s a catch. R has to keep the whole object in random access memory (RAM) during parallel processing. As the number of graphics files increases, you risk exceeding the available RAM, which will cause parallel processing to slow dramatically (or crash). In contrast, a good, old-fashioned sequential for loop can write updates to the object to the global environment after each iteration, clearing RAM for the next iteration. Paradoxically, then, parallel processing can take longer than sequential processing in this situation. In the case of the animated choropleths in the previous post, parallel processing took 21 minutes, whereas sequential processing took 11 minutes.
This post presents code to combine the efficiency and speed of parallel processing with the RAM-clearing benefits of sequential processing when generating graphics.
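One way to get both benefits is to run the graphics in small parallel batches inside an ordinary sequential loop, so that only one batch of results is held by the workers at a time. The sketch below is a minimal illustration of that idea using base R's `parallel` package, and is not necessarily the approach the full post takes. The `render_plot()` helper, the batch size, the two-worker cluster, and the choice to write PDFs to a temporary folder are all assumptions made for the example.

```r
library(parallel)

# Hypothetical plotting task: draw one simple plot, save it to disk,
# and return only the file path (not the graphic itself).
render_plot <- function(i, out_dir) {
  path <- file.path(out_dir, sprintf("plot_%03d.pdf", i))
  grDevices::pdf(path)
  graphics::plot(seq_len(10), seq_len(10) * i, main = paste("Plot", i))
  grDevices::dev.off()
  path
}

# Assumed output location: a fresh temporary folder
out_dir <- tempfile("plots_")
dir.create(out_dir)

# Split the plot ids into batches of 4 (assumed batch size)
plot_ids   <- 1:12
batch_size <- 4
batches    <- split(plot_ids, ceiling(seq_along(plot_ids) / batch_size))

# Assumed worker count; adjust to the cores you have available
cl <- makeCluster(2)
all_paths <- character(0)

tryCatch({
  # Sequential loop over batches; each batch is rendered in parallel
  for (batch in batches) {
    batch_paths <- parLapply(cl, batch, render_plot, out_dir = out_dir)
    # Collect the results after each batch, as a sequential loop would
    all_paths <- c(all_paths, unlist(batch_paths))
  }
}, finally = stopCluster(cl))
```

Because each worker hands back a short file path rather than a full graphic, the object that accumulates in the main session stays small no matter how many plots are produced.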
If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however: as the number of units increases, so does the processing time. For large data sets, a sequential for loop can take so long as to be unworkable. Parallel processing is a good alternative in these situations. It uses your computer's multiple processing cores to run the loop body on several units at the same time. This post presents code to:
- Perform an analysis using a conventional for loop.
- Modify this code for parallel processing.
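As a rough sketch of those two steps, the example below runs the same calculation first with a conventional for loop and then in parallel using base R's `parallel` package. The data are made-up yearly counts for eight hypothetical areas, and `growth_rate()` and the two-worker cluster are assumptions for illustration only. The full post may well use a different package or setup.

```r
library(parallel)

# Hypothetical stand-in data: yearly address counts for eight areas
set.seed(42)
areas  <- paste0("area_", 1:8)
counts <- lapply(areas, function(a) {
  data.frame(year = 2010:2014, n = cumsum(sample(50:100, 5)))
})
names(counts) <- areas

# Proportional growth between the first and last year for one area
growth_rate <- function(df) {
  (df$n[nrow(df)] - df$n[1]) / df$n[1]
}

# 1. Conventional for loop: one area at a time
seq_result <- numeric(length(areas))
names(seq_result) <- areas
for (i in seq_along(areas)) {
  seq_result[i] <- growth_rate(counts[[i]])
}

# 2. Parallel version: the same function, spread across worker processes
cl <- makeCluster(2)  # assumed worker count
par_result <- tryCatch(
  unlist(parLapply(cl, counts, growth_rate)),
  finally = stopCluster(cl)
)

# Both approaches should give the same answer
print(all.equal(seq_result, par_result))
```

With only eight small units, the overhead of starting the workers will likely outweigh any speed gain. The payoff comes when each unit takes a meaningful amount of time to process.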
To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.
The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years?