This post addresses a common data science task – comparing multiple models – and explores how you might do this when you’re running the models in R's caret package. We’ll work with the same data set and objective as the last post, which involved predicting which customers would respond to a marketing campaign, and build on that post by making one of the models we’re comparing a neural network. The other models I’m adding here are a random forest and a logistic regression.
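As a rough illustration of that kind of comparison, here is a minimal sketch in caret. It is not the code from the post: the data come from caret's `twoClassSim()` as a simulated stand-in for the marketing campaign data, and the object names (`dat`, `folds`, `fit_glm`, `fit_rf`, `fit_nnet`) are hypothetical. It assumes the `caret`, `randomForest`, and `nnet` packages are installed. The three models share the same cross-validation folds, so `resamples()` compares them on identical splits.

```r
library(caret)

set.seed(123)
# Simulated two-class data; a stand-in for the campaign-response data
dat <- twoClassSim(n = 300)

# Build one set of training folds so every model sees identical resamples
folds <- createFolds(dat$Class, k = 5, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", number = 5, index = folds)

# Logistic regression
fit_glm <- train(Class ~ ., data = dat, method = "glm",
                 family = binomial, trControl = ctrl)

# Random forest (tuneLength = 2 keeps the tuning grid small)
fit_rf <- train(Class ~ ., data = dat, method = "rf",
                tuneLength = 2, trControl = ctrl)

# Single-hidden-layer neural network; inputs are centered and scaled first
fit_nnet <- train(Class ~ ., data = dat, method = "nnet",
                  preProcess = c("center", "scale"),
                  trace = FALSE, trControl = ctrl)

# Collect the fold-by-fold results and summarize them side by side
results <- resamples(list(logistic      = fit_glm,
                          random_forest = fit_rf,
                          neural_net    = fit_nnet))
summary(results)
```

`summary(results)` reports the distribution of accuracy and kappa across folds for each model, which is usually more informative than a single hold-out score.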
Random Forests are among the most powerful predictive analytic tools. They leverage the considerable strengths of decision trees, including handling non-linear relationships, being robust to noisy data and outliers, and determining predictor importance for you. Unlike single decision trees, however, they don’t need to be pruned, are less prone to overfitting, and produce aggregated results that tend to be more accurate.
This post presents code to prepare data for a random forest, run the analysis, and examine the output.
The specific question I answer with these analyses is: what is the predicted percentage of loan principal that will have been repaid by the time the loan reaches maturity? I’m using publicly available 2007-2011 data from the Lending Club for these analyses. You can obtain the data here.
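To give a sense of the prepare, run, and examine steps, here is a minimal sketch of a regression random forest. It is not the post's code: it assumes the `randomForest` package, and it uses simulated data with hypothetical column names (`int_rate`, `annual_inc`, `dti`, `term`, `pct_repaid`) rather than the Lending Club file.

```r
library(randomForest)

set.seed(1)
n <- 500

# Simulated stand-in for loan data; all column names are hypothetical
loans <- data.frame(
  int_rate   = runif(n, 5, 25),
  annual_inc = rlnorm(n, meanlog = 11, sdlog = 0.5),
  dti        = runif(n, 0, 35),
  term       = factor(sample(c("36 months", "60 months"), n, replace = TRUE))
)

# Outcome: percentage of principal repaid, clipped to the range 0-100
loans$pct_repaid <- pmin(100, pmax(0,
  110 - 1.5 * loans$int_rate - 0.4 * loans$dti + rnorm(n, sd = 10)))

# Prepare: hold out 25% of the rows for testing
train_idx <- sample(n, size = 0.75 * n)
train_set <- loans[train_idx, ]
test_set  <- loans[-train_idx, ]

# Run: a numeric outcome makes randomForest fit a regression forest
rf_fit <- randomForest(pct_repaid ~ ., data = train_set,
                       ntree = 300, importance = TRUE)

# Examine: out-of-bag error, predictor importance, and hold-out RMSE
print(rf_fit)
importance(rf_fit)
preds <- predict(rf_fit, newdata = test_set)
rmse  <- sqrt(mean((preds - test_set$pct_repaid)^2))
rmse
```

Because each tree predicts an average of training outcomes, the forest's predictions stay within the observed 0 to 100 range without any extra constraint.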