A Brief Introduction to caret

This post builds on the last by introducing the caret package (short for Classification And REgression Training). Caret provides a really nice, consistent wrapper around a variety of machine learning models, including random forests. It makes model tuning smooth and parallel processing a breeze.

To introduce the package, I’ll begin by replicating the model we ran in the last post, using caret this time, and then build on it to incorporate some of the additional things that caret makes possible.

You can obtain the data I’m working with here (2007-2011 data), and the code to transform it for the analyses at the beginning of the last post, here.

Run the data preparation code up through these lines:

# check the amount of missing data in remaining dataset
lapply(loan, function(x) { sum(is.na(x)) })

… but DON’T create the training and testing data sets. Caret has a function that will do that for us.

# As before, we will want to remove ID number so it isn’t used as a predictor.
loan$member_id <- NULL

# simplify things by only using complete cases
loan <- loan[complete.cases(loan),]

 

Training and Testing Datasets

Caret can create training and testing datasets based upon the percentage of the data you want in the training set. Conveniently, if your dependent variable is a factor (in other words, if you’re planning to run classification-based models), it does stratified random sampling, so the training and testing datasets have roughly the same class proportions. If your dependent variable is numeric, as ours is, the data are split into percentile-based groups and sampled within each group.

library(caret)
set.seed(1)
inTrain <- createDataPartition(y = loan$paid.back, # the outcome variable
                               p = 0.75, # the percentage of data in the training sample
                               list = FALSE) # should the results be in a list?

# the primary caret package function is called "train", 
# so we'll name the training dataset “training” instead.
training <- loan[inTrain,]
testing <- loan[-inTrain,]
remove(inTrain)

# Check that stratification created two similar-looking distributions
hist(training$paid.back)
hist(testing$paid.back)
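For a quick numeric check (my addition, not in the original post), the summary statistics of the two splits should also look similar if the stratified sampling worked:

# The training and testing distributions of paid.back should be close
summary(training$paid.back)
summary(testing$paid.back)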

randomForest Model

To minimize model variance, a random forest fits each of its ntree decision trees to a bootstrapped sample of the training data, and considers only a random subset of the predictors at each split. This approach might sound a little odd, but if it weren’t done, the first couple of splits for every tree would probably involve the same couple of very important predictors, resulting in similar, highly correlated trees across the forest. Averaging across highly correlated trees is not as effective at reducing variance as averaging across less correlated trees, hence the random sampling of predictors. (NB: In this context, I’m talking about machine learning variance, that is, undesirable variability in model results as a function of variations in training data, not variance in the statistical sense of the standard deviation squared, or the percentage of variability in the dependent variable explained by the model.)

Thus, the number of predictors sampled at each split (mtry) is one of the key tuning parameters for a random forest. In the last post we didn’t specify it, so the regression default of the number of predictors divided by 3, or 28/3 ≈ 9, was used. Here, we make it explicit to lay the groundwork for adjusting it in caret.
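If you want to verify that default yourself, a one-liner (my addition; it assumes paid.back is the only non-predictor column left in the training data) reproduces it:

# randomForest's default mtry for regression: floor(number of predictors / 3)
floor((ncol(training) - 1) / 3) # subtract 1 for the outcome, paid.back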

For illustration purposes, I’m also using 50 rather than 500 trees. This helps the code run a bit faster for these examples, before I bump it back up and use parallel processing.

library(randomForest)
set.seed(1)
original.rf <- randomForest(paid.back ~ .,
                            data = training,
                            ntree = 50,
                            mtry = 9,
                            importance = TRUE,
                            na.action = na.omit)

 

Random Forest in caret

To exactly replicate the randomForest() results via caret, there are two things we’ll need to set. The first is the random seeds, which in caret have to be set within trainControl. We’ll want to make sure the seeds match – specifically the seed for the final model.

The second involves how the model is specified. If you use the formula interface (“Y~X”), the caret train() function will convert any factor variables in the formula to dummy variables. To preserve their status as factors, so these results match those from the randomForest() model, we’ll use the non-formula interface below.

Please note that the non-formula interface does not permit missing data. 
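If you’re curious what the formula interface would do with factor predictors, an illustrative peek at model.matrix() (my addition, not part of the analysis) shows the dummy-coded column names that would get handed to the model:

# Factor predictors are expanded into one dummy column per level (minus the reference level)
head(colnames(model.matrix(paid.back ~ ., data = training)))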

# trainControl needs a list of seeds: one element per bootstrap resample (25 by default)
# plus a final element used when fitting the last model on the full training set
seeds <- as.vector(c(1:26), mode = "list")
seeds[[26]] <- 1 # the final-model seed, matching the set.seed(1) used for randomForest()

caret.rf <- train(training[, -28], training$paid.back, # non-formula interface: predictors, then outcome
                  method = "rf", # specify random forest
                  ntree = 50,
                  tuneGrid = data.frame(mtry = 9), # number of predictors sampled at each split
                  importance = TRUE,
                  na.action = na.omit,
                  trControl = trainControl(method = "boot",
                                           seeds = seeds,
                                           allowParallel = FALSE)) # allowParallel is a trainControl argument

print(original.rf)
print(caret.rf$finalModel)

importance(original.rf)
importance(caret.rf$finalModel)

 

Flexing caret

The preceding exercise demonstrated caret in the context of a model with which we were already familiar. Now we’ll use the package to do more.

One of my favorite features of caret is that it makes parallel processing really easy. All you have to do is register your clusters and set trainControl's allowParallel argument to TRUE. This makes it a lot faster to run all the models we'll be running here, so I increase the number of decision trees to 200.
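If you’re not sure how large a cluster to register, the parallel package can tell you how many cores your machine has (a small aside, not in the original post):

# Check available cores before choosing a cluster size
library(parallel)
detectCores()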

Caret will also help you identify the optimal number of predictors to sample at each split. You can provide caret with a tuning grid of candidate mtry values, or provide nothing, in which case it will try out a few values and report which one worked best. Below, I ask caret to try mtry values from 2 to 27 in steps of 5 (i.e., 2, 7, 12, 17, 22, and 27).
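As an aside (this is just a sketch, not part of the analysis below, and RF.auto is only an illustrative name), if you’d rather not build a grid yourself you can pass tuneLength instead of tuneGrid and caret will pick the candidate mtry values for you:

# Sketch only: let caret choose 5 candidate mtry values automatically
# (with tuneLength = 5, each resample's seed vector would also need 5 entries)
RF.auto <- train(training[, -28], training$paid.back,
                 method = "rf",
                 ntree = 200,
                 tuneLength = 5,
                 trControl = trainControl(method = "boot", number = 25))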

Lastly, I also set the random seeds below. This matters most for the final model, which we’ll compare to the original model. Setting seeds for a model with tuning parameters is a little more complicated than what we did before: we need seeds not only for each resampled dataset, but also for each candidate value of the tuned parameter within that resample.

mtryGrid <- expand.grid(mtry = seq(2,27,5))

set.seed(1)
seeds <- vector(mode = "list", length = 26) # length = number of resamples + 1
for(i in 1:25) seeds[[i]] <- sample.int(n = 1000, 6) # 6 seeds per resample: one for each candidate mtry value
seeds[[26]] <- 1 # seed for the final model
remove(i)

library(doParallel)
cl <- makeCluster(4) # register 4 parallel workers
registerDoParallel(cl)

RF <- train(training[, -28], training$paid.back,
            method = "rf",
            ntree = 200,
            metric = "RMSE",
            importance = TRUE,
            na.action = na.omit,
            tuneGrid = mtryGrid,
            trControl = trainControl(method = "boot",
                                     number = 25,
                                     seeds = seeds,
                                     allowParallel = TRUE)) # allowParallel belongs in trainControl

stopCluster(cl)
remove(cl)
registerDoSEQ()

# Examine results
print(RF$finalModel)

You can see that the mean of squared residuals is lower and the variance explained is higher than they were before. Some of this is due to having a higher ntree value - as illustrated by the "Mean Squared Error by Number of Decision Trees" figure in the previous post, error will decrease with more trees until it reaches an asymptote - but some of it is also due to a slightly better-tuned model with regard to the number of predictors per tree. This model is fitting just a little better.

plot(RF, metric = "RMSE")

 

Interestingly, when we let caret try mtry values from 2 to 27, the optimal value (7) was not far from randomForest's default of the total number of predictors divided by 3, which in this dataset is about 9.
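You can confirm which value caret selected, and how each candidate fared across the resamples, by inspecting the train object directly:

# Resampled performance for each mtry value tried, and the selected value
RF$results
RF$bestTune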


Generate Predictions

The only task that remains is to test the predictions that result from this model against those produced by our original model.

# Original model
original.preds <- predict(object = original.rf, newdata = testing) # Predict the test data

results <- data.frame(actual = round(testing$paid.back, 2), original.preds = round(original.preds, 2))

# caret model
caret.preds <- predict(object = RF, newdata = testing) # Predict the test data
results <- data.frame(results, caret.preds = round(caret.preds, 2))

# Examine mean squared error in test data
results$original.residual <- results$actual - results$original.preds
results$original.residual2 <- results$original.residual^2
mean(results$original.residual2)

results$caret.residual <- results$actual - results$caret.preds
results$caret.residual2 <- results$caret.residual^2
mean(results$caret.residual2)
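As a cross-check (not part of the original write-up), caret's postResample() function reports RMSE, which is just the square root of the MSE computed above, along with R-squared, directly from the predictions:

# Test-set performance summaries for each model
postResample(pred = original.preds, obs = testing$paid.back)
postResample(pred = caret.preds, obs = testing$paid.back)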

The random forest model run via caret produced slightly better predictions in the test dataset.