This post addresses a common data science task – comparing multiple models – and explores how you might do this when you’re running the models in R's caret package. We’ll work with the same data set and objective as the last post, which involved predicting which customers would respond to a marketing campaign, and build on that post by making one of the models we’re comparing a neural network. The other models I’m adding here are a random forest and a logistic regression.

You can obtain the data from the University of California at Irvine online data repository here.

We’ll need to do the same data wrangling we performed in the last post, but I’m going to start with the data-cleaning tasks that apply to all three models, because the data needs to be processed differently for the neural network than it does for the random forest and logistic regression.

```r
data <- read.csv("bank-additional-full.csv", header = TRUE, sep = ";")
str(data)

# Remove "duration", as this variable would not be known before a call is initiated
data$duration <- NULL

# Remove the "number of days since last contact" variable, for which most values
# are "999" (no previous contact) and whose information is captured by "previous"
data$pdays <- NULL
```

You might remember from the last post that in classification tasks, caret considers the first level of the factored dependent variable to represent the occurrence of the event we’re trying to model. We’ll be using caret to run all three models, so I’ll go ahead and specify the factor order now.

```r
# Re-order dependent variable factor levels for caret
table(data$y)
data$y <- factor(data$y, levels = c("yes", "no"), ordered = FALSE)
table(data$y)
levels(data$y)
```

I’m also going to select the observations (or customers) that we’ll use in our test and train datasets now. I’m not going to actually create the datasets yet, because we’ll need to pre-process the neural network data. We’ll still want the same observations to be included in each model’s train and test datasets, though.

```r
# Determine which observations should be in the test and train datasets
library(caret)
set.seed(1)
inTrain <- createDataPartition(y = data$y,    # the outcome variable
                               p = 0.75,      # the percentage of data in the training sample
                               list = FALSE)  # should the results be in a list?
```

So now we’ve imported the raw data, made sure that caret interprets the levels of the dependent variable correctly, and specified which observations will be in the training and test datasets.

The next task is to do any additional, model-specific data cleaning or processing and run the models.

**Neural Network**

I’m working with the same dataset from the previous post and running the same neural network model, so the code is essentially identical. The only difference is that in this post, I’m using *repeated* N-fold cross validation (“method = ‘repeatedcv’ ”). Repeating the N-folds generally reduces solution variance and is good practice. Using it here gives us a chance to create a *seeds* object for this cross-validation option.

Keep in mind that you need the *seeds* object if you want fully reproducible results and are running the models in parallel processing. If you aren’t using parallel processing, though, and/or don’t need to be able to exactly reproduce the results from your model, you don’t need to worry about seeds.

```r
#############################################
# Clean and prepare data for neural network #
#############################################

# Use min-max normalization to re-scale numeric variables
num.vars <- data[, sapply(data, is.numeric)]
maxs <- apply(num.vars, 2, max)
mins <- apply(num.vars, 2, min)
processed <- as.data.frame(scale(num.vars, center = mins, scale = maxs - mins))
apply(processed, 2, range)  # Check that all variables are rescaled to 0-1
remove(mins, maxs, num.vars)

# Dummy-code factor variables
vars <- data[, sapply(data, is.factor)]  # Select factor variables for dummy-coding
vars$y <- NULL  # Remove the dependent variable, which we will want to keep as a factor
vars <- colnames(vars)

library(psych)
for (i in 1:length(vars)) {
  var <- vars[i]
  new <- dummy.code(data[, var])  # Convert factor to dummy variables

  # Rename dummy variables
  names <- colnames(new)
  names <- paste(var, names, sep = ".")
  colnames(new) <- names
  remove(names)

  processed <- data.frame(processed, new)
  remove(new)
}
remove(i, vars, var)

# Add y to processed dataset
processed$y <- data$y
table(processed$y)
levels(processed$y)

# Create train and test datasets
train.nn <- processed[inTrain, ]
test.nn <- processed[-inTrain, ]

###############################
# Run neural network in caret #
###############################
library(nnet)
library(caret)

# Grid of tuning parameters to try:
nn.Grid <- expand.grid(.size = c(5, 10, 15), .decay = c(0.001, 0.01, 0.1))
```

I’m running 5-fold cross validation and repeating it twice. This means that we’ll be running 10 iterations of each neural network model we test, so we need a *seeds* list that is 10-plus-1 items long. (The “plus 1” is for the final model.) We’re also testing 9 possible tuning parameter combinations (three possible values for the number of hidden nodes times three possible weight decay values), so each of the 10 list items will need 9 slots. This all means that we'll be running 90 neural networks here (thank goodness for parallel processing!), plus the final model. The final model will use the best-fitting number of hidden nodes and weight decay, and apply them to the entire training dataset to produce the $finalModel fit. In this case, we’ve asked caret to define “best-fitting” as the model that maximizes sensitivity to detecting campaign responders.
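The arithmetic above can be sanity-checked in plain R (nothing here requires caret; the grid simply mirrors the nn.Grid defined earlier):

```r
# Bookkeeping for the seeds object: resamples, tuning combinations, total fits
folds <- 5
repeats <- 2
n.resamples <- folds * repeats  # 10 resampling iterations

tune.grid <- expand.grid(size = c(5, 10, 15), decay = c(0.001, 0.01, 0.1))
n.combos <- nrow(tune.grid)     # 9 tuning parameter combinations

n.models <- n.resamples * n.combos  # 90 networks fit during tuning, plus 1 final model
```

So the seeds object needs n.resamples + 1 = 11 list items, the first 10 of which hold n.combos = 9 seeds each.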

```r
# Set the seeds
set.seed(1)
nn.seeds <- vector(mode = "list", length = 11)  # number of resamples + 1 for final model
for(i in 1:10) nn.seeds[[i]] <- sample.int(n = 1000, 9)  # 9 is the number of tuning parameter combinations
nn.seeds[[11]] <- 1  # for the final model
remove(i)
nn.seeds
```

The trainControl() object is almost identical across all three models (the neural network, the random forest, and the logistic regression), to facilitate comparisons. In theory, this means that we’d only need to create one trainControl() object and could use it for all three models. The number of tuning parameters we’re adjusting does vary across the models, though, so the *seeds* object will have different numbers of slots for each model.

```r
library(pROC)
nn.Control <- trainControl(method = "repeatedcv",  # use repeated N-fold cross validation
                           number = 5,             # the number of folds
                           repeats = 2,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           seeds = nn.seeds)

library(doParallel)
cl <- makeCluster(6)
registerDoParallel(cl)

# Fit model
model.nn <- train(y ~ .,
                  data = train.nn,
                  method = 'nnet',
                  maxit = 1000,
                  linout = FALSE,
                  trControl = nn.Control,
                  tuneGrid = nn.Grid,
                  metric = "Sens",
                  allowParallel = TRUE)

stopCluster(cl)
remove(cl)
registerDoSEQ()

model.nn
plot(model.nn, metric = "Sens")
remove(nn.Control, nn.Grid, nn.seeds, processed)
```

The plot(model.nn, metric = "Sens") command shows us the impact of weight decay and the number of hidden nodes on model sensitivity. We can see that, consistent with the results from the last post, the combination of tuning parameters that maximized sensitivity had a weight decay of 0.001 and 15 hidden nodes.

**Random Forest**

One of the advantages of random forest models is that they require very little data processing – they handle non-normal distributions and outliers with ease, and the data doesn’t need to be normalized. The only real concern is whether there are any factor variables with more than 32 levels, which R’s randomForest package can’t handle. I use the short function below to check for factor variables with too many levels. There aren’t any, so we can move on to creating the train and test datasets for the random forest.

```r
# Check for factor variables with more than 32 levels
too.many.levels <- function(x) {
  is.factor(x) & length(levels(x)) > 32
}
sapply(data, too.many.levels)  # All FALSE, so nothing needs to be dropped
remove(too.many.levels)

##################################
# Create train and test datasets #
##################################
train <- data[inTrain, ]
test <- data[-inTrain, ]
```

I’m not going to say too much about the random forest model, as much of the code here may be familiar from my recent random forest and random forest in caret blog posts.

```r
##############################
# Run random forest in caret #
##############################
library(caret)
library(randomForest)

# Default number of predictors tried at each split is predictors/3, or about 6
rf.Grid <- expand.grid(mtry = seq(from = 3, to = 18, by = 3))

set.seed(1)
rf.seeds <- vector(mode = "list", length = 11)  # length is (nresampling) + 1
for(i in 1:10) rf.seeds[[i]] <- sample.int(n = 1000, 6)  # 6 is the number of tuning parameters (mtry possibilities)
rf.seeds[[11]] <- 1  # for the final model
remove(i)
rf.seeds

library(pROC)
rf.Control <- trainControl(method = "repeatedcv",  # use repeated N-fold cross validation
                           number = 5,             # the number of folds
                           repeats = 2,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           seeds = rf.seeds)

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

model.rf <- train(train[, -19], train$y,
                  method = "rf",
                  ntree = 100,
                  importance = TRUE,
                  na.action = na.omit,
                  tuneGrid = rf.Grid,
                  trControl = rf.Control,
                  metric = "Sens",
                  allowParallel = TRUE)
```

```r
stopCluster(cl)
remove(cl)
registerDoSEQ()
remove(rf.Grid, rf.Control, rf.seeds)

# Examine results
print(model.rf$finalModel)
```

The output lets us know that the model that maximized sensitivity to detecting campaign responders sampled 18 candidate predictors at each split (mtry = 18).

```r
importance(model.rf$finalModel)
```

This importance() command lets us know which variables had the most explanatory power. I use this information to cheat a little when preparing the logistic regression next.

```r
plot(model.rf, metric = "Sens")
```

Lastly, plot(model.rf, metric = "Sens") gives us a sense of the impact the different mtry values had on the model’s sensitivity.

**Logistic Regression**

In many ways, data cleaning and preparation for a more traditional statistical model, like logistic regression, is the most involved. Typically, this involves generating lots of two-way tests and boxplots looking at the relationship between a potential predictor and the dependent variable, centering variables that might be used in interactions, and developing a carefully curated list of meaningful predictors that maximize the variance explained by the model. All of this work is good practice, especially if you need to understand the relationships among your predictors and your outcome.

It’s also necessary because a data analytic approach like logistic regression can’t always use all possible predictors. You can bump into problems with rank deficiency – caused by not having enough data to estimate the model, having predictors with too little variance, problems with predictor units and scaling, etc.
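As a tiny illustration of rank deficiency (a toy example I'm making up here, not drawn from the bank data): when one predictor is a perfect linear function of another, glm() drops the redundant column and reports an NA coefficient for it.

```r
# Toy demonstration: a perfectly collinear predictor makes the model
# matrix rank-deficient, and glm() aliases it out with an NA coefficient
set.seed(1)
d <- data.frame(x1 = rnorm(50))
d$x2 <- 2 * d$x1                         # x2 is an exact linear function of x1
d$y  <- rbinom(50, 1, plogis(d$x1))      # binary outcome driven by x1
fit <- glm(y ~ x1 + x2, data = d, family = binomial)
coef(fit)                                # x2's coefficient is NA
```

With real data the collinearity is rarely this exact, but near-duplicated or low-variance predictors cause the same kind of instability, which is why predictor curation matters.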

So, you generally need to put some care into developing the list of predictors for a logistic regression model. Sadly, that is not what I’m going to do here. For purposes of this blog post, I’m going to cheat. Below, I pull the top 10 most important predictors from the random forest model, combine them into a formula, and feed this formula to the logistic regression model. I fully acknowledge that this quick-and-dirty approach probably does not produce the strongest logistic regression model I could assemble from the predictors. But it is speedy and functional.

```r
###################################################
# Select predictors for logistic regression model #
###################################################

# Quick-and-dirty predictor selection
importance <- as.data.frame(importance(model.rf$finalModel))
importance <- importance[order(-importance$MeanDecreaseAccuracy), ]
importance$vars <- rownames(importance)
rownames(importance) <- NULL
vars <- importance[1:10, "vars"]
f <- as.formula(paste("y ~", paste(vars, collapse = " + ")))
remove(vars, importance)
```

If you come from a statistical background, the idea of running a logistic regression in caret is probably a bit confusing – if we’re not experimenting with different model parameters (like number of hidden nodes, weight decay, etc.), and the final model fit to the entire sample is the same as the initial model we provide, why not run the model once in glm() and be done with it? Why would we use caret?

The answer lies in the cross-validation. Even when we don’t have any parameters to tune, caret is repeatedly sampling the training data (10 times, for the 5 folds repeated twice), fitting the fixed model each time, and calculating the sensitivity, specificity, etc. on the held-out data each time. In this way, even though it’s not selecting an optimal combination of tuning parameters, it *is* giving us a sense of the variability in model metrics like sensitivity. The $finalModel parameter estimates for the impact of the predictors on the dependent variable, and their associated p-values, are exactly the same as they would be from a single run of glm(), though.
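A quick way to convince yourself of that last claim is a toy sketch on R's built-in mtcars data (this is my own illustration, not part of the bank analysis): fit the same logistic regression directly with glm() and through caret with resampling switched off, and compare the coefficients.

```r
library(caret)

# Direct glm() fit: transmission type (am) predicted from mpg and weight
fit.glm <- glm(am ~ mpg + wt, data = mtcars, family = binomial)

# Same model through caret; method = "none" skips resampling and fits once
fit.caret <- train(factor(am) ~ mpg + wt, data = mtcars,
                   method = "glm", family = binomial,
                   trControl = trainControl(method = "none"))

# The coefficient estimates from the two fits are identical
all.equal(coef(fit.glm), coef(fit.caret$finalModel))
```

The cross-validated run in the post produces the same $finalModel as this single fit; the resampling only adds the held-out performance estimates.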

In the code below, which is very similar to the function calls for the neural network and random forest models, I’ve removed the part of the train() command that asks train() to maximize model sensitivity ("metric = ‘Sens’ "). It wouldn’t be doing anything here, as there aren’t different tuning parameters to choose from when creating the final model.

```r
##########################################
# Run logistic regression model in caret #
##########################################
set.seed(1)
log.seeds <- vector(mode = "list", length = 11)  # number of resamples + 1 for final model
for(i in 1:10) log.seeds[[i]] <- sample.int(n = 1000, 1)  # 1 because we aren't tuning any parameters - just resampling
log.seeds[[11]] <- 1  # for the final model
remove(i)
log.seeds

library(pROC)
log.Control <- trainControl(method = "repeatedcv",  # use repeated N-fold cross validation
                            number = 5,             # the number of folds
                            repeats = 2,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            seeds = log.seeds)

library(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)

model.log <- train(f, data = train,
                   method = "glm",
                   family = binomial(link = 'logit'),
                   trControl = log.Control)

summary(model.log)

stopCluster(cl)
remove(cl)
registerDoSEQ()
remove(log.Control, log.seeds, f)
```

**Comparing Models**

Now that we have our neural network, random forest, and logistic regression models, we can compare them using caret’s resamples() command. Given that we asked caret to calculate the class probabilities ("classProbs = TRUE") and use the twoClassSummary summary function ("summaryFunction = twoClassSummary"), this gives us the sensitivity, specificity, and ROC values for every resampling iteration of the $finalModel. (For the logistic regression, this is just the 10 resamples; for the other models, it refers to the iterations run with the tuning parameters that were ultimately selected as best.)

```r
library(caret)

# Collect resamples
results <- resamples(list(RandomForest = model.rf,
                          NeuralNetwork = model.nn,
                          LogisticRegression = model.log))
results$values

# Summarize the distributions
summary(results)

# Boxplot of results
bwplot(results, metric = "Sens")
```

The task before us was to detect customers who would respond to the marketing campaign, so the metric we care most about here is the sensitivity. Looking at the summary of results and the boxplot for sensitivity, we can see that in the training data, the random forest detected on average around 30.3% of campaign responders. The neural network detected on average around 29.7% of campaign responders, and the logistic regression only correctly classified an average of 23.1% (although, in fairness, my quick-and-dirty predictor selection technique is probably partially responsible for this low value).

We can also look at the specificity (metric="Spec") and ROC (metric="ROC") values, although we want to keep in mind that we asked the random forest and neural network models to tune themselves to optimize sensitivity. Doing this can compromise specificity, which in this example is the ability to correctly classify non-responders, because the model is very slightly shifted towards classifying someone as a responder. This is exactly what we see. Specificity is worse for the two models that were tuned for sensitivity. The dataset is unbalanced (11% = responders, 89% = nonresponders), so compromised specificity also has a disproportional impact on the random forest and neural network's ROC values. Despite this, the random forest's specificity and ROC values slightly exceed the neural network's values.

```r
bwplot(results, metric = "Spec")
bwplot(results, metric = "ROC")
remove(results)
```

Taken cumulatively, the results suggest that the random forest does the best job of detecting customers who will respond to the marketing campaign. The neural network is a close second, but the random forest also does a slightly better job on the other metrics.

**Testing Predictions in the Test Data**

We think we have selected the best model, but what remains now is to confirm this by using each model to make predictions on the test data.

We’ll use caret’s predict() function to do this, and specify that we want it to output the actual class it thinks each observation belongs to (i.e., "yes" or "no"), as opposed to the probabilities of those classes. We do this with the "type = ‘raw’ " argument.

```r
# Generate predicted classes
preds.log <- predict.train(model.log, newdata = test, type = "raw")    # Logistic regression
preds.nn <- predict.train(model.nn, newdata = test.nn, type = "raw")   # Neural network - remember to use the processed data
preds.rf <- predict.train(model.rf, newdata = test, type = "raw")      # Random forest
```

We could manually work out the sensitivity, specificity, etc. as we did in the last post, but caret actually offers a really nice function that will do all of this plus some extra stuff with one function call: confusionMatrix().

```r
results.nn <- confusionMatrix(preds.nn, test$y, positive = "yes")
results.log <- confusionMatrix(preds.log, test$y, positive = "yes")
results.rf <- confusionMatrix(preds.rf, test$y, positive = "yes")
results.nn
results.log
results.rf
```

The resulting sensitivity values indicate that in the new, never-before-analyzed test data, random forest correctly detected 29.8% of responders, neural network detected 28.1%, and logistic regression finished last at 22.8%. (I take partial responsibility for its poor showing.)

There's one last number I want to draw your attention to before we end. The confusionMatrix objects contain a metric called "Pos Pred Value", or positive predictive value: the percentage of the model's "yes" predictions that actually turned out to be "yes". (Sensitivity is the number of correct positive classifications out of all the *actual* positive events in the data, both identified and missed; positive predictive value is the number of correct positive classifications out of all positive classifications made, both correct and incorrect.) In the context of this example, positive predictive value is the percentage of the time we said a customer would respond to the marketing campaign and they actually did. In our results.rf object, we can see that the positive predictive value for the random forest is 0.505, or about 51%. Given that we're working with a binary outcome, you might initially think, "well, that's not much better than chance," but this overlooks something: the two outcomes are *not* equally likely. Only ~11% of the customers responded to the marketing campaign. This means that in the absence of a predictive model like this, when the company marketed to a customer, there was only an 11% chance that the customer would buy the product (in this case, a long-term deposit). We've increased that to a 50% chance that when the company markets to a customer we recommend, the customer will buy the deposit!
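To make the sensitivity/positive-predictive-value distinction concrete, here's the arithmetic on a hypothetical 2x2 confusion matrix (counts invented to roughly mirror the rates reported above, not the actual model output):

```r
# Hypothetical confusion-matrix counts (for illustration only)
tp <- 303   # predicted "yes", customer actually responded
fp <- 297   # predicted "yes", customer did not respond
fn <- 714   # predicted "no", customer actually responded
tn <- 8000  # predicted "no", customer did not respond

sensitivity <- tp / (tp + fn)  # correct "yes" calls out of all actual responders
ppv <- tp / (tp + fp)          # correct "yes" calls out of all "yes" calls made
base.rate <- (tp + fn) / (tp + fp + fn + tn)  # response rate with no model at all

round(c(sensitivity = sensitivity, ppv = ppv, base.rate = base.rate), 3)
```

Note how the same 303 true positives produce a sensitivity near 30% but a positive predictive value of 50.5%, against a base response rate of about 11% - the two metrics divide by different denominators.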

Complete code for this post is available on GitHub at: https://github.com/kc001/Blog_code/blob/master/2016.07%20Compare%20ML%20models.R.