Neural networks are a great analytic tool for generating predictions from existing data. They can detect complex, non-linear relationships in data (including interactions among predictors), can handle large datasets with many predictors, and often produce more accurate predictions than regression/logistic regression. As with random forests, they can be used for regression or classification.

For this post, I take on a classic classification challenge and seek to answer the question: which customers are most likely to respond to a marketing campaign?

I'm assuming that we’re more concerned with accurately predicting customers who will respond to the campaign (i.e., maximizing the sensitivity of the model) than with accurately predicting those who will not respond (i.e., the specificity of the model). Of course, we care about both, and about overall accuracy, because we don’t want to waste resources marketing to customers who will not respond, but we’ll assume that it’s a bigger loss to miss a customer who would have responded than to market to a customer who doesn’t respond.
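To make these metrics concrete, here’s a minimal sketch, using a made-up confusion matrix (the counts are illustrative only, not from our data), of how sensitivity, specificity, and overall accuracy are computed:

```r
# Hypothetical confusion matrix: rows are predictions, columns are true values
conf <- matrix(c(120, 80,    # predicted "yes": 120 true responders, 80 false alarms
                 60, 740),   # predicted "no":  60 missed responders, 740 true negatives
               nrow = 2, byrow = TRUE,
               dimnames = list(predicted = c("yes", "no"), true = c("yes", "no")))

sensitivity <- conf["yes", "yes"] / sum(conf[, "yes"])  # responders we caught: 120/180
specificity <- conf["no", "no"] / sum(conf[, "no"])     # non-responders we caught: 740/820
accuracy <- sum(diag(conf)) / sum(conf)                 # overall: 860/1000
```

In this toy example, accuracy looks strong (0.86) even though we miss a third of the responders, which is exactly why we’ll optimize for sensitivity rather than accuracy below.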

**Obtain and preprocess data**

The dataset used for this blog post comes from a Portuguese banking institution, which ran a marketing campaign between 2008 and 2013 to sell long-term deposits. The data include predictors specific to the customer, such as age, profession, marital status, and education level, as well as five predictors associated with the general economic environment at the time the customer was contacted, such as the employment rate and consumer price index. The dependent variable, labeled “y” in the dataset, indicates whether the customer bought a deposit (identified in the dataset with a “yes”) or declined to do so (“no”). The data are available here from the University of California at Irvine repository of public datasets.

```r
raw.data <- read.csv("bank-additional-full.csv", header = TRUE, sep = ";")
str(raw.data)

data <- raw.data

# Remove "duration", as this variable refers to the duration of the marketing call,
# and would not be known before a call is initiated
data$duration <- NULL

# Remove the "number of days since last contact" variable,
# for which most values are "999" (no previous contact); this information is
# captured by the "previous" variable
data$pdays <- NULL
```

Neural networks work best when numerical variables are normalized, so they’re all on the same scale. This rescaling increases computing efficiency, because the system doesn’t have to accommodate predictors with both very large and very small values. I’m intentionally preprocessing the entire dataset at once, then splitting it into train and test datasets, because any preprocessing that is done to the training set should also be performed in exactly the same way on the test dataset.
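Min-max normalization maps each value x to (x - min) / (max - min), so every variable runs from 0 to 1. A quick sketch on a toy vector (the values here are illustrative only) shows that R’s scale() function, with center set to the minimum and scale set to the range, does the same thing:

```r
# Min-max normalization by hand on a toy vector
x <- c(2, 5, 11)
rescaled <- (x - min(x)) / (max(x) - min(x))
range(rescaled)  # always 0 and 1 after min-max scaling

# scale() with center = min and scale = max - min is equivalent
rescaled2 <- as.vector(scale(x, center = min(x), scale = max(x) - min(x)))
all.equal(rescaled, rescaled2)  # TRUE
```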

```r
###########################################################
# Use min-max normalization to re-scale numeric variables #
###########################################################
num.vars <- data[, sapply(data, is.numeric)]
maxs <- apply(num.vars, 2, max)
mins <- apply(num.vars, 2, min)
processed <- as.data.frame(scale(num.vars, center = mins, scale = maxs - mins))
apply(processed, 2, range)  # Check that all variables are rescaled to 0-1
remove(mins, maxs, num.vars)
```

We’ll also need to dummy-code our factor variables. Below, I use the dummy.code() function from the psych package to do this. In the same *for loop*, I also rename the resulting dummy-coded variables so they’re a little easier to map back to their original variables, and add them to the *processed* dataset produced in the preceding step.

```r
###############################
# Dummy-code factor variables #
###############################
# Select factor variables for dummy-coding
vars <- data[, sapply(data, is.factor)]
vars$y <- NULL  # Remove the dependent variable, which we want to keep as a factor
vars <- colnames(vars)

library(psych)
for (i in 1:length(vars)) {
  var <- vars[i]
  new <- dummy.code(data[, var])  # Convert factor to dummy variables
  # Rename dummy variables so they map back to their original variable
  names <- colnames(new)
  names <- paste(var, names, sep = ".")
  colnames(new) <- names
  remove(names)
  processed <- data.frame(processed, new)
  remove(new)
}
remove(i, vars, var)

##############################
# Add y to processed dataset #
##############################
table(data$y)
sum(is.na(data$y))
processed$y <- factor(data$y, levels = c("yes", "no"), ordered = TRUE)
table(processed$y)
levels(processed$y)
```

Take note that I’ve deliberately specified the order of the dependent variable’s levels – I explain why this is necessary below when we run the model using caret.

Lastly, we’ll split the processed data into train and test datasets. I’m using the caret package’s createDataPartition() function (discussed in my April 2016 post on caret) to do this, to ensure that the train and test datasets have approximately equal proportions of responders and non-responders.

```r
##################################
# Create train and test datasets #
##################################
library(caret)
set.seed(1)
inTrain <- createDataPartition(y = processed$y, # the outcome variable
                               p = 0.75,        # the percentage of data in the training sample
                               list = FALSE)    # should the results be in a list?
train <- processed[inTrain, ]
test <- processed[-inTrain, ]
remove(inTrain)
```

**Run neural network models and generate predictions: *nnet* package**

There are two neural network packages in R that are typically used for this kind of task, nnet and neuralnet. I prefer the nnet package because it’s easier to manually adjust the weight decay. nnet does have a limitation, though: you’re restricted to one hidden layer. In practice, additional hidden layers don’t usually increase the model’s predictive power much, so I’m comfortable with this trade-off.

The primary tuning parameters to adjust in an nnet model are “size = ”, the number of nodes in the hidden layer, and “decay = ”, the size of the weight decay. Here, I run the model in nnet with some arbitrary values for these two tuning parameters. Below, we’ll run the model again using caret and a matrix of possible values, so caret can help us select the optimal combination. The “linout = ” argument is short for “linear output” and designates whether you have a continuous outcome variable (and are doing regression) or whether your dependent variable is a factor and you’re doing classification. “linout = FALSE” is actually the default here, but I include it to make this aspect of the model explicit.

```r
library(nnet)
set.seed(1)
model.nn <- nnet(y ~ ., data = train, size = 10, maxit = 1000, decay = .01, linout = FALSE)
preds.nn <- predict(model.nn, newdata = test, type = "class")
results.nn <- table(predicted = preds.nn, true = test$y)
results.nn

# Note: predict(type = "class") returns character values, so the table's rows
# sort alphabetically ("no", "yes") while its columns follow the factor's level
# order ("yes", "no"). Index by name to avoid relying on position.
# Overall accuracy = correct predictions / all predictions
accuracy.nn <- round((results.nn["yes", "yes"] + results.nn["no", "no"]) / sum(results.nn), 3)
# Sensitivity in predicting responders
sensitivity.nn <- round(results.nn["yes", "yes"] / sum(results.nn[, "yes"]), 3)
# Specificity in predicting non-responders
specificity.nn <- round(results.nn["no", "no"] / sum(results.nn[, "no"]), 3)
```

**Run neural network models and generate predictions: *caret* package**

```r
library(nnet)
library(caret)

# Grid of tuning parameters to try:
fitGrid <- expand.grid(.size = c(5, 10, 15), .decay = c(0.001, 0.01, 0.1))

# Set the seeds
# Necessary for reproducibility because we're using parallel processing
set.seed(1)
seeds <- vector(mode = "list", length = 6) # number of resamples + 1 for the final model
for (i in 1:5) seeds[[i]] <- sample.int(n = 1000, 9) # 9 is the number of tuning parameter combinations
seeds[[6]] <- 1 # for the last model
remove(i)
seeds

library(pROC)
fitControl <- trainControl(method = "cv",  # use N-fold cross-validation
                           number = 5,     # the number of folds
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           seeds = seeds,
                           allowParallel = TRUE)

library(doParallel)
cl <- makeCluster(6)
registerDoParallel(cl)

# Fit model
model.ct.nn <- train(y ~ ., data = train,
                     method = "nnet",
                     maxit = 1000,
                     linout = FALSE,
                     trControl = fitControl,
                     tuneGrid = fitGrid,
                     metric = "Sens") # maximize sensitivity to "yes" values
```

Important note: When you’re doing classification, as we are here, caret doesn’t magically know which level of your dependent variable indicates the presence of whatever you’re trying to predict, and which level indicates its absence. By default, it assumes that the first level of the variable represents its presence. This has implications for the sensitivity and specificity metrics, and is particularly important when you’re looking to maximize one of these, as we are here.

If you look back to the code section where we created the *processed* dataset, you’ll notice that we re-ordered the dependent variable *y*, such that the levels were “yes” and then “no”. We did this so that “yes” will be the first level of the factor variable, so sensitivity now properly refers to detecting “yes” values.
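As a quick, self-contained illustration (with toy values, independent of our dataset) of why the explicit level order matters:

```r
# Toy outcome vector (illustrative only)
outcome <- c("no", "yes", "no", "no", "yes")

# Without specifying levels, factor() sorts them alphabetically, so "no" comes first
default.levels <- levels(factor(outcome))
default.levels  # "no"  "yes" -- caret would treat "no" as the class of interest

# Specifying levels explicitly puts "yes" first
explicit.levels <- levels(factor(outcome, levels = c("yes", "no")))
explicit.levels  # "yes" "no"  -- sensitivity now refers to detecting "yes"
```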

```r
stopCluster(cl)
remove(cl)
registerDoSEQ()
model.ct.nn
```

The results indicate that a model with 15 nodes in the hidden layer and a weight decay of 0.001 produced the highest sensitivity to detecting campaign responders.

```r
############################################
# Generate predictions from caret-run nnet #
############################################
preds.ct.nn <- predict(model.ct.nn, newdata = test)
results.ct.nn <- table(predicted = preds.ct.nn, true = test$y)
results.ct.nn

# Here predict() returns a factor with levels ("yes", "no"), so rows and
# columns share the same order and the diagonal holds the correct predictions.
# Overall accuracy = sum of diagonal elements / sum of all elements
accuracy.ct.nn <- round(sum(diag(results.ct.nn)) / sum(results.ct.nn), 3)
# Sensitivity in predicting responders
sensitivity.ct.nn <- round(results.ct.nn["yes", "yes"] / sum(results.ct.nn[, "yes"]), 3)
# Specificity in predicting non-responders
specificity.ct.nn <- round(results.ct.nn["no", "no"] / sum(results.ct.nn[, "no"]), 3)
```

We can compare the predictions from our original nnet model (model.nn) with those produced when we run nnet through caret (model.ct.nn). With the set.seed values that I've used here, the original model correctly detected 26.6% of campaign responders, whereas the caret model detected 28.1% of campaign responders.

**Visualize Neural Network**

Lastly, if we want to visualize the final caret model, we can graph it using the NeuralNetTools package.

```r
library(NeuralNetTools)
plotnet(model.ct.nn)
```

If you try this and get an error that says “Error in plot.new() : figure margins too large”, try expanding the size of your RStudio “Plots” window by manually dragging the dividers.

Given that our model has 61 predictors and the final model selected in caret has 15 nodes in the hidden layer, the graph is not particularly interpretable, but this gives you a sense for how to generate a neural network graph of your model.

To generate the graph I’ve posted at the beginning of this post, run this much-simplified model with only 8 predictors and 1 node in the hidden layer:

```r
# Subset data to run a simplified model and generate a sample graph
graph.data <- subset(train, select = c("age", "campaign", "previous", "emp.var.rate",
                                       "cons.price.idx", "cons.conf.idx",
                                       "euribor3m", "nr.employed", "y"))
model.graph.nn <- nnet(y ~ ., data = graph.data, size = 1, maxit = 1000, decay = .01, linout = FALSE)
plotnet(model.graph.nn)
```

Complete code for this blog post can be found on GitHub at: https://github.com/kc001/Blog_code/blob/master/2016.06%20Neural%20network.R.