Random Forest for Loan Performance Prediction

Random Forests are among the most powerful predictive analytic tools. They leverage the considerable strengths of decision trees, including handling non-linear relationships, being robust to noisy data and outliers, and determining predictor importance for you. Unlike single decision trees, however, they don’t need to be pruned, are less prone to overfitting, and produce aggregated results that tend to be more accurate.

This post presents code to prepare data for a random forest, run the analysis, and examine the output.

The specific question I answer with these analyses is: what is the predicted percentage of loan principal that will have been re-paid by the time the loan reaches maturity? I’m using publicly-available, 2007-2011 data from the Lending Club for these analyses. You can obtain the data here.

For purposes of these analyses, I'm treating the outcome we're modeling - loan performance - as a continuous variable and conducting a regression-based random forest, rather than a classification-based one. Another approach to this kind of task is to divide loans into categories (e.g., "good" and "bad") and classify them accordingly.

Clean and prepare data

# Set working directory to wherever you saved the data
setwd()

# Import file

# Set missing data to NA
is.na(loan) <- loan == ""

# Create date variables that R recognizes as dates
loan\$issue_d <- as.Date(paste('15', loan\$issue_d), format='%d %b-%y')
loan\$earliest_cr_line <- as.Date(paste('15', loan\$earliest_cr_line), format='%d %b-%y')

# Identify loans that have already come to term
loan\$term.months <- NA
loan\$term.months[loan\$term==" 36 months"] <- 36
loan\$term.months[loan\$term==" 60 months"] <- 60
# the "term" variable is redundant with the numerical term.months variable
loan\$term <- NULL

library(lubridate)
loan\$maturity.date <- loan\$issue_d + months(loan\$term.months)
today <- Sys.Date()
loan\$mature <- ifelse(loan\$maturity.date < today, 1, 0)
loan\$maturity.date <- NULL
remove(today)

# subset data to select only mature loans
loan <- subset(loan, mature==1)

# Convert character percentages to numeric variables
loan\$int_rate <- as.numeric(gsub("%" , "", loan\$int_rate))
loan\$revol_util <- as.numeric(gsub("%" , "", loan\$revol_util))

# Convert character employment length to numeric variable
# This produces some missing data for values of emp_length that were "n/a"
loan\$emp_length <- gsub(" years" , "", loan\$emp_length)
loan\$emp_length <- gsub(" year" , "", loan\$emp_length)
loan\$emp_length <- ifelse(loan\$emp_length == "10+", 10, loan\$emp_length)
loan\$emp_length <- ifelse(loan\$emp_length == "< 1", 0.5, loan\$emp_length)
loan\$emp_length <- as.numeric(loan\$emp_length)

# Convert character to ordinal variable

# Remove variables where more than 20% of the observations are missing values
loan <- loan[, colMeans(is.na(loan)) <= .20]

# randomForest can only accommodate factor variables with 32 or fewer levels
# Remove factor vars with too many levels
too.many.levels <- function(x) {
is.factor(x) == TRUE & length(levels(x)) > 32
}
delete <- lapply(loan, too.many.levels)
loan <- loan[, delete == FALSE]
remove(too.many.levels, delete)

# Calculate the percentage of loan paid back
# This is the outcome variable we will be looking to model in the training data,
# and predict in the test data
loan\$paid.back <- (loan\$funded_amnt - loan\$out_prncp)/loan\$funded_amnt
hist(loan\$paid.back)
range(loan\$paid.back, na.rm = TRUE)

Most loans have been paid back in their entirety (these are the values stacked up at 1). This makes sense because these are loans that presumably went through some sort of initial vetting process and passed before the Lending Club issued them. (Additionally, the Lending Club makes this loan data publicly-available, so they probably feel good about having potential investors see it.) Among the loans that weren’t paid back in full, a significant portion of the loan has still been repaid: the smallest percentage of repaid principal is 83%.

# Remove accounts with missing paid.back status
loan <- subset(loan, ! is.na(loan\$paid.back))

One of the strengths of decision trees and random forests is that you don’t have to spend a lot of time up front carefully selecting your predictors – the analysis will let you know which ones are most important. For this reason, I like to ask R to model all possible predictors (using the “randomForest(dependentVariable ~ . , …)” syntax. Our particular dataset includes some predictors that refer to things that would have happened after the loan originated, though, like the current loan status. We have to remove these, or we’re not accurately simulating the challenge of predicting, before the loan has been issued, how much of it will be paid back.

# Remove variables that provide additional outcome data about the loan
loan\$last_pymnt_amnt <- NULL # Last total payment amount received
loan\$total_pymnt <- NULL # Payments received to date for total amount funded
loan\$total_pymnt_inv <- NULL # Payments received to date for portion of total amount funded by investors
loan\$total_rec_prncp <- NULL # total recovered principal
loan\$out_prncp <- NULL # Remaining outstanding principal for total amount funded
loan\$out_prncp_inv <- NULL # Remaining outstanding principal for portion of total amount funded by investors
loan\$total_rec_int <- NULL # Interest received to date
loan\$total_rec_late_fee <- NULL # Late fees received to date
loan\$collection_recovery_fee <- NULL # post charge off collection fee
loan\$recoveries <- NULL # amount recovered after loan is charged off
loan\$loan_status <- NULL
loan\$last_pymnt_d <- NULL # Last month payment was received
loan\$next_pymnt_d <- NULL # Next scheduled payment date
loan\$last_credit_pull_d <- NULL # most recent month LC pulled credit for this loan
# Remove variables where all values are the same
loan <- loan[sapply(loan, function(x) length(levels(factor(x)))>1)]

# check the amount of missing data in remaining dataset
lapply(loan, function(x) { sum(is.na(x)) })

Random forests calculate error on out-of-bag observations, so technically you don’t need separate training and testing datasets. I create them here just because it simplifies the example a bit to have distinct training and testing datasets.

##################################
# Create train and test datasets #
##################################

library(dplyr)
training.per <- 0.75
training.n <- round((nrow(loan)*training.per), 0)
train <- sample_n(loan, training.n)
remove(training.n, training.per)
train.nums <- unique(train\$member_id)

test <- subset(loan, !(member_id %in% train.nums))
remove(train.nums)
# predict() won't work for cases that are missing any of the predictor variables
test <- test[complete.cases(test),]
row.names(test) <- NULL

# Remove ID number so randomForest doesn't try to use it as a predictor
train\$member_id <- NULL
test\$member_id <- NULL

Run the Random Forest

##############################
# Random Forest - Regression #
##############################

library(randomForest)
rf.model <- randomForest(paid.back ~ .,
data = train,
ntree = 500,
type="regression",
importance=TRUE,
na.action=na.omit)

print(rf.model) # view results
importance <- as.data.frame(importance(rf.model)) # importance of each predictor
names(importance)[names(importance)=="%IncMSE"] <- "IncMSE"
importance <- importance[order(-importance\$IncMSE),]

This provides us with a sorted list of predictor importance. Larger “IncMSE” values indicate that the variable is more important.

Strangely, we see that issue_d (the loan's issue date) is a very important predictor, even though we filtered the data to only include loans that had already come to term. The next pieces of code investigate this further.

# loan issue date is the most important predictor
library(ggplot2)
library(scales)
issue_d_plot <- ggplot(loan, aes(x=issue_d, y=paid.back)) + geom_point()
issue_d_plot <- issue_d_plot + scale_y_continuous(labels=percent)
issue_d_plot <- issue_d_plot + ylab("Percentage of Loan Paid Back")
issue_d_plot
remove(issue_d_plot)

This graph illustrates that most of the Lending Club’s loans from 2007-2011 were repaid in full (most data points are at 100%). Loans issued between September 2010 and March 2011, however, are less likely to have been repaid in full. The data don’t tell us why this happened – the Lending Club may have changed their lending criteria during this period or something like that. This figure lets us know that the variable importance results do make sense, given the data, even though issue date may be a much less important predictor during other time periods, or possibly for other banks during the same time period.

# Graph error rate as a function of number of decision trees using ggplot
plot.data <- as.data.frame(plot(rf.model))
colnames(plot.data) <- c("Error")
plot.data\$trees <- as.numeric(rownames(plot.data))

options(scipen = 999)
library(ggplot2)
library(scales)
rf.plot <- ggplot(plot.data, aes(x=plot.data\$trees, y=plot.data\$Error)) + geom_line(colour="#000099")
rf.plot <- rf.plot + xlab("Number of Decision Trees")
rf.plot <- rf.plot + ylab("Mean Squared Error")
rf.plot <- rf.plot + ggtitle("Mean Squared Error by Number of Decision Trees")
rf.plot
remove(rf.plot, plot.data)

This figure demonstrates one of the advantages random forests have over single decision trees - you can see that error drops as the number of trees grows.

Using the Model for Prediction

As I mentioned above, randomForest will give us the out-of-bag error rate on the training data, which is a measure of prediction error for observations when they weren’t included in a given bootstrapped sample of the data. The mean of squared  residuals in the specific training set I’m working with is ~0.000008, so we’d expect something similar when we apply the model to the test data.

# Use the model to predict outcomes in new data
rf.model.preds <- predict(object = rf.model, newdata = test) # Predict the test data
results <- data.frame(actual = round(test\$paid.back, 2), predicted = round(rf.model.preds, 2))
remove(rf.model.preds)

# Examine mean squared error in test data
results\$residual <- results\$actual - results\$predicted
results\$residual2 <- results\$residual^2
mean(results\$residual2)

Given that we treated loan performance as a continuous outcome, it makes sense to use a continuous measure of prediction accuracy, like mean squared error. Another way to look at the success of our loan performance predictions, however, is the binary measure of how often our predicted paid.back value for a loan was exactly right.

# Identify correct predictions
results\$correct.prediction <- ifelse(results\$actual==results\$predicted, 1, 0)
table(results\$correct.prediction)

The value I get for the percent of correct predictions, in my test data set, is ~98%.

remove(rf.model, results, importance, test, train)