Customer segmentation is a deceptively simple-sounding concept. Broadly speaking, the goal is to divide customers into groups that share certain characteristics. There is an almost infinite number of characteristics upon which you could divide customers, however, and the optimal characteristics and analytic approach vary depending upon the business objective. This means that there is no single, correct way to perform customer segmentation.

In this post, I work through a practical example that, in my experience, closely mirrors the challenges of performing this kind of analysis with real data.

**First, why use clustering?**

Customer segmentation is often performed using unsupervised clustering techniques (e.g., *k*-means, latent class analysis, hierarchical clustering, etc.), but customer segmentation results tend to be most actionable for a business when the segments can be linked to something concrete (e.g., customer lifetime value, product proclivities, channel preference, etc.). This raises the question: if you’re looking to link the segments to some sort of dependent variable, why not use an analytic technique that explicitly estimates the relationship between your possible predictors and the dependent variable?

The primary reason is that clustering creates groups directly from your (typically continuous) input variables, so if groups are what you’re after, clustering does a really nice job of finding the boundaries between them for you. In situations where there is a dependent variable of interest, it is generally included as an input variable in the cluster analysis, so the clusters can be interpreted in light of this outcome variable. Clustering can also be used for exploratory purposes – it may be useful just to get a picture of typical customer characteristics at varying levels of your outcome variable.

**Second, an important caveat**

All this comes with an important warning, though. Clustering assumes that there *are* distinct clusters in the data. In my experience, customers are often distributed more or less continuously in multivariate space, and there aren’t necessarily distinct, underlying groups. I’ve deliberately selected a dataset for this blog post for which this is true, so you can see a worked example that grapples with this thorny issue.
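To see this caveat in action, here’s a minimal sketch using simulated (entirely hypothetical) data: *k*-means will cheerfully carve uniformly distributed, cluster-free points into however many “clusters” you request, and the fit statistics can still look respectable.

```r
# Simulated, cluster-free data: k-means returns k clusters regardless of
# whether any true cluster structure exists
set.seed(123)
no.structure <- data.frame(x = runif(1000), y = runif(1000))

fit <- kmeans(no.structure, centers = 3, nstart = 20)
fit$betweenss / fit$totss # "variance explained" can look decent anyway

plot(no.structure, col = fit$cluster, pch = 19,
     main = "k-means imposes 3 clusters on cluster-free data")
```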

**Third, an important distinction**

Customer segmentation is the process of dividing customers into groups based upon certain boundaries; clustering is *one* way to generate these boundaries. (Many thanks to the Mixotricha blog for articulating this distinction.)

**The example in this blog post**

Customer segmentation can be performed using a variety of different customer characteristics. The most common are geographical region, demographics (e.g., age, gender, marital status, income), psychographics (e.g., values, interests, lifestyle, group affiliations), and purchase behavior (e.g., previous purchases, shipping preferences, page views on your website, etc.).

The dataset I use for this blog post contains behavioral data because, in my experience, this is the most common kind of data to have available. Although it would be wonderful to have demographic and psychographic data about all customers, it’s rare to have this without a survey specifically designed to collect it – and even then, you only have it for the customers who responded to the survey. It’s much more common to just have a list of customer accounts, invoices, invoice dates and times, products purchased, and ship-to locations.

In this situation, how might customers be segmented? This depends upon the business objective, of course, but a common goal is to identify high- and low-value customers for marketing purposes. I adopt this objective for this post.

I chose to calculate and work with metrics for each customer’s recency of last purchase, frequency of purchase, and monetary value. These three variables, collectively known as RFM, are often used in customer segmentation for marketing purposes. Lastly, I’m using the *k*-means clustering technique because it can efficiently handle large datasets and iterates quickly to good solutions.

You can obtain the dataset I'm using in this blog post from the University of California at Irvine machine learning repository at: https://archive.ics.uci.edu/ml/datasets/Online+Retail.

```r
library(XLConnect)

raw.data <- readWorksheet(loadWorkbook("Online Retail.xlsx"), sheet=1)
data <- raw.data # the data takes a couple of minutes to import, so keep raw.data as a backup
str(data)
```

Some observations are missing customer ID numbers. We’ll be doing the analysis at the customer level, so remove any observations with missing customer IDs.

```r
length(unique(data$CustomerID))
sum(is.na(data$CustomerID))
data <- subset(data, !is.na(data$CustomerID))
```

There’s a little over one year’s worth of data in the downloaded dataset. To keep things tidy, I restrict the date range to one full year.

```r
range(data$InvoiceDate)
data <- subset(data, InvoiceDate >= "2010-12-09")
range(data$InvoiceDate)
```

There’s some research indicating that customer clusters vary by geography, so here I’ll restrict the data to one geographic unit.

```r
table(data$Country)
data <- subset(data, Country == "United Kingdom")

length(unique(data$InvoiceNo))
length(unique(data$CustomerID))
```

We now have a dataset of 19,140 unique invoices and 3,891 unique customers.

To calculate the recency and frequency variables below, it will be necessary to distinguish invoices with purchases from invoices with returns.

```r
# Identify returns: a "C" in the InvoiceNo flags a cancellation
data$item.return <- grepl("C", data$InvoiceNo, fixed=TRUE)
data$purchase.invoice <- ifelse(data$item.return, 0, 1)
```

**RFM Variables**

The original dataset was organized long, with invoices nested within customer. Below, I create a customer-level dataset and add recency, frequency, and monetary value data to it. The recency variable refers to the number of days that have elapsed since the customer last purchased something (so, smaller numbers indicate more recent activity on the customer’s account). Frequency refers to the number of invoices with purchases during the year. Monetary value is the amount that the customer spent during the year. Some customers have negative monetary values. These customers probably returned something during the year that they had purchased before the year started, so I reset their monetary value to zero.

```r
#################################
# Create customer-level dataset #
#################################

customers <- as.data.frame(unique(data$CustomerID))
names(customers) <- "CustomerID"

###########
# Recency #
###########

data$recency <- as.Date("2011-12-10") - as.Date(data$InvoiceDate)

# Remove returns so we only consider the date of the most recent *purchase*
temp <- subset(data, purchase.invoice == 1)

# Obtain number of days since most recent purchase
recency <- aggregate(recency ~ CustomerID, data=temp, FUN=min, na.rm=TRUE)
remove(temp)

# Add recency to customer data
customers <- merge(customers, recency, by="CustomerID", all=TRUE, sort=TRUE)
remove(recency)
customers$recency <- as.numeric(customers$recency)

#############
# Frequency #
#############

customer.invoices <- subset(data, select = c("CustomerID", "InvoiceNo", "purchase.invoice"))
customer.invoices <- customer.invoices[!duplicated(customer.invoices), ]
customer.invoices <- customer.invoices[order(customer.invoices$CustomerID),]
row.names(customer.invoices) <- NULL

# Number of invoices/year (purchases only)
annual.invoices <- aggregate(purchase.invoice ~ CustomerID, data=customer.invoices, FUN=sum, na.rm=TRUE)
names(annual.invoices)[names(annual.invoices)=="purchase.invoice"] <- "frequency"

# Add number of invoices to customers data
customers <- merge(customers, annual.invoices, by="CustomerID", all=TRUE, sort=TRUE)
remove(customer.invoices, annual.invoices)

range(customers$frequency)
table(customers$frequency)

# Remove customers who have not made any purchases in the past year
customers <- subset(customers, frequency > 0)

###############################
# Monetary Value of Customers #
###############################

# Total spent on each item on an invoice
data$Amount <- data$Quantity * data$UnitPrice

# Aggregated total sales to customer
annual.sales <- aggregate(Amount ~ CustomerID, data=data, FUN=sum, na.rm=TRUE)
names(annual.sales)[names(annual.sales)=="Amount"] <- "monetary"

# Add monetary value to customers dataset
customers <- merge(customers, annual.sales, by="CustomerID", all.x=TRUE, sort=TRUE)
remove(annual.sales)

# Identify customers with negative monetary values, as they were presumably
# returning purchases from the preceding year, and reset their value to zero
hist(customers$monetary)
customers$monetary <- ifelse(customers$monetary < 0, 0, customers$monetary)
hist(customers$monetary)
```

**80/20 Rule**

If you aren’t familiar with the 80/20 Rule (also known as the Pareto Principle), it’s the concept that 80% of the results generally come from 20% of the causes. In this context, it implies that ~80% of sales would be produced by the top ~20% of customers. These 20% represent the high-value, important customers a business would want to protect.

To make a point about outliers below, I create some simple segments here by looking at the top customers who produced 80% of annual sales for the year. In this dataset, 80% of the annual sales are produced by the top 29% of customers, so the percentage isn’t quite 20%, but it’s not that far off and it does illustrate that there’s a smallish segment producing the bulk of the value.

```r
customers <- customers[order(-customers$monetary),]

# Apply Pareto Principle (80/20 Rule)
pareto.cutoff <- 0.8 * sum(customers$monetary)
customers$pareto <- ifelse(cumsum(customers$monetary) <= pareto.cutoff, "Top 20%", "Bottom 80%")
customers$pareto <- factor(customers$pareto, levels=c("Top 20%", "Bottom 80%"), ordered=TRUE)
levels(customers$pareto)
round(prop.table(table(customers$pareto)), 2)
remove(pareto.cutoff)

customers <- customers[order(customers$CustomerID),]
```

**Preprocess data**

*k*-means clustering requires continuous variables and works best with relatively normally distributed, standardized input variables. Standardizing the input variables is quite important; otherwise, input variables with larger variances will have commensurately greater influence on the results. Below, I transform our three input variables to reduce positive skew and then standardize them as z-scores.

```r
# Log-transform positively-skewed variables
customers$recency.log <- log(customers$recency)
customers$frequency.log <- log(customers$frequency)
customers$monetary.log <- customers$monetary + 0.1 # can't take log(0), so add a small value to remove zeros
customers$monetary.log <- log(customers$monetary.log)

# Z-scores
customers$recency.z <- scale(customers$recency.log, center=TRUE, scale=TRUE)
customers$frequency.z <- scale(customers$frequency.log, center=TRUE, scale=TRUE)
customers$monetary.z <- scale(customers$monetary.log, center=TRUE, scale=TRUE)
```

**Visualize data**

With *k*-means clustering, the user has to specify the number of clusters in advance. I like to look at the data first to get a sense for what I’m dealing with and how many clusters I might have. In the graphs below, the outcome we’re probably most interested in, customer monetary value, is plotted on the y-axis. Frequency of purchases is on the x-axis, and I chose to represent the third variable, recency of purchase, by color-coding the data points. Lastly, I included the 80/20 Rule segments so I could map those designations onto customer monetary value, frequency, and recency.

```r
library(ggplot2)
library(scales)

# Original scale
scatter.1 <- ggplot(customers, aes(x = frequency, y = monetary))
scatter.1 <- scatter.1 + geom_point(aes(colour = recency, shape = pareto))
scatter.1 <- scatter.1 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.1 <- scatter.1 + scale_colour_gradient(name="Recency\n(Days since Last Purchase)")
scatter.1 <- scatter.1 + scale_y_continuous(labels=dollar)
scatter.1 <- scatter.1 + xlab("Frequency (Number of Purchases)")
scatter.1 <- scatter.1 + ylab("Monetary Value of Customer (Annual Sales)")
scatter.1
```

This first graph uses the variables’ original metrics and is almost completely uninterpretable. There’s a clump of data points in the lower left-hand corner of the plot, and then a few outliers. This is why we log-transformed the input variables.

```r
# Log-transformed
scatter.2 <- ggplot(customers, aes(x = frequency.log, y = monetary.log))
scatter.2 <- scatter.2 + geom_point(aes(colour = recency.log, shape = pareto))
scatter.2 <- scatter.2 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.2 <- scatter.2 + scale_colour_gradient(name="Log-transformed Recency")
scatter.2 <- scatter.2 + xlab("Log-transformed Frequency")
scatter.2 <- scatter.2 + ylab("Log-transformed Monetary Value of Customer")
scatter.2
```

Ah, this is better. Now we can see a scattering of high-value, high-frequency customers in the top right-hand corner of the graph. These data points are dark, indicating that they’ve purchased something recently. In the bottom left-hand corner of the plot, we can see a couple of low-value, low-frequency customers who haven’t purchased anything recently, with a range of values in between.

Importantly, we can see that the data points are fairly continuously-distributed. There really aren’t clear clusters. This means that any cluster groupings we create won’t exactly reflect some true, underlying group membership – they’ll be somewhat arbitrary (albeit reasonable) distinctions that we draw for our own purposes.

**Handling outliers**

One question we might have about those dots in the bottom, left-hand corner is how many customers they represent. The following code investigates them a little more thoroughly.

```r
# How many customers are represented by the two data points in the
# lower left-hand corner of the plot? 18
delete <- subset(customers, monetary.log < 0)
no.value.custs <- unique(delete$CustomerID)
delete2 <- subset(data, CustomerID %in% no.value.custs)
delete2 <- delete2[order(delete2$CustomerID, delete2$InvoiceDate),]
remove(delete, delete2, no.value.custs)
```

The eighteen no-value customers we just investigated are all customers who returned everything they bought. They highlight an important decision point at this juncture in the analysis. *k*-means clustering tends to be sensitive to outliers, such that outliers will sometimes end up being clustered together in their own tiny group. This is often cited as a reason to exclude them from the analysis. In this type of customer segmentation, however, the outliers may be the most important customers to understand. In the top right-hand corner, we have customers who are outliers in terms of being extraordinarily high-value, high-frequency shoppers. These data points are all represented with little triangles, indicating that they’re in the “Top 20%” category. These are important customers to understand, because they’re the customers we most want. At the other end of the continuum, we have the no-value customers in the bottom left-hand corner. These customers, too, may be important to model and understand – they’re the customers we want to minimize. I deliberately include these outliers in the analysis.

```r
# Scaled variables
scatter.3 <- ggplot(customers, aes(x = frequency.z, y = monetary.z))
scatter.3 <- scatter.3 + geom_point(aes(colour = recency.z, shape = pareto))
scatter.3 <- scatter.3 + scale_shape_manual(name = "80/20 Designation", values=c(17, 16))
scatter.3 <- scatter.3 + scale_colour_gradient(name="Z-scored Recency")
scatter.3 <- scatter.3 + xlab("Z-scored Frequency")
scatter.3 <- scatter.3 + ylab("Z-scored Monetary Value of Customer")
scatter.3

remove(scatter.1, scatter.2, scatter.3)
```

This third scatterplot (not pictured) is basically identical to the second – it illustrates that even though we’ve changed the scaling for the analysis, the shape of the distributions and the relationships among the variables remain the same.

**Determine number of clusters / run k-means**

Given that a visual overview of the data didn’t suggest an obvious choice for the number of clusters, and we don’t have prior information (or a request from the business) to produce a specified number of clusters, the next challenge is to determine how many clusters to extract. This is a bit of a judgment call. There are some metrics we can use to guide decision-making here, and I briefly include them below, but it’s most important that the clusters make sense and meet our objective (i.e., identifying high- and low-value customers for the business).

For this reason, I create a *for* loop to run the *k*-means analysis with increasing numbers of clusters, each time generating a graph of the clusters, the cluster centers for each model, and information about the variance explained. All of this can assist in selecting the optimal number of clusters.

```r
# Select the z-scored input variables for clustering
preprocessed <- customers[, c("recency.z", "frequency.z", "monetary.z")]
j <- 10 # specify the maximum number of clusters you want to try out

models <- data.frame(k=integer(),
                     tot.withinss=numeric(),
                     betweenss=numeric(),
                     totss=numeric(),
                     rsquared=numeric())

library(plyr)

for (k in 1:j) {

  print(k)

  # Run kmeans
  # nstart = number of initial configurations; the best one is used
  # $iter will return the iteration used for the final model
  output <- kmeans(preprocessed, centers = k, nstart = 20)

  # Add cluster membership to customers dataset
  var.name <- paste("cluster", k, sep="_")
  customers[,(var.name)] <- output$cluster
  customers[,(var.name)] <- factor(customers[,(var.name)], levels = c(1:k))

  # Graph clusters
  cluster_graph <- ggplot(customers, aes(x = frequency.log, y = monetary.log))
  cluster_graph <- cluster_graph + geom_point(aes(colour = customers[,(var.name)]))
  colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')
  cluster_graph <- cluster_graph + scale_colour_manual(name = "Cluster Group", values=colors)
  cluster_graph <- cluster_graph + xlab("Log-transformed Frequency")
  cluster_graph <- cluster_graph + ylab("Log-transformed Monetary Value of Customer")
  title <- paste("k-means Solution with", k, "Clusters", sep=" ")
  cluster_graph <- cluster_graph + ggtitle(title)
  print(cluster_graph)

  # Cluster centers in original metrics
  print(title)
  cluster_centers <- ddply(customers, .(customers[,(var.name)]), summarize,
                           monetary=round(median(monetary), 2), # use median b/c this is the raw, heavily-skewed data
                           frequency=round(median(frequency), 1),
                           recency=round(median(recency), 0))
  names(cluster_centers)[names(cluster_centers)=="customers[, (var.name)]"] <- "Cluster"
  print(cluster_centers)
  cat("\n")
  cat("\n")

  # Collect model information
  models[k,("k")] <- k
  models[k,("tot.withinss")] <- output$tot.withinss # sum of all within-cluster sums of squares
  models[k,("betweenss")] <- output$betweenss
  models[k,("totss")] <- output$totss # betweenss + tot.withinss
  models[k,("rsquared")] <- round(output$betweenss/output$totss, 3) # proportion of variance explained by cluster membership

  assign("models", models, envir = .GlobalEnv)

  remove(output, var.name, cluster_graph, cluster_centers, title, colors)

}

remove(k)
```

For simplicity, I haven’t posted graphs for every model here, but somewhere from two to five clusters seems most interpretable.

A 2-cluster solution produces one group of high-value (median = $1,797.78), high-frequency (median = 5 purchases) customers who have purchased recently (median = 17 days since their most recent purchase), and one group of lower value (median = $327.50), low frequency (median = 1 purchase) customers for whom it's been a median of 96 days since their last purchase. Although these two clusters are clear and interpretable, this may be simplifying customer behavior too much.

As we add clusters, we gain insight into increasingly subtle distinctions between customers. I really like the 5-cluster solution. It gives us: a high-value, high-frequency, recent purchase group (cluster 5), a medium-value, medium-frequency, relatively-recent purchase group (cluster 2), two clusters of low-value, low-frequency customers broken down by whether their last purchase was recent or much earlier in the year (clusters 3 and 1, respectively), and lastly a no-value cluster whose median value to the business is $0.00 (cluster 4).
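For downstream reporting, it can help to attach descriptive names to the numeric cluster assignments. Here’s a minimal sketch; note that `segment.labels` and the number-to-label mapping are purely illustrative, since *k*-means numbers its clusters arbitrarily (the numbering depends on the random starts), so you’d need to verify the mapping against your own printed cluster centers.

```r
# Illustrative only: verify which cluster number is which against the
# printed cluster centers before assigning labels
segment.labels <- c("1" = "Low value, lapsed",
                    "2" = "Medium value",
                    "3" = "Low value, recently active",
                    "4" = "No value",
                    "5" = "High value")

customers$segment <- factor(segment.labels[as.character(customers$cluster_5)],
                            levels = segment.labels)
table(customers$segment)
```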

As we move beyond 5 clusters, the graphs become increasingly hard to interpret visually, and the cluster centers start to make distinctions that may not be that helpful (e.g., low-value-with-1-purchase vs. low-value-with-2-purchases customers).

There are a couple of other things we can check to assist us in selecting the optimal solution. The code below generates graphs of the variance explained and the within-cluster variance by number of clusters. These graphs are two different ways of visualizing the same information – in both cases, we’re looking for an “elbow” or bend in the graph beyond which additional clusters add little additional explanatory power. Both graphs look to have elbows at around 2 clusters, but a 2-cluster solution explains only 49% of the variance and, once again, may be too much of a simplification to really help the business with targeted marketing. The 5-cluster solution explains ~73% of the variance, but there are no clear elbows in the graph at this point.

```r
library(ggplot2)
library(scales)

# Graph variance explained by number of clusters
r2_graph <- ggplot(models, aes(x = k, y = rsquared))
r2_graph <- r2_graph + geom_point() + geom_line()
r2_graph <- r2_graph + scale_y_continuous(labels = scales::percent)
r2_graph <- r2_graph + scale_x_continuous(breaks = 1:j)
r2_graph <- r2_graph + xlab("k (Number of Clusters)")
r2_graph <- r2_graph + ylab("Variance Explained")
r2_graph

# Graph within sums of squares by number of clusters
# Look for a "bend" in the graph, as with a scree plot
ss_graph <- ggplot(models, aes(x = k, y = tot.withinss))
ss_graph <- ss_graph + geom_point() + geom_line()
ss_graph <- ss_graph + scale_x_continuous(breaks = 1:j)
ss_graph <- ss_graph + scale_y_continuous(labels = scales::comma)
ss_graph <- ss_graph + xlab("k (Number of Clusters)")
ss_graph <- ss_graph + ylab("Total Within SS")
ss_graph

remove(j, r2_graph, ss_graph)
```

Lastly, there is an R package that will look at a host of different fit indices and, using majority rule, suggest the number of clusters that most indices recommend.

```r
#########################################################
# Using NbClust metrics to determine number of clusters #
#########################################################

library(NbClust)
set.seed(1)

nc <- NbClust(preprocessed, min.nc=2, max.nc=7, method="kmeans")
table(nc$Best.nc[1,])
nc$All.index # estimates for each number of clusters on 26 different metrics of model fit

barplot(table(nc$Best.nc[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by Criteria")

remove(preprocessed)
```

The greatest number of indices recommend the 2-cluster solution.

How to proceed at this point is murky, but authentically so. At this juncture, it makes sense to show interested stakeholders the cluster solutions and get their input. The decision should be based upon how the business plans to use the results, and the level of granularity they want to see in the clusters.

If the business wants to use the results to understand a range of customer behavior from high- to low-value customers, I'd probably recommend the 5-cluster solution. I like that it distinguishes the no-value group of customers, whom the business probably wants to eliminate as much as possible, and also separates low-value, low-frequency customers who have purchased recently from those who have not. It may be easier to encourage recently-active customers to re-engage with the business and possibly develop into medium-value customers. That said, there isn't one correct decision here.
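To support that conversation with stakeholders, it can help to hand them a compact profile of each cluster. Here’s a minimal sketch (assuming the `cluster_5` assignments and the `plyr` package from the loop above; `profile` is just an illustrative name) summarizing each cluster’s size, median RFM values, and share of annual sales:

```r
library(plyr)

# One row per cluster: size, median RFM values, and total sales
profile <- ddply(customers, .(cluster_5), summarize,
                 n.customers = length(CustomerID),
                 median.recency = median(recency),
                 median.frequency = median(frequency),
                 median.monetary = round(median(monetary), 2),
                 total.sales = round(sum(monetary), 2))

# Each cluster's share of annual sales
profile$share.of.sales <- round(profile$total.sales / sum(profile$total.sales), 3)
profile
```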

**Three-Dimensional Representation of Clusters**

This code will give you a 3D rendering of your clusters, which is helpful when you’re working with three input variables, as is typical of an RFM analysis. The first graph features planes through each cluster that help the user understand the monetary value associated with the cluster. The second features ellipses that highlight the location of each cluster in three-dimensional space. In both graphs, the data points are color-coded to correspond to their associated cluster.

```r
colors <- c('red','orange','green3','deepskyblue','blue','darkorchid4','violet','pink1','tan3','black')

library(car)
library(rgl)

scatter3d(x = customers$frequency.log,
          y = customers$monetary.log,
          z = customers$recency.log,
          groups = customers$cluster_5,
          xlab = "Frequency (Log-transformed)",
          ylab = "Monetary Value (Log-transformed)",
          zlab = "Recency (Log-transformed)",
          surface.col = colors,
          axis.scales = FALSE,
          surface = TRUE, # produces the horizontal planes through the graph at each level of monetary value
          fit = "smooth",
          # ellipsoid = TRUE, # to graph ellipses, use this argument and comment out "surface = TRUE"
          grid = TRUE,
          axis.col = c("black", "black", "black"))

remove(colors)
```

Complete code for this post is available on GitHub at: https://github.com/kc001/Blog_code/blob/master/2016.08%20k-means%20customer%20segmentation.R