Parallel Processing

If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however - as the number of units increases, so does the processing time. For large data sets, the processing time associated with a sequential for loop can become long enough to be unworkable. Parallel processing is a really nice alternative in these situations. It makes use of your computer's multiple processing cores to work through your list of units simultaneously, rather than one at a time. This post presents code to: 

  1. Perform an analysis using a conventional for loop.
  2. Modify this code for parallel processing.

To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.

The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years?

# If you want your output to exactly match mine, specify the date.
# Otherwise, if you want to look at the past couple of years from 
# whenever you're running this code, ask the system to supply the date.

# today <- as.Date("2015-08-15")
today <- Sys.Date()
library(lubridate)
cut.off <- today - years(2) 

 

Plot data

For simplicity, I'll be treating change as linear in these analyses. I use the code below to quickly get a visual overview of the data, and to confirm that treating change as linear is basically reasonable, which it is.

graph.data <- subset(data.long, date >= cut.off & date <= today)

# I'm using loess smoothing because this allows us to get a sense for 
# the shape of the relationship between date and active_addresses.
library(ggplot2)
library(scales)
graph <- ggplot(graph.data, aes(x=date, y=active_addresses, 
                                group=zipcode, colour=zipcode))
graph <- graph + stat_smooth(method="loess", se=FALSE)
graph <- graph + scale_y_continuous(labels = comma)
graph <- graph + xlab("Date")
graph <- graph + ylab("Number of Active Addresses")
graph <- graph + ggtitle("Number of Active Addresses by Zip Code")
print(graph)
remove(graph, graph.data)

Now we'll estimate and output the regression parameter for each zip code using a conventional, sequential for loop.

 

Conventional for loop

# Create list of unique.IDs to loop through
ids <- unique(data.long$unique.ID)

output <- data.frame()

Sys.time()
for (i in seq_along(ids)) {

    id <- ids[i]

    # Subset data for one zip code
    temp <- subset(data.long, 
                   unique.ID == id & date >= cut.off & date <= today, 
                   select=c("date", "active_addresses"))

    # Regress active_addresses on date to obtain the change in active addresses over time
    reg.model <- lm(active_addresses ~ date, data=temp)
    trend <- as.numeric(reg.model$coefficients["date"])
    trend.pvalue <- summary(reg.model)$coefficients[2,4]

    unique.ID <- as.numeric(id)

    new_zip <- cbind(unique.ID, trend, trend.pvalue)
    new_zip <- as.data.frame(new_zip)

    output <- rbind(output, new_zip)

}
Sys.time()

remove(i, id, new_zip, temp, reg.model, trend, trend.pvalue, unique.ID)

On my computer, with 75 zip codes, this code snippet takes approximately 1 second to run. Thus, there isn't a lot of need for parallel processing here, but there are close to 1,500 zip codes in the state of Louisiana, and 43,000 zip codes in the entire United States. If we were looking to design code that could easily accommodate a larger number of zip codes, or do something more processing-intensive with the data for each one, we might want a more scalable solution. This is how you could obtain the same output using parallel processing:
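
As a side note: rather than eyeballing the two Sys.time() stamps that bracket the loop, you can capture them and take the difference, or wrap the whole block in system.time(). A minimal sketch, with a toy loop standing in for the zip code loop above:

# Capture the time stamps and subtract them
start.time <- Sys.time()
for (i in 1:1e5) sqrt(i)   # stand-in for the real loop
Sys.time() - start.time

# Or wrap the block in system.time() and read off the "elapsed" entry
system.time({ for (i in 1:1e5) sqrt(i) })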

 

Parallel Processing

library(parallel)
detectCores() # Determine how many processing cores your computer has

library(foreach)
library(doParallel)

# Specify the number of cores to use for the analysis
cl <- makeCluster(4)
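# A common alternative is makeCluster(detectCores() - 1), which leaves
# one core free for the operating system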
registerDoParallel(cl)

# Obtain and output regression parameter for each zip code
Sys.time()
output.2 <- foreach(i=ids, .inorder = TRUE, .combine = "rbind") %dopar% { 

    id <- i

    # Subset data for one zip code
    temp <- subset(data.long, 
                   unique.ID == id & date >= cut.off & date <= today, 
                   select=c("date", "active_addresses"))

    # Regress active_addresses on date to obtain the change in active addresses over time
    reg.model <- lm(active_addresses ~ date, data=temp)
    trend <- as.numeric(reg.model$coefficients["date"])
    trend.pvalue <- summary(reg.model)$coefficients[2,4]

    unique.ID <- as.numeric(id)

    # The last expression in the body is the value returned for this zip code
    c(unique.ID, trend, trend.pvalue)

}
Sys.time()

stopCluster(cl)
remove(cl, ids, cut.off)
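
If you ran both the sequential and the parallel versions above, a quick, optional sanity check is to confirm that they produced the same estimates. This is just a sketch; it assumes output (from the for loop) and output.2 (from foreach) are both still in your workspace:

# Coerce the foreach result to a data frame with the same column names as 'output'
check <- as.data.frame(output.2)
colnames(check) <- c("unique.ID", "trend", "trend.pvalue")
rownames(check) <- NULL

# Both loops visit the zip codes in the same order, so the trends should match
all.equal(output$trend, check$trend)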

 

Make sense of the output

The code below processes the output for interpretation.

output.2 <- as.data.frame(output.2)
colnames(output.2) <- c("unique.ID", "trend", "trend.pvalue")
rownames(output.2) <- NULL

master <- subset(data.long, date=="2005-07-01", select=c("unique.ID","zipcode","parish"))

# Merge in zip code and parish information; this will help us interpret the output
output.2 <- merge(master, output.2, by="unique.ID")

# Reset non-significant parameter estimates to zero
output.2$trend[output.2$trend.pvalue > 0.05] <- 0

# Now sort parameter estimates in descending order to get a sense for which ones have the most growth
output.2 <- output.2[order(-output.2$trend),]
row.names(output.2) <- NULL
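
To peek at the zip codes with the largest estimated growth, for example:

head(output.2, 10)   # top 10 rows of the sorted output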

Depending upon when you're running this code, and whether you specified dates that match those I used, your list and parameter estimates might look a little different. 

This output shows us that zip code 70117, roughly corresponding to the Bywater, Holy Cross, and Lower 9th Ward neighborhoods, added new addresses most rapidly over the past two years. The trend estimate of 1.2 indicates that this zip code added a little over 1 address/day over the course of the past two years. (Of course, this doesn't tell us whether these were residential or commercial addresses.) R is clever about dates - Date values are stored internally as a number of days, so even though our date variable is only measured monthly, the parameter estimate the regression returns to us is measured in change per day, not per month.
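
If you want to convince yourself of this, note that R treats dates as days under the hood (a quick check with two arbitrary dates):

as.Date("2015-08-01") - as.Date("2015-07-01")   # Time difference of 31 days
as.numeric(as.Date("2015-08-01"))               # the number of days since 1970-01-01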

Zip code 70433, roughly corresponding to Covington, LA, north of New Orleans across Lake Pontchartrain, exhibited the second-highest growth in addresses during this period. This zip code added 0.80 of an address per day, or the equivalent of 4 new addresses in a 5-day period.

Although viewing the output in a table like this, and then mentally visualizing the New Orleans neighborhood or surrounding region to which it maps, is workable, it’s not particularly convenient. The easiest way to digest this information is graphically, superimposed on the geographical area to which it applies. In the next blog post, we’ll examine how to do this.