Parallel Processing for Memory-Intensive Maps and Graphics

Rendering graphics in R typically takes some time, so if you’re going to be producing a large number of similar graphics, it makes sense to leverage R's parallel processing capabilities. However, if you’re looking to collect and return the graphics together in a sorted object – as we were in the previous post on animated choropleths – there’s a catch: R has to keep the whole object in random access memory (RAM) during parallel processing. As the number of graphics files increases, you risk exceeding the available RAM, which will cause parallel processing to slow dramatically (or crash). In contrast, a good, old-fashioned sequential for loop can write updates to the object to the global environment after each iteration, clearing RAM for the next iteration. Paradoxically, then, parallel processing can take longer than sequential processing in this situation. In the case of the animated choropleths in the previous post, parallel processing took 21 minutes, whereas sequential processing took 11 minutes.
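
For reference, the all-at-once parallel approach looks something like the sketch below. This assumes the same dates vector that gets set up later in the post, and make_one_plot() is a hypothetical stand-in for whatever function renders a single graphic. Every plot the workers return accumulates in a single list, and that entire list has to fit in RAM at once.

library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

# All of the plots are collected into one list that must fit in RAM at once
all.plots <- foreach(i = dates, .inorder = TRUE) %dopar% {
  make_one_plot(i) # hypothetical function that renders and returns one graphic
}

stopCluster(cl)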

This post presents code to combine the efficiency and speed of parallel processing with the RAM-clearing benefits of sequential processing when generating graphics.

In the code below, I divide the graphics to be generated into buckets small enough that the graphics in each bucket fit comfortably in RAM. I use parallel processing to rapidly generate the graphics files within each bucket, then return the resulting files to the for loop, which saves them to the global environment and clears the deck to render the next bucket of graphics files. Essentially, this code nests parallel processing within a for loop.
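
In outline, the structure is a sketch like this (the full, working code below defines buckets and bucket.list and fills in the rendering step):

results <- list()
for (x in 1:buckets) {                                    # sequential outer loop, one iteration per bucket
  chunk <- foreach(i = bucket.list[[x]], .inorder = TRUE) %dopar% {
    # ... render and return one graphic ...
  }
  results <- c(results, chunk)                            # bank this bucket's output, freeing RAM for the next
}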

This code uses the same data set as the previous post. You can find the raw data here, and code to transform the data for these analyses here. The first part of the code sample below, in which I set up the proportion variable to graph, establish the choropleth color cut-off points, and identify the zip codes and dates, is the same as in the previous post. I include it here for easy reference.

Set up the analyses (same code as last time)

graph.data <- data.long

# Proportion variable to compare active_addresses to pre-Katrina levels
aug2005 <- subset(data.long, date=="2005-07-01", select=c("unique.ID","active_addresses"))
library(plyr) # for the rename() function
aug2005 <- rename(aug2005, c("active_addresses"="aug2005"))

graph.data <- merge(graph.data, aug2005, by="unique.ID")
remove(aug2005)

graph.data$proportion <- graph.data$active_addresses / graph.data$aug2005
graph.data$proportion <- round(graph.data$proportion, digits = 2)

# Focus graph on New Orleans
graph.data <- subset(graph.data, parish == "Orleans Parish" & zipcode != 70129)

# Convert continuous proportion variable to an ordinal variable with discrete levels
# Note that the variable to be graphed must be named "value" for the zip_choropleth function
# hist(graph.data$proportion)
cuts <- c(0, 0.25, 0.5, 0.75, 0.99, 1.01, 1.25)
graph.data$value <- cut(graph.data$proportion, cuts, labels = c("0-25%","25-50%","50-75%","75-99%","100%","101-125%"))
remove(cuts)

# Create region variable for zip_choropleth function
graph.data$region <- as.character(graph.data$zipcode)

# Zip codes to zoom in on
zips <- unique(graph.data$zipcode)

dates <- unique(graph.data$date)
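
Before moving on, a quick optional check (not part of the original workflow) can confirm that every proportion landed in one of the discrete levels; a proportion of exactly 0 or above 1.25 would fall outside the cut() breaks and come back as NA.

# Optional sanity checks
table(graph.data$value, useNA = "ifany") # NAs indicate proportions outside the cut() breaks
length(zips)  # number of zip codes in the zoom
length(dates) # number of monthly graphics to render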

Buckets

The next task is to determine the number of “buckets” you’ll need, which is also the number of iterations of the for loop. The bucket size determines that number: you want buckets that are as large as possible without exceeding RAM's limits. I found that a bucket size of around 20 graphics worked well for my system for this task.

# Determine number of date "buckets"
bucket.size <- 20 # the approximate size of each bucket; user sets this
n <- length(dates)
buckets <- n/bucket.size
remove(n)
as.integer(buckets) # display the truncated bucket count (rounding up is handled below)

# make sure the number of buckets is an integer, rounding up
# if the number of dates isn’t evenly divisible by the bucket size
buckets <- ifelse(as.integer(buckets)==buckets, buckets, as.integer(buckets + 1)) # equivalent to ceiling(buckets)

# Divide date list into bucketed sublists
bucket.list <- split(dates, factor(sort(rank(dates)%%buckets)))
bucket.list
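
If the split() call looks opaque: rank(dates) %% buckets assigns each date a group number, and sort() makes the groups contiguous, so the dates are divided into roughly equal, ordered chunks. A toy illustration:

# Toy example of the bucketing idiom (not part of the workflow)
x <- 1:10
split(x, factor(sort(rank(x) %% 3)))
# $`0`
# [1] 1 2 3
# $`1`
# [1] 4 5 6 7
# $`2`
# [1] 8 9 10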

For loop + Parallel Processing

Determine how many processing cores you have available, and specify the number you want to use for this task.

library(parallel)
detectCores()

library(foreach)
library(doParallel)

cl <- makeCluster(4) # number of worker processes; set this based on what detectCores() reports
registerDoParallel(cl)

choropleths <- list()

start.time <- Sys.time()

# Divert console output to an external log file to keep track of processing progress
sink("log.txt", append=TRUE)


# outer for loop: one iteration per bucket
for (x in 1:buckets) {

  print(x) # bucket number, diverted to log.txt by sink()
  bucket_to_use <- bucket.list[[x]]

  # Create list of choropleths for this loop
  choros <- list()

  # parallel processing nested within the for loop
  choros <- foreach(i=bucket_to_use, .inorder = TRUE, .packages=c("choroplethrZip","ggplot2") ) %dopar% {

    temp.date <- subset(bucket_to_use, subset = bucket_to_use == i)

    # write progress for this date to the log file
    j <- which(bucket_to_use==temp.date)
    cat(paste(j, "Starting date:", i, "\n"), file="log.txt", append=TRUE)

    df <- subset(graph.data, date == as.Date(temp.date, format='%Y-%m-%d'), select=c("year","month","region", "value"))

    date.field <- paste(df$month[1], df$year[1], sep="-")
    title <- paste("Proportion of Pre-Katrina Addresses Active", date.field, sep="\n")
    choro <- zip_choropleth(df, title=title, zip_zoom = zips, reference_map = TRUE) +
      scale_fill_manual(name = "Proportion of Pre-Katrina \nAddresses Active",
                        values=c("royalblue3", "lightskyblue1", "light green", "yellow","orange","red"),
                        breaks=c("0-25%","25-50%","50-75%","75-99%","100%","101-125%"),
                        drop=FALSE)
    remove(temp.date, df, date.field, title)
    return(choro) # foreach collects the returned plots, in order, into choros

  }

  # Now append output from this iteration to the output list in the global environment
  choropleths <- c(choropleths, choros)

  # assign updated output list back to the global environment
  assign("choropleths", choropleths, envir = .GlobalEnv)

}

sink() # stop diverting console output to log.txt
stop.time <- Sys.time()
stopCluster(cl) # stop parallel processing
remove(cl)
stop.time - start.time # total elapsed time
remove(start.time, stop.time)

remove(bucket_to_use, x, choros)
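
Because .inorder = TRUE and the buckets are processed in sequence, the combined list should come back with one choropleth per date, in date order. A quick check, and an example of pulling out a single map:

# Optional checks on the collected output
length(choropleths) == length(dates) # should be TRUE
# choropleths[[1]]                   # display the first month's map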

This approach cut the processing time for this task by more than 50%: from 11 minutes with the sequential for loop to a little over 5 minutes on my machine.