Crime maps interest just about everyone. Government officials are interested in the need for and success of intervention programs, law enforcement officials are interested in policing needs, and private citizens are concerned about their safety and the safety of loved ones. This post presents code to create an interactive Shiny application that will allow the user to specify an address, the type of crime, and the time of day - or not, and instead just zoom around as their curiosity dictates - and see mapped crime incidents with dynamically adjusting annual crime stats in that specific area.
The data for this post come from the Baton Rouge, Louisiana Crime Incidents dataset.
The Affordable Care Act encompasses a host of programs and provisions, which are the subject of much discussion and debate right now. This post offers a nonpartisan, genuinely curious exploration of one of the Affordable Care Act’s less frequently debated programs, the Shared Savings Program (SSP). The program’s objective is to reduce Medicare costs and increase the quality of care provided to Medicare patients. In this short sequence of posts, I explore whether the program looks to be meeting these objectives. Given that this topic may appeal to both non-technical and technical audiences, I’ve split the posts into a higher-level description of the findings (part I) and a more technical post with code (part II).
Data on ACOs and the Shared Savings Program is publicly available from data.CMS.gov. My last post includes detailed code to download and prepare the data for analysis.
This post grapples with the challenge of combining datasets that should in theory match, but that in practice have inconsistent ways of designating the same entity, slightly different variable names, and some different variables and/or different scaling for the variables they do have in common. If you regularly work with data, you're probably very familiar with these issues. This post includes some tricks I use to identify and resolve the differences in this kind of situation.
This post is the second in a two-part series in which I’m looking to answer the question: which technical skills are most in-demand among data scientists? This question pops up regularly on just about any website catering to data scientists, and it’s an understandable one. The field itself is very new – by most standards, a 21st century profession – and its parameters are still unclear. Additionally, its tools and techniques are evolving as rapidly as computing technology evolves, creating a discipline for which the required skills are in flux.
In the previous post, I collected data to answer this question by scraping job postings off the job board indeed.com. In this post, I use text analysis to analyze these job descriptions and highlight the skills that are most frequently mentioned.
This post addresses web scraping in R, using a RESTful web service API in conjunction with R’s RCurl and XML packages. It is the first in a two-part series in which I’m looking to answer the question: which technical skills are most in-demand among data scientists?
This post is an update to my Big Data Wrangling: Reshaping from Long to Wide post from May 2015. The original Big Data Wrangling post is among the most frequently viewed on this blog, so I suspect that lots of people are looking for efficient ways to reshape datasets from long to wide. This post presents a faster way to do that than the original post proposed, and uses a benchmarking package that helps quantify the time associated with different approaches.
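As a rough, language-neutral illustration of what reshaping from long to wide and benchmarking two approaches can look like, here is a minimal Python sketch using only the standard library. It is not the R code from the post: the records, the two function names, and the choice of `timeit` as the benchmarking tool are all assumptions made for the example.

```python
import timeit
from collections import defaultdict

# Hypothetical long-format records: (id, variable, value).
long_rows = [
    (1, "height", 170), (1, "weight", 65),
    (2, "height", 182), (2, "weight", 80),
]

def to_wide_scan(rows):
    """Naive approach: rescan all rows once per id."""
    ids = sorted({r[0] for r in rows})
    return {i: {var: val for rid, var, val in rows if rid == i} for i in ids}

def to_wide_single_pass(rows):
    """Alternative approach: build every wide row in one pass over the data."""
    wide = defaultdict(dict)
    for rid, var, val in rows:
        wide[rid][var] = val
    return dict(wide)

wide = to_wide_single_pass(long_rows)

# Benchmark both approaches on a larger synthetic input (values are arbitrary).
big = [(i, v, i) for i in range(500) for v in ("a", "b", "c")]
t_scan = timeit.timeit(lambda: to_wide_scan(big), number=3)
t_pass = timeit.timeit(lambda: to_wide_single_pass(big), number=3)
print(f"scan: {t_scan:.4f}s  single pass: {t_pass:.4f}s")
```

Both functions return one record per id with one field per variable; the timings will vary by machine, which is why measuring rather than guessing is the useful habit here.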
Customer segmentation is a deceptively simple-sounding concept. Broadly speaking, the goal is to divide customers into groups that share certain characteristics. There are an almost-infinite number of characteristics upon which you could divide customers, however, and the optimal characteristics and analytic approach vary depending upon the business objective. This means that there is no single, correct way to perform customer segmentation.
In this post, I work through a practical example that, in my experience, closely mirrors the challenges of performing this kind of analysis with real data.
This post addresses a common data science task – comparing multiple models – and explores how you might do this when you’re running the models in R's caret package. We’ll work with the same data set and objective as the last post, which involved predicting which customers would respond to a marketing campaign, and build on that post by making one of the models we’re comparing a neural network. The other models I’m adding here are a random forest and a logistic regression.
Neural networks are a great analytic tool for generating predictions from existing data. They can detect complex, non-linear relationships in data (including interactions among predictors), can handle large datasets with many predictors, and often produce more accurate predictions than regression/logistic regression. As with random forests, they can be used for regression or classification.
For this post, I take on a classic classification challenge and seek to answer the question: which customers are most likely to respond to a marketing campaign?
On April 9th, voters in New Orleans rejected a proposal to increase funding for police and fire departments. Crime rates are a recurring source of concern in New Orleans, and this vote prompts the question: what is the relationship between police presence and crime in New Orleans?
The data for this post (both on crime rates and on the number of police officers) come from the Federal Bureau of Investigation's Uniform Crime Reporting (UCR) program.
This post builds on the last by introducing the caret package (short for Classification And REgression Training). Caret is a really nice wrapper for a variety of machine learning models, including random forests. It makes model tuning smooth and parallel processing a breeze.
Random Forests are among the most powerful predictive analytic tools. They leverage the considerable strengths of decision trees, including handling non-linear relationships, being robust to noisy data and outliers, and determining predictor importance for you. Unlike single decision trees, however, they don’t need to be pruned, are less prone to overfitting, and produce aggregated results that tend to be more accurate.
This post presents code to prepare data for a random forest, run the analysis, and examine the output.
The specific question I answer with these analyses is: what is the predicted percentage of loan principal that will have been repaid by the time the loan reaches maturity? I’m using publicly available, 2007-2011 data from the Lending Club for these analyses. You can obtain the data here.
Shiny by RStudio accomplishes an extraordinary and disruptive feat: it puts dissemination of interactive, analytic results in the hands of the analysts. In professional settings, there has historically been Business Intelligence (BI) software, with an interactive, user-friendly interface, separating those performing analyses in a program like R from business users. Shiny makes it possible to design that interactive, user-friendly interface in R itself, obviating the need to use an additional tool to make interactive, analytic results accessible to business users.
So, how would one take a Shiny app, such as the one we created in the previous post, and make it accessible to business users who probably don’t have R installed on their machines and may not be particularly technically savvy?
Guest post by data warehouse, Business Intelligence, and software architecture expert Michael Helms.
Shiny by RStudio is a really lovely, interactive way to present analyses to users. Conveniently, there's a free and open-source community version of the package. This post introduces code to create a complete, interactive Shiny app with employment data and forecasts for New Orleans, LA.
The US Department of Labor’s Bureau of Labor Statistics (BLS) website is a treasure trove of economic data. There are datasets on everything from the Consumer Price Index to how Americans spend their time. There’s so much there, in fact, that it can be a bit overwhelming to navigate.
This post provides a little guidance, based upon my experience using the site, and includes a function to pull BLS data directly into R.
Rendering graphics typically takes R some time, so if you’re going to be producing a large number of similar graphics, it makes sense to leverage R's parallel processing capabilities. However, if you’re looking to collect and return the graphics together in a sorted object – as we were in the previous post on animated choropleths – there’s a catch. R has to keep the whole object in random access memory (RAM) during parallel processing. As the number of graphics files increases, you risk exceeding the available RAM, which will cause parallel processing to slow dramatically (or crash). In contrast, a good, old-fashioned sequential for loop can write updates to the object to the global environment after each iteration, clearing RAM for the next iteration. Paradoxically, then, parallel processing can take longer than sequential processing in this situation. In the case of the animated choropleths in the previous post, parallel processing took 21 minutes, whereas sequential processing took 11 minutes.
This post presents code to combine the efficiency and speed of parallel processing with the RAM-clearing benefits of sequential processing when generating graphics.
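One common way to get both benefits is to work in batches: run each batch in parallel, write its results out, and release them before starting the next batch. The Python sketch below illustrates that general idea only; it is not the post's R code, and `render_frame`, `render_in_batches`, the batch size, and the file naming are all hypothetical. It uses a thread pool so it runs anywhere; for CPU-bound rendering, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def render_frame(i):
    """Stand-in for an expensive plotting call; returns the frame's bytes."""
    return f"frame {i}\n".encode()

def render_in_batches(n_frames, batch_size, out_dir, workers=4):
    """Render frames batch by batch, writing each batch to disk before the next."""
    paths = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for start in range(0, n_frames, batch_size):
            batch = range(start, min(start + batch_size, n_frames))
            # Parallel step: only one batch of results is held in memory.
            rendered = list(pool.map(render_frame, batch))
            # Sequential step: flush the batch to disk, then release it.
            for i, data in zip(batch, rendered):
                path = os.path.join(out_dir, f"frame_{i:04d}.txt")
                with open(path, "wb") as fh:
                    fh.write(data)
                paths.append(path)
            del rendered
    return paths

out_dir = tempfile.mkdtemp()
paths = render_in_batches(n_frames=10, batch_size=4, out_dir=out_dir)
```

Peak memory is bounded by the batch size rather than by the total number of graphics, so the batch size becomes the knob for trading speed against RAM.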
This post demonstrates how to map change in a variable over time in a geographic area, allowing the user to scroll through time and selectively view dates of interest. It produces an interactive choropleth map, as the last post did, but whereas the last post was interactive in the sense that the user could zoom in on a specific geographic area, this map is interactive in the sense that the user can ‘zoom in’ on a specific point in time.
This map: In the wake of Hurricane Katrina, multiple New Orleans committees generated plans to rebuild the city; in some cases, these plans involved shifting the city’s footprint to move citizens out of more topographically vulnerable areas. The sequence of maps produced here answers the question: how quickly did various New Orleans zip codes re-populate after Hurricane Katrina, and how does the city’s current address density relate to pre-Katrina levels?
This post produces an interactive map that features analysis output mapped onto the geographical region to which it applies. This specific map answers the question: which areas in and around New Orleans exhibited the most growth (or loss) in active addresses over the past two years?
If you've needed to perform the same sequence of tasks or analyses over multiple units, you've probably found for loops helpful. They aren't without their challenges, however - as the number of units increases, the processing time increases. For large data sets, the processing time associated with a sequential for loop can become so cumbersome and unwieldy as to be unworkable. Parallel processing is a really nice alternative in these situations. It makes use of your computer's multiple processing cores to run the for loop code simultaneously across your list of units. This post presents code to:
- Perform an analysis using a conventional for loop.
- Modify this code for parallel processing.
To illustrate these approaches, I'll be working with the New Orleans, LA Postal Service addresses data set from the past couple of posts. You can obtain the data set here, and code to quickly transform it for these analyses here.
The question we'll be looking to answer with these analyses is: which areas in and around New Orleans have exhibited the greatest growth in the past couple of years?
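The two steps listed above (a conventional for loop, then the same work spread across workers) can be sketched in a few lines. This is a Python illustration of the pattern rather than the post's R code; the area labels, the address counts, and the `percent_growth` function are made up for the example. A thread pool is used for portability; for CPU-heavy analyses, `concurrent.futures.ProcessPoolExecutor` exposes the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: active-address counts per area at two points in time.
address_counts = {
    "area_a": (1200, 1500),
    "area_b": (9800, 10045),
    "area_c": (7000, 8400),
    "area_d": (6400, 6080),
}

def percent_growth(area):
    """Return (area, percent change in active addresses)."""
    start, end = address_counts[area]
    return area, round(100.0 * (end - start) / start, 1)

areas = list(address_counts)

# 1. Conventional for loop: one unit at a time.
sequential = {}
for a in areas:
    key, growth = percent_growth(a)
    sequential[key] = growth

# 2. The same per-unit function mapped across a pool of workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = dict(pool.map(percent_growth, areas))

fastest_growing = max(parallel, key=parallel.get)
```

The key design point is that the loop body is factored into a function that depends only on its own unit, which is what lets the iterations run independently.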