The following couple of posts address a common data science challenge: when sourcing data over the internet or from disparate departments within an organization, it's often necessary to substantially reformat the data before analysis. The Excel table considered in these posts, available online courtesy of The Data Center, features monthly counts of active postal addresses, by zip code, in New Orleans during the decade after Hurricane Katrina. You can obtain it here.
This table looks lovely and very interpretable to the human eye in its current form, but it is not organized the way R needs it to be for analysis. Specifically:
- It's a time series dataset, but it's organized wide, rather than long, as R will need for many time series and graphing functions.
- In addition, as it's currently set up, R will think the date rows are two additional observations, independent from all other observations and no more or less important than the rows that follow .
- This wide organization is doubly-complicated because there are actually two rows with date information (one with year and one with month).
- In its current format, parish is a section header. Although a human reader can infer from this that the zip codes that appear below a named parish occur in that parish, R can't.
- As with the date rows, R will think the parish rows are additional observations. The parish information is not linked to each zip code in a way that R can interpret and analyze.
- Lastly, there are a couple of smaller quirks to resolve:
- The table begins and ends with some extra rows that contain notes about the dataset, not data. There are also some blank rows interspersed throughout the table.
- The active address numbers, which are beautifully-formatted with commas in MS Excel, are interpreted by R as character variables, not numbers, and therefore not ready for analysis.
The next post in the sequence examines the issue of transposing the data long while preserving the date information and converting the date rows to date variables. The third post examines how we might convert a section header to a variable. The last post brings everything together in place and cleans up a few lingering loose ends.