Web Scraping in R

This post covers web scraping in R, using a RESTful web service API in conjunction with R’s RCurl and XML packages.

This post is the first in a two-part series in which I’m looking to answer the question: which technical skills are most in-demand among data scientists? In this post, I collect data to answer this question by scraping job postings from the job board indeed.com. In the next post, I use text analysis to analyze those job postings and highlight the skills that are most frequently mentioned.

If you’re looking to replicate my analyses here, the first thing to do is set up a free publisher account at indeed.com, which will provide you with the publisher ID you’ll need to use their API and submit your queries to indeed.com from within R. Once you’ve done that, navigate to their “Job Search API” tab to get a sense of the parameters you can specify in a query. These include things like the job query itself, the job location, the number of days back to search, etc. For the purposes of this analysis, I query the term “data scientist,” specify full-time positions, and look for positions posted within the past 31 days. I do not specify the location of the position.
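As a rough sketch of how those parameters map onto the request URL (the publisher ID below is a placeholder, and the full set of parameters I actually pass appears in the code further down), the query string can be assembled like this:

# Hypothetical sketch: assemble the API query string from its parameters
# Replace the placeholder publisher ID with your own
publisher.id <- "########"
params <- c(publisher = publisher.id,
            format    = "xml",             # results returned as XML
            q         = "data+scientist",  # search term
            jt        = "fulltime",        # full-time positions only
            fromage   = "31",              # posted within the past 31 days
            limit     = "25",              # API maximum of 25 results per call
            start     = "0")               # offset into the result list
# (the actual queries below also pass userip, useragent, v, and a few others)
query.url <- paste0("http://api.indeed.com/ads/apisearch?",
                    paste(names(params), params, sep = "=", collapse = "&"))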

Collecting the actual job descriptions is a two-step process. The first step involves getting the indexed search results, and the second step involves using the index to obtain the actual job descriptions. (This mirrors the process for searching basically anything online – first you submit a search term and get a list of possible results, and then you have to click on an indexed result to view the page with the actual information on it.)

 

1. Obtain index of search results

The getURL() function in R's RCurl package allows us to submit our HTTP request, with the API-specific information, and get back the results.

The indeed.com API limits each query to 25 job postings. In order to assemble a reasonable sample size for our analysis, I run the query 4 times, stepping through the results in batches of 25 (start values of 0, 25, 50, and 75) to pull down 100 job postings. (An equivalent loop form is sketched after the four calls below.)

You can request that the results be returned as XML or JSON. The default is XML and I use that here.

# Use the Application Programming Interface (API) for indeed.com
# For "publisher=" enter your unique publisher ID number in the query below
library(RCurl)
search.1 <- getURL("http://api.indeed.com/ads/apisearch?publisher=########&format=xml&q=data+scientist&l=&sort=relevance&radius=25&st=%20employer&jt=fulltime&start=0&limit=25&fromage=31&filter=1&latlong=0&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2",
                   .opts=curlOptions(followlocation = TRUE))

search.2 <- getURL("http://api.indeed.com/ads/apisearch?publisher=########&format=xml&q=data+scientist&l=&sort=relevance&radius=25&st=%20employer&jt=fulltime&start=25&limit=25&fromage=31&filter=1&latlong=0&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2",
                   .opts=curlOptions(followlocation = TRUE))

search.3 <- getURL("http://api.indeed.com/ads/apisearch?publisher=########&format=xml&q=data+scientist&l=&sort=relevance&radius=25&st=%20employer&jt=fulltime&start=50&limit=25&fromage=31&filter=1&latlong=0&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2",
                   .opts=curlOptions(followlocation = TRUE))

search.4 <- getURL("http://api.indeed.com/ads/apisearch?publisher=########&format=xml&q=data+scientist&l=&sort=relevance&radius=25&st=%20employer&jt=fulltime&start=75&limit=25&fromage=31&filter=1&latlong=0&co=us&chnl=&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2",
                   .opts=curlOptions(followlocation = TRUE))
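Since these four calls differ only in their start value, the same requests can also be generated in a loop. Here is an equivalent sketch (again with the publisher ID masked):

# Equivalent loop form: only the "start" parameter changes across the requests
starts <- c(0, 25, 50, 75)
searches <- sapply(starts, function(s) {
  getURL(paste0("http://api.indeed.com/ads/apisearch?publisher=########",
                "&format=xml&q=data+scientist&l=&sort=relevance&radius=25",
                "&st=%20employer&jt=fulltime&start=", s,
                "&limit=25&fromage=31&filter=1&latlong=0&co=us&chnl=",
                "&userip=1.2.3.4&useragent=Mozilla/%2F4.0%28Firefox%29&v=2"),
         .opts=curlOptions(followlocation = TRUE))
})
# searches[1] holds the same XML as search.1, searches[2] as search.2, and so on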

Once we have the results, the xmlParse() function in the XML package deciphers their structure, so we can easily navigate to the relevant part of each search result. This gives us a clean tree structure where each job posting is a <result> that includes a <jobtitle>, <company>, <city> and <state>, <url>, etc.

library(XML)
xml.1 <- xmlParse(search.1)
xml.2 <- xmlParse(search.2)
xml.3 <- xmlParse(search.3)
xml.4 <- xmlParse(search.4)
print(xml.4)
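Other fields in the index can be pulled out the same way, which is a quick way to confirm the <result> structure described above. A small sketch using xpathSApply(), the simplifying (vector-returning) variant of xpathApply():

# Sketch: extract a few other fields from the first batch of parsed results
titles    <- xpathSApply(xml.1, "//response/results/result/jobtitle", xmlValue)
companies <- xpathSApply(xml.1, "//response/results/result/company", xmlValue)
cities    <- xpathSApply(xml.1, "//response/results/result/city", xmlValue)
head(data.frame(jobtitle = titles, company = companies, city = cities))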

We can now use the xpathApply() function to specify the path to the URLs and cleanly pluck them from the indexed results.

################################################################
# Obtain url for each discrete job posting from search results #
################################################################

# use xpathApply function to extract each url
# To return a vector instead of a list, unlist the output
urls.1 <- unlist(xpathApply(xml.1, path="//response/results/result/url", fun=xmlValue))
urls.2 <- unlist(xpathApply(xml.2, path="//response/results/result/url", fun=xmlValue))
urls.3 <- unlist(xpathApply(xml.3, path="//response/results/result/url", fun=xmlValue))
urls.4 <- unlist(xpathApply(xml.4, path="//response/results/result/url", fun=xmlValue))

urls <- c(urls.1, urls.2, urls.3, urls.4)
remove(urls.1, urls.2, urls.3, urls.4)

 

2. Use the URLs to obtain the actual job descriptions

Once we have the URLs, we can use getURL() again to pull down each detailed job description. I'm NOT using the indeed.com API for this second step: although the API does include an option for pulling job postings by <jobkey>, it only returns the <snippet> text for each posting, rather than the full job description. We'll want the detailed job description for our analysis, so I'm navigating to the URL for each separate job posting and scraping it myself.

getURL() will pull down a lot of additional text, including HTML tags and formatting, so we’ll parse the HTML just as we parsed the XML above. As before, the parsed version makes it easier to discern the structure of the data, and we can now see that in each listing the text of the actual job description sits inside <span id="job_summary" class="summary">. Using the xpathApply() function again, we collect just this part of the HTML file for our analysis.

In the code below, I use a for loop to iterate through the URLs for the 100 job postings, pull out the text of each job description, and save it in a vector I’ve created for this purpose, called “jobs”.

jobs <- vector()

for (i in 1:length(urls)) { # for each job posting

  print(i) # track progress through the loop

  url <- urls[i]

  # pull down the raw HTML for this job posting
  raw.html.job <- getURL(url, .opts=curlOptions(followlocation = TRUE))

  # parse the HTML so we can navigate its structure
  parsed.html.job <- htmlTreeParse(raw.html.job, useInternalNodes=TRUE)

  # extract just the job description text from the job_summary span
  job.desc <- unlist(xpathApply(parsed.html.job, path="//span[@id='job_summary']", fun=xmlValue))

  jobs[i] <- job.desc
  remove(url, raw.html.job, parsed.html.job, job.desc)

}

remove(i, urls)
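One caveat: a posting occasionally expires or redirects to a page without the job_summary span, in which case job.desc comes back NULL and the assignment above will throw an error. A hedged variant of that assignment, which keeps the slot in the vector but marks it as missing, could look like this:

# Sketch of a more defensive assignment inside the loop, in case no
# job_summary span is found on a given page
if (is.null(job.desc)) {
  jobs[i] <- NA_character_ # keep the slot but mark it missing
} else {
  jobs[i] <- job.desc
}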

We now have the text for 100 different job postings in a digestible format for text analysis. The next post addresses preprocessing the data, analyzing it, and visualizing the results.
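If you want to avoid re-running the scrape while working on that analysis, a quick sanity check and a save to disk can help (the file name here is just an example):

# Quick check and save-out of the scraped job descriptions
length(jobs)             # expect 100
substr(jobs[1], 1, 300)  # peek at the start of the first description
saveRDS(jobs, "indeed_job_descriptions.rds") # example file name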

Complete code for this post can be found on GitHub at: https://github.com/kc001/Blog_code/blob/master/2016.10%20Web%20scraping.R