These instructions are the basis for a hands-on class at NICAR 2012 in St. Louis, where I will be co-teaching with Chris Keller. First, Chris will break down how to figure out which markers come before and after the content you want to scrape. This involves understanding classes, IDs, HTML and CSS markup, etc.
Then, for my part, we’ll walk through a database of childcare facilities. We want to grab the capacity and phone number of each childcare facility. That way, we can see which facilities can house the most children, then follow up to see whether they actually are hosting the most children. (This is a totally hypothetical exercise; I’m not saying there is or isn’t a story here. This is just a way to investigate.)
1. Head to the Web page listing Abbotsford child day care facilities we are interested in.
2. We see that the page lists 30 facilities at a time, and you have to click Next to see more. But if we change Count=30 to Count=200 in the URL, we can see everything at once.
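If you ever need to script this kind of tweak, the Count parameter can be rewritten programmatically instead of by hand. Here is a minimal Python sketch, using a made-up example URL rather than the real HealthSpace address:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def set_count(url, count):
    """Return the same URL with its Count query parameter replaced."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    params["Count"] = [str(count)]          # overwrite Count=30 (or whatever it was)
    new_query = urlencode(params, doseq=True)
    return urlunparse(parts._replace(query=new_query))

# Hypothetical stand-in for the listing page URL.
url = "http://example.com/facilities?Count=30&Start=1"
print(set_count(url, 200))
# → http://example.com/facilities?Count=200&Start=1
```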
3. We get a minimal amount of information on this main page, but when we click on a specific facility, we can get more information.
4. First, we need to grab a list of URLs that we want to scrape. To do this, we’ll use Chrome’s Scraper extension. Install it, if you haven’t already. Right-click on one of the links and choose Scrape Similar. This will create a list that we can export to Google Docs and then grab.
5. What we have here are relative URLs: they are missing the “http://www.healthspace.ca” that should precede each one. So, bring the list of links (column B of the Scraper output) into Excel as Column A. Type the root URL (http://www.healthspace.ca) into Column B, and fill down. Concatenate the two in Column C with the formula =CONCATENATE(B2, A2), and fill down. THOSE are the URLs we copy into a text file: complete URLs that can be scraped.
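For those who prefer code to Excel, the same concatenation step can be done in a few lines of Python. This is just a sketch, with hypothetical relative paths standing in for the Scraper output:

```python
from urllib.parse import urljoin

ROOT = "http://www.healthspace.ca"

# Hypothetical relative paths like the ones Scraper exports.
relative_urls = ["/abbotsford/facility1", "/abbotsford/facility2"]

# urljoin glues each relative path onto the root, like CONCATENATE in Excel.
full_urls = [urljoin(ROOT, path) for path in relative_urls]
print(full_urls[0])
# → http://www.healthspace.ca/abbotsford/facility1
```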
6. Turn this into a text file that lists out the URLs. OutWit Hub can only export 100 rows of data at a time, so if we feed it fewer than 100 URLs at once, we can process our data in manageable batches.
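Splitting the URL list into sub-100 batches can also be scripted. A Python sketch, using generated stand-in URLs rather than the real list:

```python
def chunk(items, size=99):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Hypothetical list standing in for the URLs built in the previous step.
urls = [f"http://www.healthspace.ca/facility{i}" for i in range(250)]

batches = list(chunk(urls))
print(len(batches), [len(b) for b in batches])
# → 3 [99, 99, 52]
```

Each batch could then be written to its own text file (e.g. one file per batch) and fed to OutWit Hub in turn.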
8. Load up OutWit Hub. You can download it either as a Firefox extension or as a stand-alone app for Mac or Windows.
13. Now, we need to run the scraper on all the URLs we pulled. Use File…Open inside OutWit Hub and import your first text file.
18. Repeat steps 12-16 with your second data file. You’ll need to append one Excel file to the other by opening both files and copying/pasting the contents of one under the other.
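If the exports are saved as CSV rather than Excel, the copy/paste step can be scripted too. A sketch with hypothetical file names and columns (the sample files below are created only to demonstrate):

```python
import csv

def combine_csvs(paths, out_path):
    """Stack the rows of each CSV in `paths` into `out_path`, keeping one header row."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for i, path in enumerate(paths):
            with open(path, newline="") as f:
                rows = list(csv.reader(f))
            # Write the header only from the first file; skip it in the rest.
            writer.writerows(rows if i == 0 else rows[1:])

# Create two small sample exports so the sketch is runnable end to end.
for name, row in [("batch1.csv", ["Tiny Tots", "30", "604-555-0101"]),
                  ("batch2.csv", ["Sunshine Kids", "45", "604-555-0102"])]:
    with open(name, "w", newline="") as f:
        csv.writer(f).writerows([["name", "capacity", "phone"], row])

combine_csvs(["batch1.csv", "batch2.csv"], "combined.csv")
```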
19. And voila! Pretty scraped data file.