These instructions are the basis for a hands-on class at NICAR 2012 in St. Louis, where I will be co-teaching with Chris Keller. First, Chris will explain how to figure out what markers come before and after the content you want to scrape. This involves understanding classes, IDs, HTML and CSS markup, etc.

Then, for my part, we'll walk through a database of childcare facilities. We want to grab the capacity and phone numbers of the various childcare facilities. That way, we can see which facilities can house the most children, then follow up to see whether they actually are housing the most children. (This is a totally hypothetical exercise; I'm not saying that there is or isn't a story here. This is just a way to investigate.)

1. Head to the Web page listing Abbotsford child day care facilities we are interested in.

http://www.healthspace.ca/Clients/FHA/FHA_Website.nsf/CCFL-Child-List-All?OpenView&Count=30&Start=1&RestrictToCategory=60EB4B18BCE11E706D8DD005ADF96EBF

2. We see that the page lists 30 facilities, and you have to click Next to see more. But if we change Count=30 to Count=200 in the URL, we can see everything at once.

http://www.healthspace.ca/Clients/FHA/FHA_Website.nsf/CCFL-Child-List-All?OpenView&Count=200&Start=1&RestrictToCategory=60EB4B18BCE11E706D8DD005ADF96EBF
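
If you'd rather build that URL in code than edit the address bar by hand, here's a minimal Python sketch. The Count parameter and the category hash are copied straight from the URL above:

```python
# Build the listing URL with a larger Count so every facility shows at once
base = "http://www.healthspace.ca/Clients/FHA/FHA_Website.nsf/CCFL-Child-List-All"
category = "60EB4B18BCE11E706D8DD005ADF96EBF"
count = 200  # the page defaults to 30; 200 covers the whole list

url = f"{base}?OpenView&Count={count}&Start=1&RestrictToCategory={category}"
print(url)
```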

3. We get a minimal amount of information on this main page, but when we click on a specific facility, we can get more information.

http://www.healthspace.ca/Clients/FHA/FHA_Website.nsf/CCFL-FacilityHistory?OpenView&RestrictToCategory=9F39A12B9BCB6129882578E00057850D

4. First, we need to grab a list of URLs that we want to scrape. To do this, we'll use Chrome's Scraper extension. Install it, if you haven't already. Right-click on one of the links and choose Scrape Similar. This will create a list that we can export to Google Docs and then grab.
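
The Scraper extension does this point-and-click, but if you're comfortable with code, a rough Python equivalent using requests and BeautifulSoup might look like the sketch below. The assumption that every detail link contains "FacilityHistory" in its href comes from the facility URL in step 3; check the page source to confirm before relying on it.

```python
# A rough code equivalent of the Scrape Similar step.
# Requires: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

LIST_URL = ("http://www.healthspace.ca/Clients/FHA/FHA_Website.nsf/"
            "CCFL-Child-List-All?OpenView&Count=200&Start=1"
            "&RestrictToCategory=60EB4B18BCE11E706D8DD005ADF96EBF")

soup = BeautifulSoup(requests.get(LIST_URL).text, "html.parser")

# Keep every link whose href looks like a facility detail page
hrefs = [a["href"] for a in soup.find_all("a", href=True)
         if "FacilityHistory" in a["href"]]
print(len(hrefs), "facility links found")
```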

5. What we have here are relative URLs: they are missing the "http://www.healthspace.ca" that should precede each one. So, bring the list of links (column B of the Scraper output) into Excel, making it Column A. Type the root URL (http://www.healthspace.ca) into Column B, and fill down. Concatenate the two with this formula in Column C, "=CONCATENATE(B2, A2)", and fill down. THOSE are the URLs we copy into a text file: complete URLs that can be scraped.
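
In code, the standard library does the same join in one line. This sketch assumes the hrefs list from the earlier sketch and writes the finished URLs to a text file (the filename is just my choice):

```python
# Prepend the site root to each relative link, then save one URL per line
from urllib.parse import urljoin

ROOT = "http://www.healthspace.ca"
full_urls = [urljoin(ROOT, href) for href in hrefs]

with open("facility_urls.txt", "w") as f:
    f.write("\n".join(full_urls))
```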

6. Turn this into a text file that lists the URLs, one per line. OutWit Hub can only deal with 100 rows of exported data at once, so if we feed it no more than 100 URLs at a time, we can process our data in batches.

7. Cut the text file into multiple text files, with no more than 100 URLs per file.
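
Splitting by hand works fine; if you want to script it, here's a small sketch that chops the URL file from the previous sketch into numbered batch files of 100:

```python
# Split facility_urls.txt into batch files of at most 100 URLs each
with open("facility_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

BATCH = 100
for i in range(0, len(urls), BATCH):
    name = f"facility_urls_{i // BATCH + 1}.txt"
    with open(name, "w") as out:
        out.write("\n".join(urls[i:i + BATCH]))
    print("wrote", name)
```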

8. Load up OutWit Hub. You can download it either as a Firefox extension or as a stand-alone app for Mac or Windows.

9. Load up the first URL in the OutWit Hub viewer.

10. Figure out the before and after markers for each field we want to export, and copy and paste them into the scraper's fields.
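
Here's the same before/after idea expressed in Python: everything between two marker strings is the value you want. The markers below are hypothetical placeholders; lift the real ones from the facility page's HTML source. The URL file is the one we built earlier.

```python
# Pull the text that sits between a "before" marker and an "after" marker
import re
import requests

def between(html, before, after):
    """Return the first chunk of text found between two marker strings."""
    m = re.search(re.escape(before) + r"(.*?)" + re.escape(after), html, re.DOTALL)
    return m.group(1).strip() if m else None

# Fetch the first facility page from the URL list we built earlier
with open("facility_urls.txt") as f:
    page_html = requests.get(f.readline().strip()).text

# Hypothetical markers: replace with the real HTML surrounding each field
capacity = between(page_html, "Capacity:</td><td>", "</td>")
phone = between(page_html, "Phone:</td><td>", "</td>")
print(capacity, phone)
```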

11. Press Execute to test. You should see the scraper work on this page.

12. Save the scraper.

13. Now we need to run the scraper on all the URLs we pulled. Use File…Open inside OutWit Hub and import your first text file.

14. Click on the links tab to grab all the links on that page, which should be every URL in the text file.

15. Make sure the Scraped tab is set to Catch. If it's set to Empty, the list will be cleared every time a page loads, and you will overwrite each page's scrape.

16. Select all those links, right-click, choose Fast Scrape, then select whatever you named your scraper.

17. Watch your data fill in. When it's done, export the results to CSV or Excel.
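
For comparison, here's what the whole Fast Scrape loop looks like as a self-contained Python sketch: fetch each URL, apply the marker scrape, and write one CSV, with no 100-row batching needed. The markers are still the hypothetical ones from the earlier sketch, so adjust them before running.

```python
# Scrape every facility URL and write the results to a single CSV
import csv
import re
import time
import requests

def between(html, before, after):
    """Return the first chunk of text found between two marker strings."""
    m = re.search(re.escape(before) + r"(.*?)" + re.escape(after), html, re.DOTALL)
    return m.group(1).strip() if m else None

with open("facility_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

with open("facilities.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["url", "capacity", "phone"])
    for url in urls:
        html = requests.get(url).text
        writer.writerow([url,
                         between(html, "Capacity:</td><td>", "</td>"),
                         between(html, "Phone:</td><td>", "</td>")])
        time.sleep(1)  # pause between requests to be polite to the server
```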

18. Repeat steps 13-17 with your second text file. You'll need to append one Excel file to the other by opening both files and copying/pasting the contents of one under the other (or stitch the exports together in code, as in the sketch below).
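
If you exported each batch as CSV, combining them in code avoids the copy/paste step. The file names below are assumptions; use whatever you named your OutWit Hub exports:

```python
# Stitch two (or more) exported CSV batches into one file, keeping one header
import csv

batches = ["batch_1.csv", "batch_2.csv"]  # rename to match your exports

with open("combined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for index, path in enumerate(batches):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if index == 0:
            writer.writerow(rows[0])  # write the header once
        writer.writerows(rows[1:])    # then just the data rows
```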

19. And voila! Pretty scraped data file.