This document is a guide to exploring the power of Outwit Hub and Needlebase, two web tools that can help you scrape data from websites.

We’ll perform some exercises that showcase what these tools are capable of. This was originally written as supplementary material for a hands-on lab at NICAR 2011 in Raleigh, N.C.

Before you start: Open up Firefox, and download Outwit Hub. https://addons.mozilla.org/en-us/firefox/addon/outwit-hub/

Outwit – Automatic Detection

Grab some simple images

1. Let’s go to the Air Force’s photo site. These are photos you have the right to use, because they’re created by the government and in the public domain. http://www.af.mil/photos/mediagallery.asp?galleryID=9127&?id=-1&page=1&count=48

2. Open up the OutwitHub extension. Navigate to the images tab.

3. Look at the images. How could we clean this list?

(Deal with duplicates, and sort by image size to make sure we keep the larger version.)

4. Download elements from the table, using Export Selection As.

5. Download actual images using Download Selected Files or Download Selected Files in… (Choose the latter to stay organized.)
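
If you ever need to do the same cleanup outside Outwit, here is a minimal Python sketch of the idea from step 3: keep one copy of each image, preferring the largest. The field names (name, url, width, height) and the sample records are made up for illustration; they are not Outwit's export format, so adjust them to match whatever columns you actually exported.

```python
def keep_largest(images):
    """Keep one record per image name: the one with the greatest pixel area."""
    best = {}
    for img in images:
        area = img["width"] * img["height"]
        current = best.get(img["name"])
        if current is None or area > current["width"] * current["height"]:
            best[img["name"]] = img
    # Largest images first, so thumbnails that slipped through are easy to spot.
    return sorted(best.values(),
                  key=lambda i: i["width"] * i["height"],
                  reverse=True)


# Hypothetical sample data: a thumbnail and a full-size copy of the same photo.
sample = [
    {"name": "f16_takeoff", "url": "http://example.com/thumb/f16_takeoff.jpg",
     "width": 150, "height": 100},
    {"name": "f16_takeoff", "url": "http://example.com/full/f16_takeoff.jpg",
     "width": 3000, "height": 2000},
    {"name": "c130_landing", "url": "http://example.com/full/c130_landing.jpg",
     "width": 1200, "height": 800},
]

for img in keep_largest(sample):
    print(img["url"], img["width"], img["height"])
```

Matching duplicates by name is an assumption; on a real site you might have to match on part of the file name or URL instead.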

Grab a table

1. We’re going to take an online table. Sometimes, copying and pasting these tables into Excel works, but sometimes it doesn’t. As long as it’s a formatted HTML table, there’s a formula we can use. http://www.ftb.ca.gov/individuals/txdlnqnt.shtml

1a. What’s the easiest way to grab this info? Click on tables, just like with pictures, and it’s downloadable.
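
For comparison, this is roughly what "grab a formatted HTML table" looks like in code. It is a sketch using only Python's standard library, run against a tiny made-up table rather than the real page, and it assumes a simple table with no nested tables or merged cells.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample markup, not copied from any real site.
sample_html = """
<table>
  <tr><th>Name</th><th>Amount</th></tr>
  <tr><td>Example Taxpayer A</td><td>1,000.00</td></tr>
  <tr><td>Example Taxpayer B</td><td>250.50</td></tr>
</table>
"""

print(table_to_csv(sample_html))
```

The CSV text can be saved to a file and opened in Excel, which is about what Outwit's table export gives you with one click.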

Teach Outwit What You Want

1. In this exercise, we’ll go to the Senate directory, and we’re just looking for names and Web form addresses. We’d like to make a little tool to make it easy for users to contact their representatives.

2. Head here: http://www.senate.gov/general/contact_information/senators_cfm.cfm

3. Bring up OutwitHub and try to export a simple table the way we did the last two exercises. Did that work? No!  Why do we think it didn’t work?

4. We’ll have to show Outwit what we’re looking for. We want two fields (senator’s name and senator’s web form). What consistent markup comes before and after each of these fields?

5. Enter these consistent pieces of markup as before and after points for two fields in a new scraper (go to the Scrapers tab on the left to define a new one).

6. Save the scraper.

7. Click execute, and Outwit will move you over to the scraped tab, and display the data.  You then can export this data, or linked files, just like in the other examples.
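
The before/after idea in steps 4 and 5 can be sketched in a few lines of Python. The markup below is invented for illustration and is not the Senate page's actual HTML; the point is only that each field is whatever sits between a consistent "before" marker and a consistent "after" marker.

```python
import re


def extract_between(text, before, after):
    """Return every piece of text found between a before and an after marker."""
    # re.escape treats the markers as literal text; .*? stops at the first
    # "after" marker instead of running on to the last one.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    return [match.strip() for match in re.findall(pattern, text, flags=re.DOTALL)]


# Hypothetical markup with made-up names and URLs.
sample = """
<div class="name">Doe, Jane (X-ZZ)</div>
<div class="form"><a href="http://example.com/contact/doe">Contact</a></div>
<div class="name">Roe, Richard (Y-ZZ)</div>
<div class="form"><a href="http://example.com/contact/roe">Contact</a></div>
"""

names = extract_between(sample, '<div class="name">', "</div>")
forms = extract_between(sample, '<a href="', '"')

for name, form in zip(names, forms):
    print(name, form)
```

Pairing the two lists with zip assumes every record has both fields. If one senator were missing a web form, the rows would slip out of alignment, which is the same kind of error to watch for in Outwit's scraped tab.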

Needlebase

Compiling Data That Spans Multiple Pages: HRC’s Info on Same-Sex Marriage Laws

  1. Head to Needlebase.com, click on the login tab and borrow my account. It’s okay, it’s a temp password. (When you access this page after NICAR, at bit.ly/nicar11almostscrapinglab, these instructions will tell you to stop mooching off me and get your own account.) But today, I’ll be nice. Just today. (Username: meminkoff@gmail.com; Password: ghwi87tn)
  2. We’ll be pulling HRC summaries of same-sex marriage laws into a usable data set to eliminate unnecessary clicking. http://www.hrc.org/issues/marriage/marriage_laws.asp
  3. Scroll down to “Your Domains” and click add.
  4. Pick any name and description you like.  Please include your last name in the name, so we don’t duplicate our work.
  5. Lay out different fields as requested.  Look at a sample from the HRC pages to determine what the different fields should be.
  6. Fill in examples of your data types.
  7. Show Needle how fields are linked.
  8. Add a source.  In this case: http://www.hrc.org/issues/marriage/marriage_laws.asp
  9. Identify examples of various fields for Needle.  After you fill in a couple, tell it to guess.  If you like what you see, click Confirm All.  You can also designate “next page” links, if it needs to look through several pages, or a detail page, where it can find more information about a given record.  In our example, it’ll grab state names from this main page, and everything else from a detail page.
  10. When done, click “Done” in the upper-right hand corner.
  11. Back on the sources page, click “Collect Now” next to this domain.
  12. Click on “needlebase,” in upper left hand corner.  Now, click on a domain name to get to display of the data.  On the left side, click on what piece of information you see as the main field.  In this case, we want to see every state.
  13. Click on options in the upper-right to add columns.  Those can be data types you specified, or metadata that Needle offers.
  14. Go to the footer of the page for a nice variety of export options.
  15. Make sure to double check for errors, where Needle may have incorrectly identified information that corresponds to a certain data type.  You’ll have to go in to those pages and correct Needle, clicking on field names, and then pieces of data on the site.  This sort of cleanup is common with programmatic scraping too.  None of this is magic, but it can get you further, faster than copying/pasting by hand.  And gaining a little time back in our busy lives?  I think that’s something we could all use.
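
To make the "next page" and "detail page" ideas from step 9 concrete, here is a small Python sketch of the same pattern. It does not touch the network or use any Needlebase API: the "site" is a made-up dictionary, and every path, field name, and value in it is hypothetical. The last function mirrors the error check in step 15 by listing records whose detail data came back empty.

```python
# Hypothetical in-memory "site": listing pages link to detail pages and to
# the next listing page. All paths and values are invented for illustration.
PAGES = {
    "/list?page=1": {
        "records": [{"state": "State A", "detail": "/state-a"}],
        "next": "/list?page=2",
    },
    "/list?page=2": {
        "records": [{"state": "State B", "detail": "/state-b"},
                    {"state": "State C", "detail": "/state-c"}],
        "next": None,
    },
    "/state-a": {"summary": "Example summary A"},
    "/state-b": {"summary": "Example summary B"},
    # "/state-c" is deliberately missing, to simulate a failed detail page.
}


def collect(pages, start):
    """Walk listing pages via 'next' links, joining in each detail page."""
    rows = []
    seen = set()
    url = start
    while url is not None and url not in seen:
        seen.add(url)  # guard against a next link that loops back
        listing = pages[url]
        for record in listing["records"]:
            detail = pages.get(record["detail"], {})
            rows.append({"state": record["state"],
                         "summary": detail.get("summary")})
        url = listing["next"]
    return rows


def find_gaps(rows):
    """Return the states whose detail data is missing, for manual review."""
    return [row["state"] for row in rows if not row["summary"]]


rows = collect(PAGES, "/list?page=1")
for row in rows:
    print(row["state"], "|", row["summary"])
print("Needs checking:", find_gaps(rows))
```

The state name comes from the listing page and everything else from the detail page, the same split the exercise uses.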