Nothing too groundbreaking here if you’re a coder, but this was an interesting step in my development as one.
This weekend, I pounded away at the DocumentCloud API, experimenting with how we might integrate it into our continually-developing workflow at PBS News.
This taught me that a) the DC folks have some of the clearest documentation I’ve ever seen, b) the DC folks are really responsive to requests (they added created and modified dates to the API less than two hours after I asked), c) you shouldn’t make test documents public, especially now that anyone can search the public docs, and d) APIs are an awesome tool!
As data mungers, a lot of the time we deal with completely unstructured data (I’m in the middle of a MASSIVE project dealing with that right now), so it’s always a pleasure when there’s an API, or Application Programming Interface, which at least gives you a fairly consistent pattern to work with.
APIs are showing up all over journalism. My former colleagues at the LAT have one for LA neighborhoods, and both USA Today and the New York Times have several, covering beats from politics (NYT’s Congress API) to literature (USAT’s Bestsellers API). They come in formats that are much easier to parse programmatically than a lot of other stuff we have to deal with. And while it’s important that we structure our content as data, we should also pay attention to structuring the information we engage with as data, so we get back to the “reporting” part of database reporting. It’s something I’m trying to spend a lot more time on.
Fortunately, API skills serve you well whether you’re talking about newsroom content or external information, so it’s a useful skill to have.
Oh, and if you want to avoid the programming part altogether, my mentor Derek Willis can help you bring it directly into Excel. (How many entries do we think I got through without citing Derek? Not many. I’ll try again starting next post. Prediction: It won’t last.)
But even so, the overarching point is that using programming (my language of choice is Python) can help you, yes you, have the flexibility to pick information out of the API and bring it into a CSV, or a database, with the structure you impose. So, hopefully, you can start to imagine the fun I had with pulling in, displaying and remixing the documents we’ll be using to tell stories at PBS News.
Wait, there’s more! You know what’s even better? I’m part of an online journalism group that’s learning Python together. It’s a really great mixture, because I’m learning a lot from people who’ve been doing this much longer and better than I, but I’m also able to help people who are new to Python — where I was a year ago. And I love giving back to the community whenever I am able.
So, when one journalist asked how to get a JSON response from the Twitter API to go into a CSV, I was happy to mock something up, based on the DocCloud code I had put together, and share away. It occurs to me this might help you make the most of structured data from Twitter, Document Cloud, or any of the JSON-ish APIs out there, so I’ll post it here as well. Let me know how to make it better, and if you have any questions.
#To put this into a CSV, csv code adapted from this recipe
#(http://www.palewire.com/posts/2009/03/03/django-recipe-dump-your-queryset-out-as-a-csv-file/)
#of Ben Welsh at the LAT, who helped me do my first work with APIs:

# IMPORTS
#Make Python understand how to read things on the Internet
import urllib2
#Make Python understand the stuff in a page on the Internet is JSON
import json
#Make Python understand csvs
import csv
#Make Python know how to take a break so we don't hammer the API and exceed the rate limit
from time import sleep

#tell the computer where to put the CSV
outfile_path = '/path/to/my/file.csv'
#open it up, the w means we will write to it
writer = csv.writer(open(outfile_path, 'w'))
#create a list with headings for our columns
headers = ['user', 'created_at', 'tweet_text']
#write the row of headings to our CSV file
writer.writerow(headers)

# GET JSON AND PARSE IT INTO A DICTIONARY
#We need a loop because we have to do this for every JSON file we grab.
#Set a counter telling us how many times we've been through the loop;
#this is the first time, so we'll set it at 1
i = 1
#loop through pages of JSON returned; 100 is an arbitrary number
while i < 100:
    #print out what number loop we are on, which will make it easier
    #to track down problems when they appear
    print i
    #create the URL of the JSON file we want. We search for 'egypt', want English tweets,
    #and set the number of tweets per JSON file to the max of 100,
    #so we have to do as little looping as possible
    url = urllib2.Request('http://search.twitter.com/search.json?q=egypt&lang=en&rpp=100&page=' + str(i))
    #use the JSON library to turn this file into a Pythonic data structure
    parsed_json = json.load(urllib2.urlopen(url))
    #now you have a giant dictionary.
    #Type parsed_json at the interpreter prompt to get a better look at it.
    #You'll see the bulk of the content is contained inside the value
    #that goes with the key, or label, "results".
    #Refer to results as an index: just like my_list[1] refers to the second
    #item in a list, parsed_json['results'] refers to the values associated
    #with the key 'results'.
    print parsed_json
    #run through each item in results, and pull out pieces of that
    #dictionary, e.g. the text of the tweet
    for tweet in parsed_json['results']:
        #initialize the row
        row = []
        #add every 'cell' to the row list, identifying each item
        #just like an index in a list
        row.append(tweet['from_user'].encode('utf-8'))
        row.append(tweet['created_at'].encode('utf-8'))
        row.append(tweet['text'].encode('utf-8'))
        #once you have all the cells in there, write the row to your csv
        writer.writerow(row)
    #increment our loop counter; now we're on the next time through the loop
    i = i + 1
    #tell Python to rest for 5 seconds, so we don't exceed our rate limit
    sleep(5)
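If you want to play with the JSON-to-CSV core without hitting the network (or if you’re on Python 3, where urllib2 became urllib.request and print became a function), here’s a small self-contained sketch. The sample response below is made up for illustration, but it mimics the shape of the search results the loop above expects.

```python
import csv
import io
import json

# A made-up response with the same shape as the Twitter search JSON above
sample = '''{"results": [
    {"from_user": "reporter_a", "created_at": "Mon, 31 Jan 2011 12:00:00 +0000", "text": "Filing from Cairo"},
    {"from_user": "reporter_b", "created_at": "Mon, 31 Jan 2011 12:05:00 +0000", "text": "Crowds growing"}
]}'''

# Parse the JSON string into a Python dictionary
parsed_json = json.loads(sample)

# Write to an in-memory buffer here; for a real file,
# swap in open('file.csv', 'w', newline='')
outfile = io.StringIO()
writer = csv.writer(outfile)
writer.writerow(['user', 'created_at', 'tweet_text'])

# Same indexing idea as above: parsed_json['results'] is the list of tweet dictionaries
for tweet in parsed_json['results']:
    writer.writerow([tweet['from_user'], tweet['created_at'], tweet['text']])

print(outfile.getvalue())
```

The same pattern works for any JSON-ish API: parse the response into a dictionary, pick out the keys you care about, and write one row per record with whatever structure you impose.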