
Making the structured usable: Transform JSON into a CSV

Posted by Michelle Minkoff on Feb 1, 2011 in Blog, programming | 13 Comments

Nothing too groundbreaking here if you’re a coder, but this was an interesting step in my own coding development.

This weekend, I pounded away at the Document Cloud API, experimenting with how we might integrate it into our continually developing workflow at PBS News.

This taught me that a) the DC folks have some of the clearest documentation I’ve ever seen, b) the DC folks are really responsive to requests, as in they added created and modified dates to the API less than two hours after I requested them, c) don’t make test documents public, especially now that anyone can search the public docs, and d) APIs are an awesome tool!

As data mungers, we deal with completely unstructured data a lot of the time (I’m in the middle of a MASSIVE project dealing with that right now), so it’s always a pleasure when there’s an API, or “Application Programming Interface,” which at least returns data in a fairly consistent pattern.

APIs are showing up all over journalism. My former colleagues at the LAT have one for LA neighborhoods, and both USA Today and the New York Times have several, covering beats from politics (NYT’s Congress API) to literature (USAT’s Bestsellers API). They come in formats that are much easier to parse programmatically than a lot of the other stuff we have to deal with. And while it’s important that we structure our content as data, we should also pay attention to structuring the information we engage with as data, so we get back to the “reporting” part of database reporting. It’s something I’m trying to spend a lot more time on.

Fortunately, API skills serve you well whether you’re talking about newsroom content or external information, so they’re useful skills to have.

Oh, and if you want to avoid the programming part altogether, my mentor Derek Willis can help you bring it directly into Excel. (How many entries do we think I got through without citing Derek? Not many. I’ll try again starting next post. Prediction: It won’t last.)

But even so, the overarching point is that using programming (my language of choice is Python) can help you, yes you, have the flexibility to pick information out of the API and bring it into a CSV, or a database, with the structure you impose. So, hopefully, you can start to imagine the fun I had with pulling in, displaying and remixing the documents we’ll be using to tell stories at PBS News.

Wait, there’s more! You know what’s even better? I’m part of an online journalism group that’s learning Python together. It’s a really great mixture, because I’m learning a lot from people who’ve been doing this much longer and better than I, but I’m also able to help people who are new to Python — where I was a year ago. And I love giving back to the community whenever I am able.

So, when one journalist asked how to get a JSON response from the Twitter API to go into a CSV, I was happy to mock something up, based on the DocCloud code I had put together, and share away. It occurs to me this might help you make the most of structured data from Twitter, Document Cloud, or any of the JSON-ish APIs out there, so I’ll post it here as well. Let me know how to make it better, and if you have any questions.

#To put this into a CSV, csv code adapted from this recipe 
#(http://www.palewire.com/posts/2009/03/03/django-recipe-dump-your-queryset-out-as-a-csv-file/) 
#of Ben Welsh at the LAT, who helped me do my first work with APIs:
 
 
# IMPORTS
 
 
#Make Python understand how to read things on the Internet
import urllib2
 
#Make Python understand the stuff in a page on the Internet is JSON
import json
 
# Make Python understand csvs
import csv
 
# Make Python know how to take a break so we don't hammer API and exceed rate limit
from time import sleep
 
# tell computer where to put CSV
outfile_path='/path/to/my/file.csv'
 
# open it up; 'wb' means we will write to it (the b keeps the csv module from adding extra blank lines on Windows)
writer = csv.writer(open(outfile_path, 'wb'))
 
#create a list with headings for our columns, matching the cells we add to each row below
headers = ['user', 'created_at', 'tweet_text']
 
#write the row of headings to our CSV file
writer.writerow(headers)
 
 
# GET JSON AND PARSE IT INTO DICTIONARY
 
# We need a loop because we have to do this for every JSON file we grab
 
#set a counter telling us how many times we've gone through the loop, this is the first time, so we'll set it at 1
i=1
 
#loop through pages of JSON returned, 100 is an arbitrary number
while i<100:
    #print out what number loop we are on, which will make it easier to track down problems when they appear
    print i
    #create the URL of the JSON file we want. We search for 'egypt', want English tweets,
    #and set the number of tweets per JSON file to the max of 100, so we have to do as little looping as possible
    url = urllib2.Request('http://search.twitter.com/search.json?q=egypt&lang=en&rpp=100&page=' + str(i))
    #use the JSON library to turn this file into a Pythonic data structure
    parsed_json = json.load(urllib2.urlopen(url))
    #now you have a giant dictionary.
    #Type in parsed_json here to get a better look at this.
    #You'll see the bulk of the content is contained inside the value that goes with the key, or label, "results".
    #Refer to 'results' as an index. Just like list[1] refers to the second item in a list,
    #dict['results'] refers to the values associated with the key 'results'.
    print parsed_json
 
 
 
    #run through each item in results, and jump to an item in that dictionary, ex: the text of the tweet
    for tweet in parsed_json['results']:
        #initialize the row
        row = []
        #add every 'cell' to the row list, identifying the item just like an index in a list
        row.append(tweet['from_user'].encode('utf-8'))
        row.append(tweet['created_at'].encode('utf-8'))
        row.append(tweet['text'].encode('utf-8'))
        #once you have all the cells in there, write the row to your csv
        writer.writerow(row)
    #increment our loop counter, now we're on the next time through the loop
    i = i +1
    #tell Python to rest for 5 secs, so we don't exceed our rate limit
    sleep(5)
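
If you want to sanity-check the output, here’s a minimal sketch of a separate little script (assuming the same outfile_path as above) that reads the first few rows back in with the csv module:

#Quick check, run as its own script once the one above has finished
import csv

outfile_path = '/path/to/my/file.csv'

#print the header row plus the first five tweets, then stop
reader = csv.reader(open(outfile_path, 'r'))
for row_number, row in enumerate(reader):
    print row
    if row_number >= 5:
        break
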
  • Ross Huggett

    Good job Michelle. Exactly what I needed.

    [Reply]

  • Melfromalice

    thank you!

    [Reply]

  • ultraviolet

Excellent!! Very useful!

    [Reply]

  • Era

hi, I think your code is exactly what I need. I tried it, and it didn’t give me an error, but it also didn’t write to the CSV file. I was able to load the json file, but after that, nothing. I mean, even

    writer.writerow(headers) didn’t write anything to the CSV file. HELP please!

    [Reply]

    Michelle Minkoff Reply:

This sounds like there’s an issue with the path to your CSV file. Sometimes, this doesn’t throw an error, but just tries to write to a file that doesn’t exist. Double-check that your path is correct. If you still have issues, feel free to follow up by sending me your script and CSV file via email at meminkoff AT gmail DOT com, and I’ll see if I can diagnose what’s going on.
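
    One quick way to check that (a minimal sketch using Python’s os module and the outfile_path from the post) is to ask whether the folder part of the path actually exists before you open the file:

    import os
    outfile_path = '/path/to/my/file.csv'
    # True means the folder exists, so the script can create the CSV inside it
    print os.path.isdir(os.path.dirname(outfile_path))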

    [Reply]

    Era Reply:

Hi, I emailed you my code a couple of days ago. I don’t know if you got it, but I think I figured out how to work it. I was just wondering if you had an idea on a way to output the from_user values sorted by how many times a certain user tweeted about Egypt?

    [Reply]

  • memakiano

Excellent! Big thanks!!! I’ve been having problems with this for two days. I finally see the light! :-) There is a slight problem with some of the tweets, since they include a /n in their shortened URLs and Python thinks it’s a new line… any ideas on how to overcome this problem?

    [Reply]

    memakiano Reply:

    Actually, sorry, that’s not the reason for the line jumps. There must be a different reason I can’t figure out. Oh well! It was very helpful.

    [Reply]

    dirkmjk Reply:

    Could it be your problem is caused by newlines in user bios or tweet texts? That’s a problem I’ve run into; this is how I dealt with it:
    https://github.com/DIRKMJK/Twitter_FNV/blob/master/prepareCsv.py
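
    The basic idea (a minimal sketch, not the exact code from that script) is to swap the line breaks out of the text before it ever reaches writer.writerow():

    # replace line breaks with spaces so one tweet stays on one CSV row
    clean_text = tweet['text'].replace('\n', ' ').replace('\r', ' ')
    row.append(clean_text.encode('utf-8'))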

    [Reply]

  • Eduardo

    Hi Michelle!
I’m trying to use your code to parse JSON from Facebook.
    I’ve found that it works until I get a post that’s missing one of the elements.
    It’s likely we have a case of a tweet without the “from_user” tag. When that occurs, I get a “KeyError” and the code stops.

    Do you have any idea how to skip these tags, filling them in with NULL?

    Thank you a lot!!!

    [Reply]

    Michelle Minkoff Reply:

Add a line to the code telling it to skip a specific tag or key (for example, “from_user”) when it’s missing.

    Like so:
    if (results[i]["from_user"]) {from_user = results[i]["from_user"]}

    or some such. That says to only try to fill in that variable if the variable exists in the data. Apply a similar strategy to any keys causing you errors.
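
    In the Python from the post, a minimal sketch of the same idea uses the dictionary’s get() method, which falls back to a default value whenever the key is missing:

    # use the string 'NULL' whenever a post has no 'from_user' key
    from_user = tweet.get('from_user', 'NULL')
    row.append(from_user.encode('utf-8'))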

    Does that help? If not, feel free to email meminkoff AT gmail DOT com.

    [Reply]

  • amy

    Thanks for this great column. If you already have a JSON file from Twitter, what do I need to modify in your code?

    [Reply]

    Michelle Minkoff Reply:

Just replace this line "url = urllib2.Request('http://search.twitter.com/search.json?q=egypt&lang=en&rpp=100&page=' + str(i))"
    with the address or URL you went to, to grab that page from Twitter. If it’s just on your computer, you’ll need to put it on the Web somewhere, and then use that address instead. Everything else should be similar.
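
    If the file is already sitting on your own machine, another option (a minimal sketch, assuming the file is saved as tweets.json, a made-up name) is to skip urllib2 and hand the json module the local file directly:

    import json
    # load the JSON straight from disk instead of requesting it from Twitter
    parsed_json = json.load(open('tweets.json'))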

    [Reply]