
Ben Fry on visualizations and Processing

Posted on Jan 25, 2010 in Blog, class, data visualizations

I’ve been working with Processing for a few weeks now. Some people have asked me when I’m going to start delving into the data set for my final project. Good question. The answer is that I’m still trying to get some of these theoretical and technical tools down. I’ve been having a lot of fun exploring the Processing language and comparing it to my experience with ActionScript/Flash. I haven’t yet gotten to anything complicated enough to find work I’d want to do in one language that wouldn’t be feasible in the other, but I’ve heard that each has capabilities the other can’t replicate. Admittedly, I’ve been biased toward Flash in the past, having taken a course in it.

As I’ve been working with Processing, I’m beginning to comprehend its potential, and it’s opened up questions about visualizations.  And the language’s creator, Ben Fry, has created a wonderful book that’s guiding me through the language and some fundamental visualization concepts.  But perhaps more importantly, Fry is a gifted and fascinating data visualizer in his own right, having done everything from scientific visualizations to work for General Electric to illustrations and mastheads for mainstream newspapers and magazines and their web sites.  He very graciously responded to some questions of mine via email this week.

(UPDATE 1/26 1:03 p.m.: The summary is shortened, and my full email interview with Fry is now at the bottom of this post.)

Fry said there is work that he does that Flash is years away from being able to handle, such as the project that inspired this Nature cover.  And for much of his work, Flash has only recently developed to the point where it accommodates his needs.  “I think people should use whatever works for them, and I don’t particularly care if people use Processing or not. It’s a project that we give away for free, so I don’t have much to gain by having more users,” he said.

While some have the same concerns about data viz as some journalists have about citizen journalism — that the field is getting polluted by amateurs, and thus isn’t being practiced correctly — that’s not Fry’s belief.  “The more people that have access, the better,” he wrote.  “The field is all about communication, and how you reach wider audiences, so restricting creation to a selected few runs contrary to the definition of the field itself.”  Seems to me that applies to journalism as a whole as well.  In fact, I think diverse people practicing communication can only improve the field, and that applies to information gathering and disseminating across media.

But Fry comes down, along with the St. Petersburg Times’ news technologist Matt Waite (and myself, and a host of people who spend their days contemplating these issues), against the concept of dumping all the data on a web site on the theory that the more variables and data sets, the better. This is what Waite calls a “data ghetto.” Fry says that if you’re addressing that many variables, you’re probably not asking interesting enough questions of the data.

Work on a piece begins with basic experiments and visualizations that help Fry analyze the data. Sometimes he gets the data from scientists, sometimes he finds it himself, and it almost always needs cleaning. Then he makes an initial visualization, usually static, to help him get a sense of the data. That helps him find interesting patterns and numbers, and then he builds a piece that focuses on the aspects that intrigue him.
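Just to make that first step concrete, here is what a quick static pass might look like in Processing. This is my own minimal sketch, not Fry’s code, and the file name data.csv and its two numeric columns are assumptions for illustration:

float[] xs, ys;

void setup() {
  size(600, 400);
  // Read one "x,y" pair per line. (Hypothetical file, not Fry's data.)
  String[] rows = loadStrings("data.csv");
  xs = new float[rows.length];
  ys = new float[rows.length];
  for (int i = 0; i < rows.length; i++) {
    String[] cols = split(rows[i], ',');
    xs[i] = float(cols[0]);
    ys[i] = float(cols[1]);
  }
  background(255);
  noStroke();
  fill(0, 60);  // translucent marks, so overplotting shows as darker spots
  for (int i = 0; i < xs.length; i++) {
    float px = map(xs[i], min(xs), max(xs), 20, width - 20);
    float py = map(ys[i], min(ys), max(ys), height - 20, 20);  // flip y so larger values sit higher
    ellipse(px, py, 4, 4);
  }
  noLoop();  // a static piece: draw once and stop
}

Nothing fancy, but a plot like this is often enough to spot the clusters, outliers, or gaps worth building a real piece around.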

And, I would argue, you don’t need to be artistically inclined to see that.

Audience is key. If you are targeting people who aren’t used to computers, sometimes plain graphs are the best policy, even when more unusual visualizations might make sense. On the HapMap visualization, Fry wrote that he took out a few dozen features and took care to put clear buttons across the top. But if it’s just an analytical tool for yourself, it doesn’t matter what you do. No matter what, Fry wrote, it’s essential to keep the audience in mind.

Not a bad lesson to keep in mind with all the journalism we practice.

Ben’s responses to my questions are included below:

How do you get started brainstorming the design of a specific piece?

I generally try to do some basic experiments to see what’s in the data: usually a static piece that just reads and shows the information. Then once I have an understanding of what’s there, I try to build a better piece that focuses on what I’ve learned about the interesting features of the data.

There are some that argue the field of data viz is being muddied as it becomes more accessible now that more people have access to data and viz tools. Do you agree that this is an issue? Is data viz better left to a few with a specialty, open to the public, or something in the middle?

No, I think only experts who are scared (or just cranky) push for that sort of thing. We heard the same arguments about people doing graphic design in the 80s because suddenly office workers were able to choose their own fonts or print documents on a laser printer.

The more people that have access, the better. The field is all about communication, and how you reach wider audiences, so restricting creation to a selected few runs contrary to the definition of the field itself.

How do you represent flaws in your data to your user? At what point are issues too minuscule and irrelevant to matter to the end user?

This depends on the audience and the intent of the piece. To use two examples, if it’s a newspaper article, the flaws aren’t important because the image needs to only explain a concept or specific central point, not all of its subtleties. A scientific diagram, on the other hand, is likely all about understanding the subtleties and flaws, because the central point is already well-understood by the scientist(s).

Do you think it’s possible to try to represent too many variables at once?

Yes. There is a natural tendency toward wanting to show all of them. Usually it means you’re not asking interesting questions of the data, and are instead focusing on the data itself, not why you collected the data in the first place.

Do you set out to promote a specific point/advance an agenda in your data pieces? Or do you see yourself as an impartial displayer of information, giving the information to the user objectively?

No more or less than a writer does. So to that end, some pieces are more subjective (some say artistic), others are more objective (journalistic).

For a piece such as the Nature cover (http://benfry.com/hapmapcover/), please describe the general steps you go through to create such a piece. Also, how many people would you work with on something like this, and what are your various roles?

That project began with this work: http://benfry.com/isometricblocks/

At the bottom of the page you can see three images that were part of the process. It went from a series of (large) print pieces working with that type of data, to the interactive piece seen on that page. Then, a couple years later when the HapMap project was completed, we used that same code to represent a portion of the data from the HapMap project as the cover.

I worked informally with other scientists who first gave me access to the data, but all the design and development was done on my own.

When creating visualizations with data, are you usually crunching the information yourself, or getting pre-cleaned data from others? If you take it yourself, how do you vet it for accuracy? And if someone else cleans it, how do you vet them so you know that you can trust their work?

It’s rare to get data that’s already been cleaned, so there’s usually at least some work to be done in figuring out what’s there and filtering it.
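To give a flavor of that filtering step, here is a small sketch in Processing. The file measurements.tsv and its column layout are my invention for illustration, not anything from Fry’s projects:

// Keep only the rows that are complete enough to visualize.
// (Hypothetical file name and columns, for illustration only.)
String[] raw = loadStrings("measurements.tsv");
ArrayList<String[]> clean = new ArrayList<String[]>();

for (int i = 0; i < raw.length; i++) {
  String[] cols = split(raw[i], '\t');
  if (cols.length < 2) continue;               // malformed row
  if (trim(cols[1]).length() == 0) continue;   // missing value
  if (Float.isNaN(float(cols[1]))) continue;   // non-numeric value
  clean.add(cols);
}
println(clean.size() + " of " + raw.length + " rows kept");

Printing how many rows survive is a cheap way to notice when a filter is throwing away more than you expected.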

If things are inaccurate, you can generally tell as soon as you see an image (if not before), because it just looks wrong.

How would you respond to comments that Java is getting old, and less common, and thus Processing is not as strong a choice as Flash for data visualization? Do you think the different technologies are valid for different types of visualization? Generally, what do you see as the pros and cons of each?

As a practical matter, it’s only been very recently that Flash would even be viable for any of the work that I do. Flash is several years from being able to do the HapMap cover project that you mentioned above, for instance.

I think people should use whatever works for them, and I don’t particularly care if people use Processing or not. It’s a project that we give away for free, so I don’t have much to gain by having more users.

As an objective matter, Java has never been more common, so that particular statement is simply incorrect. But Sun has done a terrible job with marketing, so these silly memes about Java disappearing get started. That said, I’m not interested in defending Java or Sun.

I’m also not interested in paying exorbitant amounts of money to use Flash to develop my work. I think it’s funny when people make arguments for using tools that cost them a lot of money and lock them into a proprietary solution.

Put another way, if I could do everything in Flash that I could do in Processing, then I could save myself a lot of time and just use Flash for all my work instead of developing the Processing project. More likely, the work that I’d otherwise do in Flash I’ll be moving to JavaScript, so that I can still avoid being tied into a proprietary platform.

What strategies do you have for data visualizers to help the users easily navigate an interface? I see this as being especially key concerning data visualized in unusual ways (outside of typical graph form).

You really just focus on the audience. Who are they, and what makes sense for them? For the version of the isometricblocks piece (mentioned above) that I posted on the site, I removed a few dozen features and added a couple buttons across the top so that it was enough to get the idea across of how it was working. It’s still a complicated piece, but at least it makes more sense in the context of web visitors giving it a try.

But the whole point is just audience. If you’re making it for yourself, it doesn’t matter. If you’re making it for people who don’t use the computer, then you should probably use the graph anyway. It all depends.
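To make the “couple buttons across the top” idea concrete, here is a toy Processing sketch of my own (not taken from the isometricblocks code) that switches between two placeholder views:

// Two labeled buttons across the top toggle between views.
// (A toy sketch for illustration, not Fry's code.)
String[] labels = { "Overview", "Detail" };
int current = 0;  // index of the selected view

void setup() {
  size(600, 400);
}

void draw() {
  background(255);
  // Draw the buttons across the top, highlighting the active one.
  for (int i = 0; i < labels.length; i++) {
    fill(i == current ? 200 : 240);
    rect(10 + i * 110, 10, 100, 28);
    fill(0);
    textAlign(CENTER, CENTER);
    text(labels[i], 10 + i * 110 + 50, 24);
  }
  // Draw whichever view is selected (placeholders here).
  fill(0);
  textAlign(LEFT, BASELINE);
  if (current == 0) {
    text("overview view goes here", 20, 80);
  } else {
    text("detail view goes here", 20, 80);
  }
}

void mousePressed() {
  // Hit-test the buttons and switch views on click.
  for (int i = 0; i < labels.length; i++) {
    if (mouseX > 10 + i * 110 && mouseX < 10 + i * 110 + 100 &&
        mouseY > 10 && mouseY < 38) {
      current = i;
    }
  }
}

A row of plainly labeled buttons is about the least intimidating control a web visitor can meet, which seems to be exactly the point Fry is making about audience.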
