« « Lessons from an intermediate programmer-journalist (part 3 of 3)

Hello again — What constitutes “clean data”?

Posted on Apr 2, 2017 in Blog

They say better late than never, right? Do people still blog in 2017? Let’s hope so!

The last time I updated was about two years ago, and my, how things have changed. I do more work on data analysis than visualization these days, and I’m learning even more: new programming languages, ways to structure code, ways to work better with my colleagues, ways to tell stories, and new philosophies about what it means to be a coder and a journalist.

(Oh, I also fell off the radar while I battled stage four kidney failure, started an obsessive healthy eating and weight lifting habit, developed a blood disease anyway, landed on dialysis, got worked up for the second kidney transplant of my life, got said transplant from a fellow news nerd (because, of course that would happen!), recovered from the transplant and reimmersed myself in coding and DC life. More on all that is on my Facebook page and this other blog.)

I’ve made an official goal of sharing my knowledge here again, so share I shall, although maybe in shorter bursts. That’s probably a good thing for you, as well as me. Yes?

I know I’m better because I’m working on three projects simultaneously at work again (gosh, I love my job), and one thing I’ve noticed in all of them, and been wrestling with a lot, is the concept of “clean data”. In teaching, I often explained this as rows and columns having meaning, and wanting to stay organized, so things belong in separate boxes. After doing more ETL work (that’s extract, transform, load: the process of taking data given to you and putting it into a usable format, which has sometimes looked like a MySQL database and more recently an R data frame), I’ve found there’s more to it than that.

For one thing, clean data has a lot to do with like things being like. For example, if you want to compare dates in a data set, some dates may be written mm/dd/yyyy, like 05/24/1986. But you could also have 24-05-86, which uses hyphens rather than slashes, has a two-digit year instead of four digits, and switches the order of the month and the day. Without consistency, it’s hard to sort by the most recent date. There are tools to help with this standardization, which I am learning. In the past, I was willing to do this work but felt it wasn’t part of journalism. I’ve now accepted, and even enjoy, that getting to the analysis part is just as much journalism and just as important. A tech person outside of journalism recently commented to me that journalists work with less protected, more real-world examples, so handling lots of different use cases is both more important and a bigger challenge. Not something I had thought about before.
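To make the date problem concrete, here is a minimal sketch in Python (chosen just for illustration; it isn't one of the tools I mentioned) using only the standard library. The list of formats and the function name `normalize_date` are my own assumptions, and a real data set would likely need more formats than these two. Note that Python decides for itself which century a two-digit year belongs to, which may not match your data.

```python
from datetime import datetime

# Assumed list of formats we expect in the raw data; the first match wins.
KNOWN_FORMATS = [
    "%m/%d/%Y",  # mm/dd/yyyy, like 05/24/1986
    "%d-%m-%y",  # dd-mm-yy, like 24-05-86
]

def normalize_date(raw):
    """Return the date as a yyyy-mm-dd string, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

# Illustrative values only, not from a real data set.
raw_dates = ["05/24/1986", "24-05-86", "01/15/2017"]
clean_dates = [normalize_date(d) for d in raw_dates]

# Once every date is yyyy-mm-dd, a plain text sort is also a date sort.
latest_first = sorted(clean_dates, reverse=True)
print(latest_first)
```

Returning None for anything unrecognized, rather than guessing, makes it easy to pull out the leftovers and look at them by hand.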

Secondly, I’ve started thinking of clean data as having one column for each type of information. I’ve dealt with a few cases where information in multiple columns needs to be combined, or broken into separate rows or records. I’ve been using and reading about reshaping data, particularly in R. I’m starting to really like this language for its anticipation of common data concerns. I know I could handle a problem like this in Excel or Ruby (I tried before I discovered reshape), but R makes it a lot easier. And while I respect these data munging tasks, I still can’t help but appreciate getting through them more efficiently to get to the good stuff. Reshape is covered nicely in the third week of Coursera’s R Programming series, which I highly recommend.
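If “reshaping” sounds abstract, here is a rough sketch of the wide-to-long idea in plain Python (again just for illustration, not how R does it internally). The table, the column names and the `melt` helper are all hypothetical; the helper is only similar in spirit to the `melt` function in R’s reshape2 package.

```python
# Hypothetical wide table: one row per city, one column per year.
wide_rows = [
    {"city": "Washington", "2015": 10, "2016": 12},
    {"city": "Baltimore", "2015": 7, "2016": 9},
]

def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own row (wide to long)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on each new row, not melted
            long_rows.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return long_rows

long_rows = melt(wide_rows, "city", "year", "count")
for record in long_rows:
    print(record)
```

After this, “year” is a single column holding one type of information, instead of being spread across the column headers.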

If you’re wondering, my favorite R book is “R for Data Science” by Hadley Wickham (with Garrett Grolemund), of course. You can access it for free here. I’m working my way through it with some of my colleagues at work, but I’m happy to talk about it with anyone at any time.

It’s getting late here, so that’ll be all for now, but I hope to get back in the habit of sharing things I learn here, and my Twitter account is finally active again. I’m also seeking to get back into speaking at and attending conferences and workshops, so hopefully I’ll see you around here or in real life. Click “Contact Me” in the upper right, to, well, you know — I’ve had a great time meeting, chatting and learning with all of you, and dearly hope it continues. More soon!
