« « Fighting for my life: The largest battle I ever won

Hosting #wjchat — Finding the story in the data » »

What’s regex again? Why should journos care?

Posted by on Sep 6, 2010 in Blog, programming | 3 Comments

My days continue to putter along at the delightful dream job that is working at the LAT Data Desk. Some work is public-facing, a lot is internal to the LAT, but it’s always a learning experience. Recently, I had the opportunity (okay, was tasked with) getting some information out of our archives into a database structure. Requesting the 1000-page text file was the easy part. Figuring out how to get from “blob of text” to “neat items in data-friendly columns” is another story. Luckily, the unstructured text had some patterns. And patterns are a big part of what programming is all about.

You may have heard the term “regex” before. It’s short for regular expressions. And compared to the way we talk and write, they’re anything but regular. But given the choice between writing the “regex to rule them all” (as I’ve been calling it, in a not-so-veiled reference to Tolkien’s Lord of the Rings geek trilogy) and copying and pasting tens of thousands of records by hand, I’ll take the former any day.

So, if you want to get through a large, unstructured set of data quickly, and make sense of the nonsensical, regex is a great way to go. But your pattern had better be consistent. Computers don’t do well with exceptions, after all.

And if you’re a journo who doesn’t want to throw yourself into programming, here’s the one thing you should take away. If you have a document that could be a database (maybe there are some dollar amounts, and you’d like to know the biggest one and what category it’s associated with), why not go to your programmer and ask if it’s something regex can help solve? Worst case scenario: your programmer says it’s not the right fit.

Some of my favorite explanations of regex are:

  • Google’s tutorial in text and video
  • Better explanation than I can give is here on Ben Welsh’s site, as per usual
  • Simple syntax reference
  • Python/PHP regex tester – Plug your text and pattern in, and it’ll tell you if you messed up.  My best friend as of late (which says something about my social life, but I digress).

So, a mini foray into how this works. A poorly-written regex looks like this:

‘[0-9]{1}\)[ ]{0,20}([A-Z0-9\n ]{1}[\w\d\n\(\)\”\’\.\:\,\;\-” /]*[\#]{0,1}[\w\d\n\(\)\”\’\.\;\:\,\-” /]*\.{1}?[\”]{0,1})’

Stop screaming. A less frightening one:
‘\$([0-9]{1,2}[\.]{0,1}[0-9]{0,2})’

Both of those are hand-crafted, newbie coder at work examples.

The second one isn’t any more intelligible? That’s not better? Okay, let’s break it down.

Some simple tips:

  • A regular expression is a string of numbers, letters and special characters. You use your language of choice to parse it. (I’m partial to Python; access the parser by importing “re”. Easy enough to remember.)
  • Define the range of characters you are looking for in brackets. [0-9] means look for characters that are digits between 0 and 9. Do the same for letters A-Z or a-z, or specify specific letters: [t] means t is the only acceptable character.
  • After you define your range in brackets, follow it by saying how many times you are looking for such a character. {2} means it needs to appear 2 times to match, no more, no less. {0,2} means the characters can show up 0, 1, or 2 times. Include 0 in your frequency if you want to be more flexible, and allow this range to be optional.
  • Type . to match any character (by default, any character except a newline).
  • Type * in your frequency to include the character 0 or more times, + for one or more times.
  • If you only have one character in your range, you can leave the [] out. If that character should appear exactly once, you can leave the {} out, too.
  • If you want to define a start and end to your pattern, but not include that in what the computer returns, or spits back to you, enclose the part you care about in () and the rest will get junked.
  • If you are including a special character in your range, such as searching for all ., you need to escape it, or use the \ key in front of it.
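  • Escaping punctuation that doesn’t need to be escaped can’t hurt, at least in my experiments. (Letters are a different story: a backslash in front of some letters, as in \d or \w, gives them a special meaning.) But forget to escape a character, and you have hours of debugging ahead of you, because of a missing \. Yes, I speak from painful experience.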
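
To see a few of those tips in action, here’s a small sketch in Python. The sample sentence and the variable names are made up for illustration; the only real pieces are the calls to the re module.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Room 12 costs 7.50, room 345 costs 10."

# [0-9]{2}: exactly two digits in a row (this also finds "34" inside "345").
two_digits = re.findall(r'[0-9]{2}', text)

# [0-9]+: one or more digits, so each whole run of digits is one match.
digit_runs = re.findall(r'[0-9]+', text)

# \. is an escaped, literal dot. The parentheses keep only the digits
# in front of it; the dot itself gets junked.
before_dot = re.findall(r'([0-9]+)\.', text)

# An unescaped . stands for any single character (except a newline).
after_costs = re.findall(r'costs .', text)

print(two_digits)   # ['12', '50', '34', '10']
print(digit_runs)   # ['12', '7', '50', '345', '10']
print(before_dot)   # ['7', '10']
print(after_costs)  # ['costs 7', 'costs 1']
```

Compare the third and fourth patterns: the only difference between “a literal dot” and “any character at all” is that one backslash.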

There’s more, but that’s a good start. So, what does this mean?:

‘\$([0-9]{1,2}[\.]{0,1}[0-9]{0,2})’

Search for strings of characters that start with a $, are followed by one or two numerical digits, followed by 0 or 1 dots, followed by 0, 1, or 2 digits. What does that spell? A price that conforms to AP style. It’ll catch whole numbers ($25) or those with cents attached ($25.50) or prices below $10 with cents attached ($1.25). Pull your budget numbers out of that ghastly text file the city council gave you.
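
Here’s a quick way to check that reading of the pattern. The sample sentence is hypothetical, written only to cover the three cases just described.

```python
import re

# The price pattern from above, stored as a raw string.
price_pattern = r'\$([0-9]{1,2}[\.]{0,1}[0-9]{0,2})'

# Hypothetical sentence, made up to cover the three cases described above.
sample = "Tickets cost $25, lunch was $25.50 and coffee was $1.25 per cup."

# findall returns only the part inside the parentheses, so the $ is dropped.
prices = re.findall(price_pattern, sample)
print(prices)  # ['25', '25.50', '1.25']

# A bigger amount with cents, to see where the pattern runs out of room.
big = re.findall(price_pattern, "The repair cost $100.50 in total.")
print(big)  # ['100']
```

That second result is worth checking against your own data: because the pattern allows only one or two digits before the optional dot, $100.50 comes back as 100 and the cents are lost. If your file has larger amounts, the pattern would need loosening.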

How do I use that regex?

Okay, okay, good point. I’ll tell you how in Python. Let’s say we’re searching for all prices in a file.
Here’s some line by line commented code to help you through it, and save all the prices to a CSV spreadsheet.

#Import the regex and CSV modules
import re
import csv
#Open the file you're reading from
f = open('raw_file.txt', "r")
#Read the file into a variable, so you can access it
file_to_search = f.read()
#Close the file now that its contents are saved in the variable
f.close()
#Save your regex pattern into a variable.  Use the r before the opening quote to tell the computer it's a raw string, so Python leaves your backslashes alone instead of trying to interpret them as escape sequences.  As the Zen of Python (http://www.python.org/dev/peps/pep-0020/) says, "Explicit is better than implicit."
price_pattern = r'\$([0-9]{1,2}[\.]{0,1}[0-9]{0,2})'
#Create a list of all pieces of text in your file that match the pattern.  List the pattern as the first item in the parens (otherwise known as an argument), list the text you're searching in as the second argument, and add the third argument re.DOTALL if you want the . character, which signifies all characters, to include newlines -- this tripped me up at first.  (This particular pattern has no unescaped ., so re.DOTALL makes no difference here.)
price_list = re.findall(price_pattern, file_to_search, re.DOTALL)
#Open a CSV object, which you can save list items into.  The a means that every time you write a new row, you append the new rows of text, and avoid overwriting what's already in the file.  In Python 3, newline="" keeps the csv module from adding blank lines between rows.
writer = csv.writer(open("prices.csv", "a", newline=""))

To figure out the rest of how to write this to a CSV, modify teacher Ben’s recipe — it’s how I figured it out. After all, this is about regex, not CSV writing.
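
If you’d like a starting point anyway, here’s a minimal sketch of one way the writing step could go. This is not Ben’s recipe; the function name write_prices, the “price” column header and the sample list are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical list standing in for the price_list that re.findall returns.
price_list = ['25', '25.50', '1.25']

def write_prices(prices, out_file):
    """Write one price per row to an already-open file object."""
    writer = csv.writer(out_file)
    writer.writerow(['price'])      # header row (the column name is an assumption)
    for price in prices:
        writer.writerow([price])    # each row is a list, even with one column

# An in-memory file keeps this example self-contained. To write to disk,
# pass open("prices.csv", "a", newline="") instead.
buffer = io.StringIO()
write_prices(price_list, buffer)
print(buffer.getvalue())
```

One thing to watch if you append to a real file: the header row gets written again on every run, so you may want to write it only when the file is new.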

Or, do whatever you want with that list. You could sort it to find the highest or average price…possibilities really are endless.
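
For instance, a rough sketch of the highest-and-average idea might look like this. The list below is a hypothetical stand-in for whatever re.findall gave you; the matches come back as text, so they need converting to numbers first.

```python
# Hypothetical list standing in for the price_list that re.findall returns.
price_list = ['25', '25.50', '1.25']

# findall returns strings, so turn each one into a number before doing math.
prices = [float(p) for p in price_list]

highest = max(prices)
average = sum(prices) / len(prices)
ranked = sorted(prices, reverse=True)  # biggest first

print(highest)  # 25.5
print(average)  # 17.25
print(ranked)   # [25.5, 25.0, 1.25]
```

If the pattern found nothing, the list is empty and both max and the division will raise an error, so it’s worth checking that price_list has something in it first.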

Oh, and what went wrong with that first, horrifically long example I gave you? I didn’t know you could use . to signify all characters. You try specifying every single character that could possibly be used in a string of text. You’ll always end up forgetting at least one. The lesson there: the computer is better at catching all instances of something than you are. Don’t try to compete. Accept its strength and move on. Your strength: The computer can’t read the human language you’re parsing. So, point one for us.
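
To make that concrete, here’s a hedged sketch of the . approach. It is not a drop-in replacement for the long pattern: I’m assuming the goal is to grab numbered items that start with a digit and a closing paren and run to the next period, and the sample text is invented. It also shows why re.DOTALL matters once a newline sneaks into an item.

```python
import re

# Hypothetical two-item sample; note the newline inside the first item.
text = "1) FIRST ITEM spans\ntwo lines. 2) Second item."

# A digit, a closing paren, up to 20 spaces, then capture everything up to
# and including the next period. The ? after * keeps each match as short
# as possible, so it stops at the first period it reaches.
item_pattern = r'[0-9]\) {0,20}(.*?\.)'

# Without re.DOTALL, . stops at the newline, so the first item is missed.
without_dotall = re.findall(item_pattern, text)

# With re.DOTALL, . also matches newlines, so both items are found.
with_dotall = re.findall(item_pattern, text, re.DOTALL)

print(without_dotall)  # ['Second item.']
print(with_dotall)     # ['FIRST ITEM spans\ntwo lines.', 'Second item.']
```

Stopping at the first period is a simplification; text with abbreviations or decimal numbers inside an item would need a smarter ending.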

  • http://www.anthonydebarros.com Anthony DeBarros

    I’ve been scratching my head at some of the more complex regex examples found in urls.py files, so this is very helpful!

    Michelle Minkoff Reply:

    That’s how I started getting into regex, too. I got sick of creating URLs with trial and error, so I was sort of searching for a project like this.

    Some other common syntax I’ve seen in urls.py code includes \w and \d for word character and digit. [A-Za-z0-9_] == [\w] and [0-9] == [\d], if anyone else in the interwebs is wondering.

  • http://www.richardcornish.com Rich

    We had a saying at the Lawrence Journal-World when we were developing: “I had a problem, so I used regex.” “Now you have two problems.” ;)
