Computing Adventures and Phylogenetics: Regular Expressions

My first experience with regular expressions came from Python for Dummies. They looked particularly relevant to the task I was working on: scraping specific bits of information from fishbase. When my advisor, Peter Wainwright, first approached me about this, I didn't know where to begin, so I went to two people with experience in these sorts of tasks: Bob Thomson and Carl Boettiger.

Bob's suggestion was to download each individual fish's HTML file using a short bash script, then use something like Python or Perl to extract the relevant bits. With over 30,000 species, just downloading the HTML files took quite a long time. But with no background (at the time) in Python or Perl, I turned to Carl, who suggested using R. He quickly wrote a package, rfishbase, that allows you to access information from the XML files on fishbase through R. Although the XML files don't have all of the information available in the HTML files, they still have quite a lot.

The reason for this post, though, is a task my lab mate, Patrick, wanted to accomplish with data he had accessed using rfishbase. Given a character vector containing the information of interest, he wanted to extract all of the reference numbers within that vector. An element might look something like this:

"Occurs mainly over rocky and muddy bottoms. Uncommon around coral reefs. Usually rests on the bottom (Ref. 9710). Juveniles may be found in shallow water, but adults are usually taken from depths of 70-330 m (Ref. 13442). Reptant and natant decapods were the main food items throughout the year (Ref 59311). Feeds on a wide variety of fishes and invertebrates."

Given this, he would want the numbers 9710, 13442, and 59311. Even in this one example, you can see that they are not always consistent: the first two have a period while the third doesn't. And there are even things like this:

"Common species. Free-living. Assumed to feed on small invertebrates and fish (Ref. 4741, 34024). Feed on small bottom animals (Ref. 35388)."

Notice the many spaces before the first reference number, and that a single "Ref." is followed by two numbers. Or this:

"Occurs in various inshore habitats (Ref. 9800). Feeds on benthic invertebrates and fish (Ref. 11889). Also Ref. 43081."

This one doesn't even have parentheses around the last reference. So the first thing I did was find every instance of "Ref" followed by an optional period, any number of spaces, and a run of digits, commas, and spaces:

ref = regmatches(matches, gregexpr("Ref\\.? *[0-9 ,]*", matches))

where matches is the character vector with all of the information we're looking at. This is modified from here. Next, I removed all characters other than digits or commas and used strsplit to separate individual reference numbers.

refs = lapply(ref, function(x) unlist(strsplit(gsub("[^0-9,]", "", x), ",")))
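As a quick check, the two steps can be run on shortened versions of the three examples above. This is only a sketch: the sample strings are trimmed by hand, the run of spaces in the second one is typed in as an assumption about what the original text looked like, and it uses lapply rather than sapply, because sapply simplifies its result to a matrix when every element happens to contain the same number of references (as is the case here).

```r
# Three sample strings, shortened from the fishbase text quoted above.
matches <- c(
  "Usually rests on the bottom (Ref. 9710). Main food items (Ref 59311).",
  "Assumed to feed on small invertebrates and fish (Ref.     4741, 34024).",
  "Occurs in various inshore habitats (Ref. 9800). Also Ref. 43081."
)

# Step 1: pull out each "Ref" plus the digits, commas, and spaces after it.
ref <- regmatches(matches, gregexpr("Ref\\.? *[0-9 ,]*", matches))

# Step 2: drop everything except digits and commas, then split on commas.
# lapply always returns a list, one element per input string.
refs <- lapply(ref, function(x) unlist(strsplit(gsub("[^0-9,]", "", x), ",")))

print(ref[[2]])  # the raw match for the second string, spaces included
print(refs)      # the cleaned reference numbers, as character vectors
```

One thing to watch for: because the pattern allows zero digits after "Ref", a word such as "Refer" would also produce a match. It contributes no numbers after the second step, but a stricter pattern that requires at least one digit would avoid the empty match entirely.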

You end up with a list the same length as the original character vector, and each element is a character vector of the reference numbers found in the corresponding entry. From here, you can pull out the unique values to get the full set of references you need.
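Collecting the unique values might look something like the following. The refs list here is a hypothetical, hand-written stand-in for the real extraction result (one number is repeated on purpose, and one record has no references), and converting to integers is my own choice so that the result sorts numerically.

```r
# Hypothetical result of the extraction step: one character vector per record.
refs <- list(
  c("9710", "13442", "59311"),
  c("4741", "34024", "35388"),
  c("9800", "11889", "43081", "9710"),  # 9710 repeated on purpose
  character(0)                          # a record with no references
)

# Flatten the list, drop any empty strings, and keep each number once.
all_refs <- unlist(refs)
all_refs <- all_refs[nzchar(all_refs)]
unique_refs <- sort(unique(as.integer(all_refs)))

print(unique_refs)
```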

Wednesday, May 2, 2012
