Wednesday, May 16, 2012

SIMMAP Trees

Thanks to Liam Revell, we can now produce simulated discrete character mappings within R.  The function make.simmap uses ace, from the package ape, to fit the model to use for the simulations.  Thus, the proportion at which a character state appears at a node among many iterations of simulations should be roughly equivalent to the likelihood of that state as estimated in ace.  I have heard this from many people, but I wanted to be able to summarize the actual simulated states.  In order to do this, I took a look at the components of a SIMMAP tree.  I'll illustrate this with a mock example:

We can take a look at what components this object has:

> names(exampleSimmap)
[1] "edge"        "edge.length" "tip.label"   "Nnode"       "maps"       
[6] "mapped.edge"

The first components are the same as any phylo object.  So maps and mapped.edge are what make a SIMMAP tree special.  Let's take a look (the middle elements removed to save space):

> exampleSimmap$maps
[[1]]
        1         0 
0.0958356 0.3197055 
[[2]]
        0         1 
0.2371619 0.1659281 
[[3]]
         1 
0.01380231 
...
[[17]]
        1 
0.1164989 
[[18]]
        1         0 
0.1161601 0.1769559 

This is precisely what we need!  Each element of exampleSimmap$maps represents a single branch, and the values represents the length of time that branch spends in each state, in this case 0 or 1.  That means we can simply take each branch's starting value (whatever is the name of the first element of that branch), and that is the value at the node where the branch starts.  Let's see if we can find this.

> exampleSimmap$edge
      [,1] [,2]
 [1,]   11   12
 [2,]   12   19
 [3,]   19    1
 [4,]   19    2
 [5,]   12   13
 [6,]   13   17
 [7,]   17    3
 [8,]   17    4
 [9,]   13   18
[10,]   18    5
[11,]   18    6
[12,]   11   14
[13,]   14   15
[14,]   15    7
[15,]   15   16
[16,]   16    8
[17,]   16    9
[18,]   14   10

The element named edge gives us the starting and ending node for each of the 18 edges in our tree.  That means we can use to figure out which node corresponds with which state.  Here I've written a function that takes in a SIMMAP tree and returns a named vector where the values are the node states and the names are the nodes.

mappedNode = function(phy) {
  # phy must be a SIMMAP tree
  nodes = phy$edge[,1] # this gives us the starting node for all edges
  map = sapply(phy$maps, function(x) attr(x, "names")[1]) # this gives us the starting value of each branch
  df = unique(data.frame(nodes = nodes, map = map)) # here we're removing the repeated values as interior nodes will have multiple branches
  mapping = df$map
  names(mapping) = df$nodes # naming the vector with node names
  mapping
}

Now we can run this function over all of the simulated mappings:

mappings = sapply(simmapTrees, function(x) mappedNode(x))
# change the 0 and 1 to numeric
nummaps = as.data.frame(sapply(1:length(simmapTrees), function(x) as.numeric(mappings[,x])))
# make sure the row names correspond to node names
rownames(nummaps) = rownames(mappings)
# get the number of simulated trees with  node state of 1
sums = sapply(rownames(nummaps), function(x) sum(nummaps[x,]))
# change that to a frequency
freq = sums/length(simmapTrees)
# plot
plot(exampleTree, label.offset = .05)
nodelabels(pie = freq, cex = .65, node = as.numeric(names(freq)))
tiplabels(pie = discTrait, cex = .65)
Looks fairly reasonable.  Now let's estimate the ancestral character using ace:

MLACE = ace(discTrait, exampleTree, type = "discrete", model = "SYM")

plot(exampleTree, label.offset = .05)
nodelabels(pie = 1 - MLACE$lik.anc, cex = .65)
tiplabels(pie = discTrait, cex = .65)

Pretty close, but not identical.  Maybe if we do more simulated mappings (I did 1000 here), they will start to look more similar?

Wednesday, May 2, 2012

Regular Expressions

My first experience with regular expressions came from Python for Dummies.  It looked particularly relevant to the specific task I was working on, scraping specific bits of information from fishbase.  When my advisor, Peter Wainwright, first approached me about this, I didn't know where to begin, so I went to two people with experience in these sorts of tasks:  Bob Thomson and Carl Boettiger.

Bob's suggestions were to download each individual fish's HTML file using a short bash script, then use something like Python or Perl to extract relevant bits.  With over 30,000 species, just downloading the HTML files took quite a long time.  But with no background (at the time) in Python or Perl, I turned to Carl, who suggested using R.  He quickly wrote a package, rfishbase that allows you to access information from the XML files on fishbase through R.  Although the XML files don't have all of the information available on the HTML files, they still have quite a lot.

My reason for this post, though, is because of a task my lab mate, Patrick, wished to accomplish using the data he accessed using rfishbase.  Looking at a character vector containing information of interest, he wanted to get all of the reference numbers within that vector.  An example of an element might be something like this:

"Occurs mainly over rocky and muddy bottoms.  Uncommon around coral reefs.  Usually rests on the bottom (Ref. 9710).  Juveniles may be found in shallow water, but adults are usually taken from depths of 70-330 m (Ref. 13442).  Reptant and natant decapods were the main food items throughout the year (Ref 59311).  Feeds on a wide variety of fishes and invertebrates."

Given this, he would want the numbers 9710, 13442, and 59311.  Even in this one example, you can see that they are not always consistent:  the first two have a period while the third doesn't.  And there are even things like this:

"Common species.  Free-living.  Assumed to feed on small invertebrates and fish (Ref.  4741, 34024).  Feed on small bottom animals (Ref. 35388)."

Notice the many spaces before the first reference and having two numbers.  Or this:

"Occurs in various inshore habitats (Ref. 9800). Feeds on benthic invertebrates and fish (Ref. 11889). Also Ref. 43081."

This one doesn't even have parentheses around the last one.  So the first thing I did was to find every instance of "Ref" followed by an optional period, any number of spaces, and a run of any number of numbers, commas, and spaces.

ref = regmatches(matches, gregexpr("Ref\\.? *[0-9 ,]*", matches))

where matches is the character vector with all of the information we're looking at.  This is modified from here.  Next, I removed all characters other than digits or commas and used strsplit to separate individual reference numbers.

refs = sapply(ref, function(x) unlist(strsplit(gsub("[^0-9,]", "", x), ",")))

You end up with a list the same length as the original character vector, and every element is a character vector of all of the reference numbers.  From here, you can go in and find all of the unique values to find all of the references you need.

Tuesday, May 1, 2012

Geiger Bug

Big thanks to Luke Harmon for supplying me with updated code that fixes a small bug in the geiger package!  For those of you who have had issues with fitDiscrete in the past, you can email Luke for a fix.

The issue that arises is when you use either the symmetrical model (model = "SYM") or the all rates different model (model = "ARD").  You will get an error message looking like this after waiting for some time for the likelihood optimization to finish:

Finding the maximum likelihood solution
[0        50      100]
[....................]
Error in getQ(exp(out$par), nRateCats, model) :
  You must supply the correct number of rate categories.

But you don't actually control what goes into getQ through the arguments you give to fitDiscrete.  So if you encounter this issue using either of the above models, be sure to email Luke to get the updated code.  A revamped version of geiger is on the way, so hopefully this issue won't be around for much longer!

The one thing Luke cautioned me about using these two models is that they can both quickly become parameter-rich.  The number of rate parameters for the symmetric model is n * (n-1) / 2, while for the all rates different model, the number of rate parameters is n * (n-1) where n is the number of discrete states.  For example, if I have five discrete states, I would have ten parameters for the symmetric model and twenty parameters for the all rates different model.  I need a lot of data to be estimating so many parameters!

I did try out his updated code, and it works perfectly fine.  It still does take a while (I'm sure the speed depends on the size of your data set, the shape of your phylogeny, your computer's specs...), but I hear that there is already a faster version of the code if you ask Luke for it.  It would definitely be interesting to test out how much of a difference there is and what they changed to make it faster.

If all you are trying to estimate is a single rate model (model = "ER", which stands for equal rates), then there is no need to use an updated version of the function.  The old version will work just fine.  The problem with the original was that the calculation of the rate categories occurred twice:  once to get the number of rate categories, and another within getQ to test whether the number of rate categories was correct.  So for example, with five discrete states nRateCats in the above will equal ten for the symmetric model.  But you will get an error to supply the correct number of rate categories because within getQ, it compares the number of parameters to nRateCats * (nRateCats-1) / 2, which would compare ten to forty-five.