Computing Adventures and Phylogenetics: March 2012

Friday, March 16, 2012

Color-coding

I've been helping some fellow students here at the workshop with R. One of the things I did was to create a vector of colors to label the tiplabels. Basically, the structure of the data was a data frame with the ID that roughly corresponds with the tip labels in the first column and the actual species name in the second column. Here being a toy example:

> species
ID species
1 A123 speca
2 SA123 speca
3 c123 specb

However, the tip labels of the tree don't exactly correspond to the IDs:

> tree$tip.label
[1] "SA123" "a123" "c123"

Time for some regular expressions!

> species[,1] = gsub('^A(.*)', 'a\\1', species[,1])
> species[,1]
[1] "a123" "SA123" "c123"

I'll break down the gsub function call a little bit. The first argument to gsub is the pattern. The ^ means the beginning, then A. () names a group, and since it is the first set of (), this group is 1. Within 1, there is a .*. The . indicates any character, and the * indicates any number of those characters. So essentially, this pattern is looking for a capital A at the beginning of a string followed by anything or nothing.

The second argument is the replacement. The sub in gsub stands for substitution. So it will take the entire pattern and replace it with whatever the replacement is. If nothing matches, then nothing happens. The \\1 refers to the named group, 1. The \\ escapes, so that gsub will replace with whatever it finds in group 1, not with the number 1. So essentially, it will match any string that starts with a capital A followed by a group 1 that can be anything, and replace it with a lowercase a followed by the characters in group 1. The third argument is simply the vector you want to perform this on.

Now that I've gotten the ID to match the tip labels, I'm going to reorder the data frame to match the order of the tip labels.

> rownames(species) = species$ID

> species = species[tree$tip.label,]
> species
ID species
SA123 SA123 speca
a123 a123 speca
c123 c123 specb

Now, all I have to do is change species$species into a factor and convert it to a numeric!

Rerooting Trees

One of the things I did for today's presentation was to obtain a consensus tree from MrBayes. Since MrBayes estimates unrooted trees, I can choose the proper outgroup with which to root my tree. You can do this in programs like FigTree, but this is an R blog, after all. There is a package called geiger, which contains the necessary functions to do this.

> tree = read.nexus("Trees.nex")

I typically like to include multiple outgroups, and the easiest way to specify the outgroup might be by specifying the node number instead of the tip labels. If you plot the tree and the node labels like so

> plot(tree)
> nodelabels()

you can find the number of the root that specifies the clade of the outgroup. To root, you can use a function in ape, root.

> rerooted = root(chaettree, node = 99)

Nice and simple.

Playing with OUwie

After all, I am an evolutionary biologist. And I just finished my joint presentation with Katie Staab from George Washington University at the Bodega Phylogenetics Workshop. We did quite some fun analyses, one of which was playing with a very new package, OUwie, brought to us by Jeremy M. Beaulieu and Brian O'Meara. You can use OUwie to fit models of character evolution under multiple regimes.

Possible models:

model=BM1 # a Brownian motion model applied over the entire tree
model=BMS # a different Brownian rate parameter (sigma2) for each regime
model=OU1 # a single peak Ornstein-Uhlenbeck model across the entire tree
model=OUM # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) for each regime
model=OUMV # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) and different Brownian rate parameter (sigma2) for each regime
model=OUMA # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) and different strength of selection parameter (alpha) for each regime
model=OUMVA # a multi-peak Ornstein-Uhlenbeck model with different optima (theta), different Brownian rate parameter (sigma2), and different strength of selection parameter (alpha) for each regime

A great step forward for all of us using comparative methods. To start playing, all we need to specify are a tree, a data frame, and a model. Katie and I had already been playing around with make.simmap from the package phytools, so we had trees with stochastically mapped discrete traits. The data frame needs to be in the format where the first column is the species names, the second column is the discrete character, and the third column is the continuous character. Specifying that can be as easy as this:

> df1 = data.frame(speciesNames, discrete, PC1)

We had multiple mapped trees (as we should), so we took one to go for a spin, starting with the simplest model first:

> bm1Output = OUwie(mouthPsimmap[[1]], df1, model = "BM1", simmap.tree = TRUE)

Begin subplex optimization routine -- Starting value: 1 

Finished. Summarizing results. 

Finished. Performing diagnostic tests. 

Tada! This will give us our log likelihood, AIC score, AICc score, parameter estimates, root estimate, number of iterations executed, parameter standard errors, and eigenvalues. Now we can try it with different models:

> bmsOutput = OUwie(mouthPsimmap[[1]], df1, model = "BMS", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Finished. Summarizing results.
Finished. Performing diagnostic tests.

For this model, though, we need to be sure to iterate over the different mappings as this depends on the selective regime. We should get different results from different trees as we can see here:

> bmsOutput2 = OUwie(mouthPsimmap[[2]], df1, model = "BMS", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Finished. Summarizing results.
Finished. Performing diagnostic tests.
> bmsOutput$loglik
[1] -30.24609
> bmsOutput2$loglik
[1] -29.44005

We have a multiphylo of 500 mapped trees, so let's just sample 20 of them to see what's going on.

> sampledTrees = sample(mouthPsimmap, 20)

Now, we can just throw this into an apply to get a list of output for each of these twenty trees.

> bmsSampledOutput = lapply(sampledTrees, function(x) {
+ OUwie(x, df1, model = "BMS", simmap.tree = TRUE)
+ })

This might take a while, but once it's done, everything seems great! We can use some more apply functions to get out particular bits of the data we want.

Unfortunately, as soon as I add varying alpha into my model, I run into some problems:

> oumaOutput1 = OUwie(mouthPsimmap[[1]], df1, model = "OUMA", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Error in solve.default(V, res) :
system is computationally singular: reciprocal condition number = 1.27328e-103

I've tried a variety of debugging strategies, including changing the optimization algorithm in the OUwie function, but I can't seem to get anything that varies alpha to work, unfortunately. This is probably a function of having very little data, and thus no power to estimate parameters. Unfortunate, but I would love to try this again on a larger dataset!

Thursday, March 15, 2012

Location Data

For my first post, I'd like to highlight what prompted me to start this blog. I was approached by a fellow student of the Bodega Phylogenetics Workshop 2012 about a transforming her data frame into a format she could use. Her dataframe consists of two columns, with species names in the first column and location codes in the second column. There can be repeats in each of the two columns. Here is a toy example:

> df
species location
1 a 1A
2 b 1A
3 c 1B
4 a 2A
5 c 2B

She wants to turn this data frame into a data frame with each species as a row and each location as a column. For each column, 0 indicates absence and 1 indicates presence. Here is the function that does just that! It requires two vectors: species and location, where the order of the elements in the two vectors correspond with each other. Hopefully commented enough such that it's easy to follow

transformPresAbs = function(species, location) {

# find the unique species and locations
ulocations = unique(location)
uspecies = unique(species)

# initialize the matrix with the proper dimensions
# first start with a matrix to make specifying the dimensions easier
presAbs = matrix(NA, ncol = length(ulocations), nrow = length(uspecies))
# then name the columns and rows with the locations and species, respectively
colnames(presAbs) = levels(location)
rownames(presAbs) = levels(uspecies)

for(i in 1:nrow(presAbs)) { # here we're iterating over the rows (species)
presAbs[i,1:ncol(presAbs)] = sapply(colnames(presAbs), function(x) {
# here we're iterating over the columns (locations)
if(any(df[which(species == rownames(presAbs)[i]),] == x)) 1 # assign 1 if present
else 0
})
}

# show us what we got!
return(presAbs)

}