Monday, January 28, 2013

Making Phylogenies Pretty

I'd like anyone and everyone to chime in on this. I need to create high-quality figures of phylogenies. I'm currently using Mesquite to get the backbone of the tree, but could use advice on getting the tree to look good enough to be on a PowerPoint slide:


I'd love to take a look at what people consider to be good phylogeny figures and what programs people use to make things look good. But mostly, I just want some more ideas about how to make this look better...even from those of you who have no experience with phylogenies at all!

Edit: I'm collapsing all the edits into one. I've updated it such that all of the taxa are putatively monophyletic.

Thursday, January 24, 2013

MrBayes trouble

Sometimes I have to stop a MrBayes run before it completes all of the generations I originally planned. Usually this is no trouble, as I can easily use the values obtained before I stopped the run to continue the analysis (by using append=yes) or to simply summarize what I already have. But occasionally, I get a mysterious error:

Abort trap 6

This occurs when I have terminated a run early and just want to summarize the trees I already have. MrBayes has no trouble reading the tree files, but the error appears before it will analyze them and quits MrBayes entirely.
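For reference, the resume itself looks something like this in a MrBayes block. Checkpointing must have been on during the original run, and the filename and numbers here are just examples:

```
begin mrbayes;
    execute mydata.nex;
    mcmc ngen=50000000 checkfreq=100000 append=yes;
    sump;
    sumt;
end;
```

The sump and sumt commands at the end are what I use to summarize the parameter and tree files, whether or not the run finished.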

Of course, now that I want it to give me some trouble, it works:



Previously, it would say "Abort trap 6" instead of printing "General explanation:" and continuing. I guess I just need to let it run for longer and try again.

Running on a server

I use MrBayes quite a lot, and my analyses often take quite a long time to run. That's why I use the lab's server to run most of my MrBayes scripts. Its name is Aquarium:

It was recently upgraded to 8 cores, so I frequently have seven scripts running at once (I like to leave at least one for someone else who might need it). I also submit them as low priority so that someone with more pressing needs can use the high priority slots.
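If your server uses plain Unix process priorities rather than a job queue, a low-priority background submission can be sketched like this. The MrBayes command is stood in for by echo so the example runs anywhere:

```shell
# Launch at the lowest niceness (19) so higher-priority jobs win the CPU,
# with nohup so the run survives my SSH session ending.
# A real run would look like: nohup nice -n 19 mb myrun.nex > myrun.log 2>&1 &
nohup nice -n 19 echo "MrBayes run placeholder" > run.log 2>&1 &
wait
cat run.log
```
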

Learning how to use Aquarium was the first time I learned anything about SSH (secure shell) or SCP (secure copy), and back when I only had a PC, I had to learn PuTTY to connect to it. I've now started to get the hang of it, but here are a couple of tricks I've used recently to overcome some issues.
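For anyone else starting out, the basic shapes of the commands look like this. The hostname and paths are made up, and these obviously need a real server to run against:

```
ssh myname@aquarium.example.edu                      # log in to the server
scp myrun.nex myname@aquarium.example.edu:runs/      # upload an input file
scp myname@aquarium.example.edu:runs/myrun.nex.t .   # download a results file
```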

One thing I noticed was that some scripts were taking an abnormally long time! Months of waiting for 50 million generations. And most of the ones taking a long time were priors, where I replaced the sequence data with the appropriate number of question marks. For some reason, MrBayes, when estimating under the prior alone, will just stall. I know this because I used a little trick to see whether my file sizes were changing:

ls -l

This lists the files and folders with details, including file size. If the script is running correctly, most of my files (like my .p and .t files) will grow bigger and bigger. But even overnight, they were staying the same. Now came a trickier part:


When I use top, I can see the processes that are running. Since some of my scripts are stalled, I just want to stop them and download the files using SCP. To stop the correct process, I need to use the correct PID:

kill -9 [PID]

If I happen to remember the order in which I submitted my scripts, I can just use the time that it's been running to find the correct PID. But if I don't remember, I can use this:

ps aux | grep [command name]

This way, I know exactly what script corresponds with what PID!
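Since then I've also found pgrep, which rolls ps aux | grep into one command. A sketch, with a background sleep standing in for a stalled run:

```shell
sleep 30 &       # stand-in for a stalled MrBayes job
pid=$!           # PID of the job we just backgrounded
pgrep sleep      # lists PIDs of processes named "sleep" (cf. ps aux | grep)
kill -9 "$pid"   # stop the stalled job by PID
echo "killed process $pid"
```
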

Tuesday, January 22, 2013

Fun with vectorizing

I found myself staying up way too late and having way too much fun helping a friend with her R homework. One of the things we did was to vectorize a function! Her assignment was to create plots using the Ricker model, with varying values for r:
So the first thing we did was to write a function that would calculate the population size over a certain time period for a single value of r:

Ricker = function(r, K = 100, n0 = 50, tf = 100) {
    n = numeric(length = tf)
    n[1] = n0
    for(t in 1:(tf-1)) {
          # We have tf - 1 because when t = 1, we are actually
          # calculating n[2], so if we want to end on n[tf], t 
          # should end on tf - 1
         n[t+1] = n[t]*exp(r*(1-n[t]/K))
    }
    n
}

Changing this to work with a longer vector of r values was fairly simple:

vectorizedRicker = function(r, K = 100, n0 = 50, tf = 100) {
     NumofRs = length(r)
      # First, we're changing our output to be
      # a matrix instead of a vector
     n = matrix(0, NumofRs, tf)
      # We could vectorize over n0 as well, or any of the 
      # other parameters, if we set the number of rows for n
      # to be equal to the length of the longest vector
      # of parameter inputs
     n[,1] = n0
     for(t in 1:(tf-1)) {
           # Now, we just change the for loop to reflect that it
           # is working on entire columns of a matrix rather
           # than just a single element of a vector
          n[,t+1] = n[,t]*exp(r*(1 - n[,t]/K))
     }
     n
}
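A quick way to check that the vectorized version agrees with the original, and to make the assigned plots. The r values here are arbitrary examples:

```r
rs = c(0.5, 1.8, 2.6)
out = vectorizedRicker(rs)            # one row of n per value of r
all.equal(out[1, ], Ricker(rs[1]))    # each row should match the scalar version
matplot(t(out), type = "l", lty = 1,
        xlab = "time", ylab = "population size")
legend("topright", legend = paste("r =", rs), col = 1:3, lty = 1)
```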

rfishbase

I was given a task to use the data in FishBase to find percentages of different fish in different habitats. Carl Boettiger created an R package (rfishbase) to make the data very accessible in R, but unfortunately, some of the functions in the version on CRAN didn't work for me. Instead, I used the code he has on GitHub. So if anyone else is having trouble getting updateCache() to work or wants to modify the code to extract a piece of information that isn't being extracted in the original fishbase() function, this is the place to go! Another nice thing is that you can also use getData() to download smaller chunks of data if you don't want to overwhelm the server.
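Installing straight from GitHub is easy with devtools. The repository path below is my best guess at the current location, and the exact call depends on your devtools version, so check Carl's GitHub page first:

```r
# install.packages("devtools")         # if devtools isn't installed yet
library(devtools)
install_github("ropensci/rfishbase")   # repo path is an assumption on my part
library(rfishbase)
```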

Thursday, January 10, 2013

Moving on from base

In my last post, I gave myself the project of removing the hard-coded numbers from my stacked-histogram code. I succeeded, but it may have been more work than it was worth. Perhaps it's time for me to move on to ggplot2 or lattice?

cats = levels(dat$Type) # same as before

xaxis = round(range(scores$x)*5)/5 # the range, rounded to every .2

breaks = seq(xaxis[1]-.2, xaxis[2]+.2, .2) # the breaks to use for histograms

histData = sapply(cats, function(x) {
     hist(scores$x[which(dat$Type == x)],
          breaks = breaks)$counts
})
allhistData = do.call(rbind, list(histData))
barplot(t(allhistData), space = 0, ylab = "number of sequences", xlab = "LD")
zero = which(breaks==0)-2 # find the 0 on my x-axis. The plot starts at -1

marks = seq(floor(xaxis[1]), ceiling(xaxis[2])) # the tick-marks I want to use, every integer

ticks = numeric() # this vector will hold where I want the tick labels to appear

while(zero >= -1) {
  ticks[length(ticks)+1] = zero
  zero = zero - 5
} # goes down by five, which corresponds to one LD unit

while(length(ticks) < length(marks)) {
  ticks[length(ticks)+1] = max(ticks) + 5
} # go up by five, until I have as many locations as I have labels

axis(side = 1, at = ticks[order(ticks)], labels = marks) #I need to order the tick marks in sequential order

legend("topleft", legend = cats, text.col = c("black", "gray"))

This produces the same plot from the previous post.
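Since I wondered above about moving on to ggplot2: here is a sketch of what the same figure might look like there, with ggplot2 doing the binning and stacking itself. I haven't tuned this against my actual data, so treat it as a starting point:

```r
library(ggplot2)
# scores$x holds the LD scores; dat$Type is the grouping factor
df = data.frame(LD = scores$x, Type = dat$Type)
ggplot(df, aes(x = LD, fill = Type)) +
    geom_histogram(binwidth = 0.2, position = "stack") +
    ylab("number of sequences")
```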

Linear Discriminant Analysis

This follows fairly naturally from the PCA I did on the data in my previous post. I have a dataset of several quantitative variables that can be grouped by a categorical variable. This time, I am going to maximize the separation between the two groups to see what traits are important in determining group inclusion.

While looking for information about linear discriminant analyses (LDA), I came across a very informative site by Dr. Avril Coghlan. I used a couple of the functions available from that website, including groupStandardise and calcWithinGroupsVariance, to obtain meaningful coefficients to determine what traits are more informative about group membership.

The lda() function is in the MASS package. Since I only have two groups, I only have one discriminant axis, as the number of discriminant axes is equal to the number of groups minus one. Thus I won't get a nice scatterplot the way I did with my PCA. Instead, if I try to plot the output of lda(), I get this:


I wanted a stacked histogram instead, so I had to do a little more work. 
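For context, the scores used below come from the fitted lda() object. Roughly, with the model formula standing in for however the quantitative variables are actually specified:

```r
library(MASS)
fit = lda(Type ~ ., data = dat)   # Type against all other columns
scores = predict(fit)             # scores$x holds the discriminant scores
```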

cats = levels(dat$Type) # the categories I'm using
histData = sapply(cats, function(x) {
  hist(scores$x[which(dat$Type == x)],
       breaks = seq(-3, 2.2, .2))$counts
# bad me, I hard-coded the breaks. My next project can be to use the same breaks as the above plot
})
# this gives me a matrix of the counts for each interval for each type
allhistData = do.call(rbind, list(histData))
barplot(t(allhistData),
        space = 0,
        ylab = "number of sequences",
        xlab = "LD")
axis(side = 1,
     at = c(-1, 4, 9, 14, 19, 24),
     labels = c(-3, -2, -1, 0, 1, 2)) # more bad hard-coding
legend("topleft", legend = cats, text.col = c("black", "gray"))

And here is what I get:


It doesn't look great, but when did stacked histograms ever look good? It would probably look better with a lot more observations.

Tuesday, January 8, 2013

PCAs and Plotting

Principal Components Analysis, or PCA, is fairly straightforward using the princomp() function in R. But the data I have is divided by two factors: the type of response and the individual. I wanted to plot using different colors for the type of response and different plotting symbols for the individuals. Simple enough, but I also didn't want to need to modify the code if the number of types of responses or the number of individuals changed (the latter is probably more likely, but I still want my code to be as general as possible). Here is an example generated from random data:

The points themselves are fairly simple. I simply referred to the two variables I wanted in my col and pch arguments of plot().

col = as.numeric(dat[,x])+2

and

pch = dat$Individual+14

The reason for the +2 and +14 is to get the colors and plotting symbols I wanted. I could also assign the specific colors I want by doing this:

colors = c("green", "blue")
ptype = c(15, 16, 17)

Then add these as arguments to plot().

col = colors[as.numeric(dat[,x])]
pch = ptype[dat$Individual]

In this case, I would need to make sure that I have enough colors and plotting symbols that I don't run out.

The fun happens with the legend. For the text, I used this argument:

legend = c(levels(dat[,x]), unique(dat$Individual))

This remains flexible for any number of types or individuals. For the colors, I used a combination of seq() and rep() to get the numbers I wanted. If I didn't want my code to be general, I could simply use this:

text.col = c(3, 4, 1, 1, 1)

Instead, I used this:

text.col = c(seq(3, length(levels(dat[,x]))+2), rep(1, length(unique(dat$Individual))))

seq(3, length(levels(dat[,x]))+2) gives me a sequence of integers from three all the way to the number of types I have plus two (because I started with three instead of one). rep(1, length(unique(dat$Individual))) gives me 1 repeated for every unique individual.

Finally, the plotting symbols:

pch = c(rep(NA, length(levels(dat[,x]))), unique(dat$Individual)+14)

This is essentially the opposite of what I just did for the text color, except that I don't want any symbol next to the two types. Even though this looks like (and is) a lot more typing than simply hard-coding the appropriate numbers, this lets me use exactly the same code to make the figure even after I have doubled or tripled the amount of data I have.
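Putting the pieces together, the whole figure can be sketched on made-up data like this. The column names, the number of types and individuals, and the random data are all arbitrary examples:

```r
# Fake data: 2 response types, 3 individuals, 4 quantitative variables
set.seed(1)
dat = data.frame(Type = factor(rep(c("A", "B"), each = 15)),
                 Individual = rep(1:3, times = 10),
                 matrix(rnorm(120), ncol = 4))
x = "Type"                             # the column used for colors
pca = princomp(dat[, -(1:2)])          # PCA on the quantitative columns only
plot(pca$scores[, 1], pca$scores[, 2],
     col = as.numeric(dat[, x]) + 2,
     pch = dat$Individual + 14,
     xlab = "PC1", ylab = "PC2")
legend("topleft",
       legend = c(levels(dat[, x]), unique(dat$Individual)),
       text.col = c(seq(3, length(levels(dat[, x])) + 2),
                    rep(1, length(unique(dat$Individual)))),
       pch = c(rep(NA, length(levels(dat[, x]))),
               unique(dat$Individual) + 14))
```

Doubling the number of individuals or adding a third type only changes the data frame; the plotting and legend code stays exactly the same.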

Monday, January 7, 2013

jModelTest2

I had been having difficulties with jModelTest, and while searching for information about one of the problems I was having**, I discovered jModelTest2. I'm quite surprised I hadn't come across it before, since it has apparently been around in some form or another for over a year, yet I somehow failed to hear about it until now.

The first thing of note is that jModelTest2 can test all submodels of the GTR model. I am attempting to do this with my first trial to see how long it takes. Unlike jModelTest, jModelTest2 can use a hill-climbing algorithm to systematically find the best-fit models. Because the software I use (Phycas, MrBayes) doesn't accommodate all submodels, this may not be too relevant. However, BEAST or RevBayes can accommodate any submodel, so it may be useful in conjunction with those.

The second thing of note is support for high-performance computing, with up to eight threads. Unfortunately, I don't really know much about this. My computer has a single-core processor, but it still seems to benefit from the threading. I may do a test run to see if the time it takes is dramatically reduced. Also not quite sure if "iddle" is a typo.

All in all, jModelTest2 seems to be having no difficulties with the files that jModelTest threw errors for, so I take that as a good sign.

** In the old version of jModelTest, I was getting this error:
Notice "BIONJ-JC tree: null." This was the error I was trying to find more information on when I found out about jModelTest2.