I have been using my makebinary function without any issues, but now that I am working with a much larger tree, I am running into a problem with the length of my Newick tree.
My Newick tree is 4434 characters long. Add the makebinary function call, and the whole command ends up being 4448 characters long. I found out that RStudio truncates commands at 4096 characters. Fortunately, running the command in the normal R console works, although the console also wraps input at 4096 characters:
Note how there is now a new line (headed by the +) at the beginning of the second to last line.
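One way around the limit entirely is to read the tree from a file instead of pasting it at the prompt. A minimal sketch, assuming the Newick string is saved in a file called tree.txt and that makebinary() takes the tree as a character string:
newick = paste(readLines("tree.txt"), collapse = "")  # read the tree without touching the console limit
tree = makebinary(newick)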
Tuesday, April 16, 2013
Monday, March 25, 2013
R shortcuts
These might come up in a variety of contexts, but it's nice to know that R has a few built-in constants that save you from tediously creating a vector such as c("a", "b", "c", ...). You can use LETTERS or letters, depending on whether you want capital or lower-case letters. There are also abbreviated and full month names, month.abb and month.name, which are very handy for making plots. And finally, there is pi.
I recently used LETTERS and letters to create different plotting symbols for each individual data point so I could easily identify which point is what. I had more than 62 (26 capital letters, 26 lower-case letters, and 10 digits), so I had to use some symbols as well. I'm not sure if there's a more elegant way to do this, but if anyone knows of any, I'd love to hear about it.
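For the curious, here is roughly what that looks like as a sketch; the extra punctuation characters at the end are arbitrary choices of mine, not anything built in:
# build a pool of 62+ single-character plotting symbols
syms = c(LETTERS, letters, 0:9, "!", "#", "$", "%", "&")
x = rnorm(length(syms))
y = rnorm(length(syms))
plot(x, y, pch = syms)  # pch accepts single characters, so each point gets its own label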
Friday, March 22, 2013
More Python issues
This time completely unrelated to Phycas. I've been using Biopython to concatenate several nexus files that share some taxa into one nexus file. This is something I have been doing on my Mac, but I have been attempting this again on my new (to me) PC. After installing Python and Biopython, I tried to run the script with the appropriate modifications for the nexus files I am attempting to combine. I immediately ran into some issues.
getattr(self,'_'+line.command)(line.options)
AttributeError: 'Nexus' object has no attribute '_mitochondrial'']'
Considering my script (below) has no mention of mitochondrial, nor does Nexus.py in the Biopython package, this must be something about the nexus files themselves. When I look in the nexus file, it is divided into a taxa block and a character block. If I combine the two into a single data block and remove all of the information in the taxa block (other than the number of taxa, which I leave in the dimensions of the data block), everything works smoothly!
I don't know if this is related to the Mac/PC switch or whether it was just something about the newer version of Biopython, but I am glad it works.
from Bio.Nexus import Nexus
handles = [open('COI.nex', 'r'), open('cytb.nex','r'), open('RAG1.nex','r')]
nexi = [(handle.name, Nexus.Nexus(handle)) for handle in handles]
combined = Nexus.combine(nexi)
combined.write_nexus_data(filename='COMBINED.nex')
Thursday, March 21, 2013
Phycas woes (semi-)resolved
Just a little while ago, I wrote about getting nan for the number I need in my calculation for the marginal likelihood of a partitioning scheme. I have managed to dissect the Phycas package contents enough to obtain the number I need.
One of the first things I figured out was how to change the working directory in Python:
import os
os.chdir("path")
Simple enough. Next, I needed to find the inner workings of the function sump(). From my rummaging last time, I knew that the files SumP.py and SumPImpl.py would be important in my search.
In SumPImpl.py, I found the function I needed to produce the marginal likelihood estimate: gss(). gss() is a function defined within the class ParamSummarizer(). And ParamSummarizer() is called within a function of class SumP(), found in SumP.py. Here is the relevant code from SumP.py:
from phycas.Phycas.SumPImpl import ParamSummarizer

class SumP(PhycasCommand):
    ......
    def __call__(self, **kwargs):
        self.set(**kwargs)
        c = copy.deepcopy(self)
        param_summarizer = ParamSummarizer(c)
        param_summarizer.run()
So it seems that what I want is to create a ParamSummarizer object of my own. But what do I put in the parentheses? There is a mysterious entity here called self, which can be found throughout both scripts. While I still don't fully understand how it works, it is needed to call functions within the same class. What worked for me was this:
a = ParamSummarizer(sump)
Now, if I run a.run() or any other function defined within ParamSummarizer, it acts the way I expect it to, as long as I have previously specified the file to work with:
sump.file = "work.p"
a.run()
Autocorrelations (lag 1):
beta -> 1.00000 0.90000 0.80000 0.70000 0.60000 0.50000 0.40000 0.30000 0.20000 0.10000 0.00000
lnL 0.96409 0.36369 0.38353 0.34894 0.40534 0.30339 0.30469 0.32123 0.29938 0.23546 nan
lnPrior 0.86502 0.77219 0.74660 0.69551 0.65932 0.59464 0.55650 0.50884 0.45802 0.35500 0.01570
Error: (34, 'Result too large')
Well, this was what was happening earlier (errno 34 is ERANGE, which suggests that a number got too large to represent, presumably inside math.exp()). How might I be able to get the exact value I want? The function that produces the marginal likelihood estimates in our case is gss(). It requires three arguments: self, betas, and p. self is given when the function is called, so let's focus on betas and p.
I found what betas and p are by looking at the function that calls gss(), marginal_likelihood(). But of course, marginal_likelihood() also requires some arguments, so I had to look at the function that calls marginal_likelihood() as well: handleFile(). Finally, a function where all we need to know is the file name:
def handleFile(self, fn):
    burnin = self.opts.burnin
    lines = open(fn, 'r').readlines()
    if len(lines) < 3 + burnin:
        self.output("File '%s' does not look like a parameter file (too few lines)")
    else:
        headers = lines[1].split()
        if headers[1] == 'beta':
            try:
                self.marginal_likelihood(headers, lines, burnin)
            except InvalidNumberOfColumnsError, e:
                print e
        else:
            self.std_summary(headers, lines, burnin)
I decided to go step by step to make sure that I got the right variables in the right places. First, the code in handleFile():
lines = open("work.p", 'r').readlines()
headers = lines[1].split(
Followed by code in marginal_likelihood():
betas = []
params = {}
for h in headers:
    params[h] = {}
row_start = 2
curr_beta = None
for i,line in enumerate(lines[row_start:]):
    parts = line.split()
    beta = float(parts[1])
    if curr_beta is None or curr_beta != beta:
        curr_beta = beta
        betas.append(beta)
        for h,x in zip(headers,parts):
            params[h][beta] = [float(x)]
    else:
        for h,x in zip(headers,parts):
            params[h][beta].append(float(x))
This gives us reasonable values for betas and params, which are going to be our betas and p going into gss():
a.gss(betas, params)
b_(k-1) beta_incr n lnRk lnR(cum)
0.900 0.100 10000 -2320.508943 -2320.508943
0.800 0.100 10000 -2321.152357 -4641.661300
0.700 0.100 10000 -2320.351828 -6962.013128
0.600 0.100 10000 -2320.291323 -9282.304451
0.500 0.100 10000 -2320.221322 -11602.525773
0.400 0.100 10000 -2321.118544 -13923.644317
0.300 0.100 10000 -2322.397532 -16246.041849
0.200 0.100 10000 -2324.258373 -18570.300222
0.100 0.100 10000 -2327.929248 -20898.229469
0.000 0.100 10000 nan nan
nan Generalized Stepping Stone method
So this is still giving us nan. Let's step through gss() and see why this is happening:
likes = p['lnL']
priors = p['lnPrior']
working_priors = p['lnRefDens']
lnR = 0.0
nbetas = len(betas)
This is all of the code before the big for-loop that takes us through each of the beta values. I'm going to change the for loop so that it ends just before the iteration that gives us nan by changing the ending point from nbetas to nbetas-1:
for i in range(1,nbetas-1):
    blarger = betas[i - 1]
    bsmaller = betas[i]
    beta_incr = blarger - bsmaller
    loglikes = likes[bsmaller]
    logpriors = priors[bsmaller]
    logwpriors = working_priors[bsmaller]
    n = len(loglikes)
    etak = max([(lnL + lnp - lnwp) for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors)])
    tmp = 0.0
    for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
        log_term = beta_incr*(lnL + lnp - lnwp - etak)
        tmp += math.exp(log_term)
    tmp /= float(n)
    lnRk = beta_incr*etak + math.log(tmp)
    lnR += lnRk
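In equation form, each pass through this loop computes the following (reconstructed from the code above, so read it as my interpretation rather than the official formula):

\[
\ln R_k = \Delta\beta_k\,\eta_k + \ln\!\left(\frac{1}{n}\sum_{j=1}^{n} e^{\,\Delta\beta_k(\ell_j - \eta_k)}\right),
\qquad \ell_j = \ln L_j + \ln p_j - \ln \pi^{*}_j,
\qquad \eta_k = \max_j \ell_j,
\]

where \(\pi^{*}\) is the working prior (the lnRefDens column). Subtracting the maximum \(\eta_k\) before exponentiating is the standard log-sum-exp trick to keep math.exp() from overflowing, which is presumably exactly what the "Result too large" error above was.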
Now, as I step through the code again using i = nbetas - 1, I see where the error is coming from:
for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
    log_term = beta_incr*(lnL + lnp - lnwp - etak)
    tmp += math.exp(log_term)
During some iterations, log_term is nan, and looking through my work.p file, I see that this is because there are nans and infs in the parameter file. What if I could just get Python to ignore these lines and use everything else? To do this, I imported numpy and used its isnan() function:
for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
    log_term = beta_incr*(lnL + lnp - lnwp - etak)
    if numpy.isnan(log_term):
        n -= 1
    else:
        tmp += math.exp(log_term)
Wednesday, March 13, 2013
for loops vs. sapply
Loops are very useful for doing the same (or similar) things multiple times. Unfortunately, in R, loops can be clunky and, if written carelessly, slow.
For loops are perhaps more intuitive than sapply because the result you get is the same as if you ran the code within the for loop multiple times. What do I mean?
> x = numeric(10)
> y = numeric(10)
> z = for(i in 1:10) {
+ y[i] = i
+ x[i] = i*2
+ x[i]
+ }
> x
[1] 2 4 6 8 10 12 14 16 18 20
> y
[1] 1 2 3 4 5 6 7 8 9 10
> z
NULL
So the code within the for loop actually changes what is stored in x and y, but it does not return anything itself. Thus, z is NULL.
Let's use very similar code, except using sapply:
> x = numeric(10)
> y = numeric(10)
> z = sapply(1:10, function(i){
+ y[i] = i
+ x[i] = i*2
+ x[i]
+ })
> x
[1] 0 0 0 0 0 0 0 0 0 0
> y
[1] 0 0 0 0 0 0 0 0 0 0
> z
[1] 2 4 6 8 10 12 14 16 18 20
Wait, why are x and y still full of 0s? This happens because assignments made within sapply do not affect the global environment, so changing y[i] within sapply does not change the vector y itself; it stays the vector of 0s it was initialized as. Because of this, one iteration of an sapply cannot depend on another iteration--i.e., we cannot calculate x based on what x was in a previous iteration. This is in direct contrast to a for loop: because its changes happen in the global environment, we can use a previous iteration to determine the current one, as in this example:
> x = numeric(10)
> y = numeric(10)
> z = for(i in 2:10) {
+ y[i] = i
+ x[i] = x[i-1]+y[i-1]
+ }
> x
[1] 0 0 2 5 9 14 20 27 35 44
> y
[1] 0 2 3 4 5 6 7 8 9 10
> z
NULL
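As an aside, this particular recurrence has a vectorized middle ground: because the dependence on earlier iterations is just a running total, cumsum() reproduces the loop. This is a sketch specific to this example, not a general substitute for loops with sequential dependence:
y = c(0, 2:10)            # y as the loop above builds it
x = c(0, cumsum(y)[1:9])  # x[i] = x[i-1] + y[i-1] is a shifted cumulative sum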
Tuesday, March 12, 2013
Phycas woes
I have been having very good luck with using Phycas to run stepping stone analyses to estimate the marginal likelihood of different partitioning schemes. I've obtained the marginal likelihood estimates of all of the partitioning schemes I intend to try for one of the phylogenies I am building...except for the last one.
This last one happens to be the most parameter-rich scheme, with every gene partitioned by codon position. Perhaps this is why it is having issues. The stepping stone MCMC ran smoothly, although it took over four days to run. The issue happened at the very end as it was calculating autocorrelations and estimated sample sizes:
Since what I really care about is the marginal likelihood estimate, I wanted to see if I could bypass the autocorrelation calculations and go straight to that. My lack of knowledge about Python was fairly limiting in this respect, but I managed to comment out the appropriate lines from the SumPImpl.py script within the Phycas package to run only the marginal likelihood calculations. Unfortunately, this is what I see then:
Perhaps the fact that the autocorrelation for the log likelihood was NA for beta = 0.0 has something to do with this... I think I will eventually try to run this script again when I have no others running to see if a different MCMC run will work.
Monday, January 28, 2013
Making Phylogenies Pretty
I'd like anyone and everyone to chime in on this. I need to create high-quality figures of phylogenies. I'm currently using Mesquite to get the backbone of the tree, but I could use any advice on getting the tree looking good enough to be on a PowerPoint slide:
I'd love to take a look at what people consider to be good phylogeny figures and what programs people use to make things look good. But mostly, I just want some more ideas about how to make this look better...even from those of you who have no experience with phylogenies at all!
Edit: I'm collapsing all the edits into one. I've updated it such that all of the taxa are putatively monophyletic.
Thursday, January 24, 2013
MrBayes trouble
Sometimes I have to stop a MrBayes script before it finishes the entire number of generations I originally planned. Usually this is no trouble as I can easily use the values obtained before I stopped the script to continue with the analysis (by using append = YES) or to simply summarize what I already have. But occasionally, I get a mysterious error:
Abort trap 6
This occurs when I have terminated a script early and just want to summarize the trees I already have. MrBayes has no trouble reading the tree files, but this error occurs before it will analyze them, and it quits MrBayes entirely.
Of course, now that I want it to give me some trouble, it works:
Previously, it would say "Abort trap 6" instead of "General explanation:" and continuing. I guess I just need to let it run for longer and try again.
Running on a server
I use MrBayes quite a lot, and my analyses often take quite a long time to run. That's why I use the lab's server to run most of my MrBayes scripts. Its name is Aquarium:
It was recently upgraded to 8 cores, so I frequently have seven scripts running at once (I like to leave at least one for someone else who might need it). I also submit them as low priority so that someone with more pressing needs can use the high priority slots.
Learning how to use Aquarium was the first time I learned anything about SSH (secure shell) or SCP (secure copy), and back when I only had a PC, I had to use PuTTY to connect. I've now started to get the hang of it, but here are a couple of tricks I've used recently to overcome some issues.
One thing I noticed was that some scripts were taking an abnormally long time! Months of waiting for 50 million generations. Most of the ones taking a long time were runs under the prior, where I replaced the sequence data with the appropriate number of question marks. For some reason, MrBayes will sometimes just stall when estimating under the prior alone. I know this because I used a little trick to see whether my file sizes were changing:
ls -l
This lists the files and folders with details, including file size. If the script is running correctly, most of my files (like my .p and .t files) will grow bigger and bigger. But even overnight, they were staying the same. Now came a trickier part:
When I use top, I can see the processes that are running. Since some of my scripts are stalled, I just want to stop them and download the files using SCP. To stop the correct process, I need to use the correct PID:
kill -9 [PID]
If I happen to remember the order in which I submitted my scripts, I can just use the time that it's been running to find the correct PID. But if I don't remember, I can use this:
ps aux | grep [command name]
This way, I know exactly what script corresponds with what PID!
Tuesday, January 22, 2013
Fun with vectorizing
I found myself staying up way too late and having way too much fun helping a friend with her R homework. One of the things we did was to vectorize a function! Her assignment was to create plots using the Ricker model, with varying values for r:
So the first thing we did was to write a function that would calculate the population size over a certain time period for a single value of r:
Ricker = function(r, K = 100, n0 = 50, tf = 100) {
    n = numeric(length = tf)
    n[1] = n0
    for(t in 1:(tf-1)) {
        # We have tf - 1 because when t = 1, we are actually
        # calculating n[2], so if we want to end on n[tf], t
        # should end on tf - 1
        n[t+1] = n[t]*exp(r*(1-n[t]/K))
    }
    n
}
Changing this to work with a longer vector of r values was fairly simple:
vectorizedRicker = function(r, K = 100, n0 = 50, tf = 100) {
    NumofRs = length(r)
    # First, we're changing our output to be
    # a matrix instead of a vector
    n = matrix(0, NumofRs, tf)
    # We could vectorize over n0 as well, or any of the
    # other parameters, if we set the number of rows for n
    # to be equal to the length of the longest vector
    # of parameter inputs
    n[,1] = n0
    for(t in 1:(tf-1)) {
        # Now, we just change the for loop to reflect that it
        # is working on entire columns of a matrix rather
        # than just a single element of a vector
        n[,t+1] = n[,t]*exp(r*(1 - n[,t]/K))
    }
    n
}
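As a quick usage sketch (the r values here are arbitrary examples), one call now produces a matrix with one trajectory per row, which matplot() can draw in one go:
rs = c(0.5, 1.5, 2.3, 2.7)  # arbitrary example values of r
out = vectorizedRicker(rs)  # one row per value of r, one column per time step
matplot(t(out), type = "l", xlab = "time", ylab = "population size")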
rfishbase
I was given the task of using the data in FishBase to find percentages of different fish in different habitats. Carl Boettiger created an R package (rfishbase) to make the data very accessible in R, but unfortunately, some of the functions in the version on CRAN didn't work for me. Instead, I used the code he has on GitHub. So if anyone else is having trouble getting updateCache() to work, or wants to modify the code to extract a piece of information that isn't extracted by the original fishbase() function, that is the place to go! Another nice thing is that you can use getData() to download smaller chunks of data if you don't want to overwhelm the server.
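For anyone who hasn't installed a package straight from GitHub before, devtools makes it painless. A sketch only; the repository name and username here are my assumptions, so double-check them on Carl's GitHub page:
# the username/repo below are assumed, not verified
library(devtools)
install_github("rfishbase", username = "cboettig")
library(rfishbase)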
Thursday, January 10, 2013
Moving on from base
So I gave myself a project in my last post to remove hard-coded numbers in my code to plot a stacked histogram. I succeeded, but it may have been more work than it's worth. Perhaps it's time for me to move on to ggplot2 or lattice?
cats = levels(dat$Type) # same as before
xaxis = round(range(scores$x)*5)/5 # the range, rounded to every .2
breaks = seq(xaxis[1]-.2, xaxis[2]+.2, .2) # the breaks to use for histograms
histData = sapply(cats, function(x) {
    hist(scores$x[which(dat$Type == x)],
         breaks = breaks)$counts
})
allhistData = do.call(rbind, list(histData))
barplot(t(allhistData), space = 0, ylab = "number of sequences", xlab = "LD")
zero = which(breaks==0)-2 # find the 0 on my x-axis. The plot starts at -1
marks = seq(floor(xaxis[1]), ceiling(xaxis[2])) # the tick marks I want to use, every integer
ticks = numeric() # this vector will hold where I want the tick labels to appear
while(zero >= -1) {
    ticks[length(ticks)+1] = zero
    zero = zero - 5
} # goes down by five, which corresponds to one LD unit
while(length(ticks) < length(marks)) {
    ticks[length(ticks)+1] = max(ticks) + 5
} # go up by five, until I have as many locations as I have labels
axis(side = 1, at = ticks[order(ticks)], labels = marks) # I need to order the tick marks in sequential order
legend("topleft", legend = cats, text.col = c("black", "gray"))
This produces the same plot from the previous post.
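And since I mentioned ggplot2: here is a sketch of what the same stacked histogram might look like there, assuming a data frame built from the same scores$x and dat$Type used above:
library(ggplot2)
df = data.frame(LD = scores$x, Type = dat$Type)
# geom_histogram stacks the fill groups by default
ggplot(df, aes(x = LD, fill = Type)) +
    geom_histogram(binwidth = 0.2) +
    ylab("number of sequences")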
Linear Discriminant Analysis
This follows fairly naturally from the PCA I did on the data in my previous post. I have a dataset of several quantitative variables that can be grouped by a categorical variable. This time, I am going to maximize the separation between the two groups to see what traits are important in determining group inclusion.
While looking for information about linear discriminant analyses (LDA), I came across a very informative site by Dr. Avril Coghlan. I used a couple of the functions available from that website, including groupStandardise and calcWithinGroupsVariance, to obtain meaningful coefficients to determine what traits are more informative about group membership.
The lda() function is in the MASS package. Since I only have two groups, I only have one discriminant axis, as the number of discriminant axes is equal to the number of groups minus one. Thus I won't get a nice scatterplot the way I did with my PCA. Instead, if I try to plot the output to lda(), I get this:
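For completeness, the call itself is short. A minimal sketch, assuming the grouping variable is the Type column and the remaining columns of dat are the quantitative variables:
library(MASS)
fit = lda(Type ~ ., data = dat)
scores = predict(fit)  # scores$x holds the discriminant-axis values used below
plot(fit)              # the default plot: one histogram of LD1 per group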
I wanted a stacked histogram instead, so I had to do a little more work.
cats = levels(dat$Type) # the categories I'm using
histData = sapply(cats, function(x) {
    hist(scores$x[which(dat$Type == x)],
         breaks = seq(-3, 2.2, .2))$counts
    # bad me, I hard-coded the breaks. My next project can be to use the same breaks as the above plot
})
# this gives me a matrix of the counts for each interval for each type
allhistData = do.call(rbind, list(histData))
barplot(t(allhistData),
        space = 0,
        ylab = "number of sequences",
        xlab = "LD")
axis(side = 1,
     at = c(-1, 4, 9, 14, 19, 24),
     labels = c(-3, -2, -1, 0, 1, 2)) # more bad hard-coding
legend("topleft", legend = cats, text.col = c("black", "gray"))
And here is what I get:
It doesn't look great, but when did stacked histograms ever look good? It would probably look better if I had a lot more observations.
Tuesday, January 8, 2013
PCAs and Plotting
Principal Components Analysis, or PCA, is fairly straightforward using the princomp() function in R. But the data I have is divided by two factors: the type of response and the individual. I wanted to plot using different colors for the type of response and different plotting symbols for the individuals. Simple enough, but I also didn't want to need to modify the code if the number of types of responses or the number of individuals changed (the latter is probably more likely, but I still want my code to be as general as possible). Here is an example generated from random data:
col = as.numeric(dat[,x])+2
and
pch = dat$Individual+14
The +2 and +14 are there to get the colors and plotting symbols I wanted. I could also assign specific colors by doing this:
colors = c("green", "blue")
ptype = c(15, 16, 17)
Then add these as arguments to plot().
col = colors[as.numeric(dat[,x])]
pch = ptype[dat$Individual]
In this case, I would need to make sure that I have enough colors and plotting symbols so that I don't run out.
The fun happens with the legend. For the text, I used this argument:
legend = c(levels(dat[,x]), unique(dat$Individual))
This remains flexible for any number of types or individuals. For the colors, I used a combination of seq() and rep() to get the numbers I wanted. If I didn't want my code to be general, I could simply use this:
text.col = c(3, 4, 1, 1, 1)
Instead, I used this:
text.col = c(seq(3, length(levels(dat[,x]))+2), rep(1, length(unique(dat$Individual))))
seq(3, length(levels(dat[,x]))+2) gives me a sequence of integers from three all the way to the number of types I have plus two (because I started with three instead of one). rep(1, length(unique(dat$Individual))) gives me 1 repeated for every unique individual.
Finally, the plotting symbols:
pch = c(rep(NA, length(levels(dat[,x]))), unique(dat$Individual)+14)
This is essentially the opposite of what I just did for the text color, except that I don't want any symbol next to the two types. Even though this looks like (and is) a lot more typing than simply hard-coding the appropriate numbers, this lets me use exactly the same code to make the figure even after I have doubled or tripled the amount of data I have.
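Putting all of the pieces together, here is a self-contained sketch with random data; the column names and sizes are made up for illustration:
set.seed(1)
dat = data.frame(Type = factor(rep(c("A", "B"), each = 15)),
                 Individual = rep(1:3, 10),
                 matrix(rnorm(30 * 4), 30, 4))
pc = princomp(dat[, -(1:2)])
x = "Type"
plot(pc$scores[, 1], pc$scores[, 2],
     col = as.numeric(dat[, x]) + 2,  # one color per response type
     pch = dat$Individual + 14,       # one symbol per individual
     xlab = "PC1", ylab = "PC2")
legend("topleft",
       legend = c(levels(dat[, x]), unique(dat$Individual)),
       text.col = c(seq(3, length(levels(dat[, x])) + 2),
                    rep(1, length(unique(dat$Individual)))),
       pch = c(rep(NA, length(levels(dat[, x]))), unique(dat$Individual) + 14))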
Monday, January 7, 2013
jModelTest2
I had been having difficulties with jModelTest, and while searching for information about one of the problems I was having**, I discovered jModelTest2. I am quite surprised I hadn't come across it before, since it apparently has been around in some form or another for over a year, but I somehow failed to hear about it until now.
The first thing of note is that jModelTest2 can test all submodels of the GTR model. I am attempting this with my first trial to see how long it takes. Unlike jModelTest, jModelTest2 can use a hill-climbing algorithm to systematically find the best-fit models. Because the software I use (Phycas, MrBayes) doesn't accommodate all submodels, this may not be too relevant. However, BEAST or RevBayes can accommodate any submodel, so it may be useful in conjunction with those.
The second thing of note is high-performance computing, supporting up to eight threads. Unfortunately, I don't really know much about this. My computer has a single-core processor, but it still seems to benefit from the threading. I may do a test run to see if the time it takes is dramatically reduced. Also, I'm not quite sure whether "iddle" is a typo.
All in all, jModelTest2 seems to be having no difficulties with the files that jModelTest threw errors for, so I take that as a good sign.
** In the old version of jModelTest, I was getting this error:
Notice "BIONJ-JC tree: null." This was the error I was trying to find more information on when I found out about jModelTest2.