Computing Adventures and Phylogenetics

Tuesday, April 16, 2013

I have been using my makebinary function without any issues, but now that I am working with a much larger tree, I am running into a problem with the number of characters in my newick tree.
My newick tree is 4434 characters long; add the makebinary() call around it, and the whole command comes to 4448 characters. It turns out that RStudio truncates console commands at 4096 characters. Fortunately, running the command in the plain R console works, although even there the input gets broken at 4096 characters:
Note how there is now a new line (headed by the +) at the beginning of the second to last line.
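One way around the limit, I think, is to keep the newick string in a file and read it into R rather than pasting it at the prompt. A rough sketch (the file name is made up here, and I am assuming makebinary() takes the newick string as text):
tree.string <- readLines("bigtree.nwk", warn = FALSE)   # read the newick string from a file
tree.string <- paste(tree.string, collapse = "")         # collapse it in case the file wraps across lines
result <- makebinary(tree.string)                        # nothing long is typed at the console, so no 4096-character limit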
Monday, March 25, 2013
R shortcuts
These might come up in a variety of contexts, but I think it's nice to know that R has a few built-in constants to prevent someone from tediously creating a vector such as c("a", "b", "c", ...). You can use LETTERS or letters, depending on whether you want the capital or lower-case letters. There are also abbreviated and full month names, month.abb and month.name, which are very handy for making plots. And finally, there is pi.
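For reference, this is roughly what they look like at the console:
LETTERS[1:5]    # "A" "B" "C" "D" "E"
letters[24:26]  # "x" "y" "z"
month.abb       # "Jan" "Feb" "Mar" ... "Dec"
month.name[3]   # "March"
pi              # 3.141593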
I recently used LETTERS and letters to create a different plotting symbol for each individual data point so I could easily identify which point is which. I had more than 62 points (62 being the 26 capital letters, 26 lower-case letters, and 10 digits), so I had to use some punctuation symbols as well. I'm not sure whether there's a more elegant way to do this, but if anyone knows of one, I'd love to hear about it.
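Here is a rough sketch of the kind of thing I mean (the data and the extra symbols are just placeholders):
syms <- c(LETTERS, letters, 0:9, "!", "#", "$", "%", "&", "*")   # 68 distinct single-character symbols
x <- rnorm(65)
y <- rnorm(65)
plot(x, y, pch = syms[seq_along(x)])   # each point gets its own character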
Friday, March 22, 2013
More Python issues
This time it is completely unrelated to Phycas. I've been using Biopython to concatenate several nexus files that share some taxa into one nexus file. This is something I had been doing on my Mac, but I am now attempting it again on my new (to me) PC. After installing Python and Biopython, I tried to run the script with the appropriate modifications for the nexus files I am trying to combine, and I immediately ran into some issues:
getattr(self,'_'+line.command)(line.options)
AttributeError: 'Nexus' object has no attribute '_mitochondrial'']'
Considering my script (below) has no mention of mitochondrial, nor does Nexus.py in the Biopython package, this must be something about the nexus files themselves. When I look in the nexus file, it is divided into a taxa block and a character block. If I combine the two into a single data block and remove all of the information in the taxa block (other than the number of taxa, which I leave in the dimensions of the data block), everything works smoothly!
I don't know if this is related to the Mac/PC switch or whether it was just something about the newer version of Biopython, but I am glad it works.
from Bio.Nexus import Nexus

# read each single-gene nexus file, pairing it with its file name
handles = [open('COI.nex', 'r'), open('cytb.nex', 'r'), open('RAG1.nex', 'r')]
nexi = [(handle.name, Nexus.Nexus(handle)) for handle in handles]

# combine the matrices into one and write the result
combined = Nexus.combine(nexi)
combined.write_nexus_data(filename='COMBINED.nex')
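In case it helps anyone, after merging the blocks each input file ends up with a single data block shaped roughly like this (toy alignment, just to show the structure):
#NEXUS
begin data;
    dimensions ntax=3 nchar=6;
    format datatype=dna missing=? gap=-;
    matrix
        taxon_A ACGTTA
        taxon_B ACGTTC
        taxon_C ACG-TA
    ;
end;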
Thursday, March 21, 2013
Phycas woes (semi-)resolved
Just a little while ago, I wrote about getting nan for the number I need in my calculation for the marginal likelihood of a partitioning scheme. I have managed to dissect the Phycas package contents enough to obtain the number I need.
One of the first things I figured out was how to change the working directory in Python:
import os
os.chdir("path")
Simple enough. Next, I needed to find the inner workings of the function sump(). From my rummaging last time, I knew that the files SumP.py and SumPImpl.py would be important in my search.
In SumPImpl.py, I found the function I needed to produce the marginal likelihood estimate: gss(). gss() is a method defined within the class ParamSummarizer, and ParamSummarizer is instantiated inside a method of the class SumP, found in SumP.py. Here is the relevant code from SumP.py:
from phycas.Phycas.SumPImpl import ParamSummarizer

class SumP(PhycasCommand):
    ......
    def __call__(self, **kwargs):
        self.set(**kwargs)
        c = copy.deepcopy(self)
        param_summarizer = ParamSummarizer(c)
        param_summarizer.run()
So it seems to me that what I want is to create a ParamSummarizer object of my own. But what do I put in the parentheses? There is this mysterious entity here called self, which can be found throughout both scripts. While I still don't fully understand how it works, self refers to the object a method is called on, Python supplies it automatically, and it is what lets functions within the same class call one another. What seemed to work for me was this:
a = ParamSummarizer(sump)
Now, if I run a.run() or any other function defined within ParamSummarizer, it acts the way I expect it to, as long as I have previously specified the file to work with:
sump.file = "work.p"
a.run()
Autocorrelations (lag 1):
beta -> 1.00000 0.90000 0.80000 0.70000 0.60000 0.50000 0.40000 0.30000 0.20000 0.10000 0.00000
lnL 0.96409 0.36369 0.38353 0.34894 0.40534 0.30339 0.30469 0.32123 0.29938 0.23546 nan
lnPrior 0.86502 0.77219 0.74660 0.69551 0.65932 0.59464 0.55650 0.50884 0.45802 0.35500 0.01570
Error: (34, 'Result too large')
Well, this was what was happening earlier. How might I be able to get the exact value I want? The function that produces the marginal likelihood estimates in our case is gss(). This requires three arguments: self, betas, and p. self is given when the function is called, so let's focus on betas and p.
I found what betas and p are by looking at the function that calls gss(), marginal_likelihood(). But of course, marginal_likelihood() also requires some arguments, so I had to look at the function that calls marginal_likelihood() as well: handleFile(). Finally, a function where all we need to know is the file name:
def handleFile(self, fn):
    burnin = self.opts.burnin
    lines = open(fn, 'r').readlines()
    if len(lines) < 3 + burnin:
        self.output("File '%s' does not look like a parameter file (too few lines)")
    else:
        headers = lines[1].split()
        if headers[1] == 'beta':
            try:
                self.marginal_likelihood(headers, lines, burnin)
            except InvalidNumberOfColumnsError, e:
                print e
        else:
            self.std_summary(headers, lines, burnin)
I decided to go step by step to make sure that I got the right variables in the right places. First, the code in handleFile():
lines = open("work.p", 'r').readlines()
headers = lines[1].split()
Followed by code in marginal_likelihood():
betas = []
params = {}
for h in headers:
    params[h] = {}
row_start = 2
curr_beta = None
for i,line in enumerate(lines[row_start:]):
    parts = line.split()
    beta = float(parts[1])
    if curr_beta is None or curr_beta != beta:
        curr_beta = beta
        betas.append(beta)
        for h,x in zip(headers,parts):
            params[h][beta] = [float(x)]
    else:
        for h,x in zip(headers,parts):
            params[h][beta].append(float(x))
This gives us reasonable values for betas and params, which are going to be our betas and p going into gss():
a.gss(betas, params)
b_(k-1) beta_incr n lnRk lnR(cum)
0.900 0.100 10000 -2320.508943 -2320.508943
0.800 0.100 10000 -2321.152357 -4641.661300
0.700 0.100 10000 -2320.351828 -6962.013128
0.600 0.100 10000 -2320.291323 -9282.304451
0.500 0.100 10000 -2320.221322 -11602.525773
0.400 0.100 10000 -2321.118544 -13923.644317
0.300 0.100 10000 -2322.397532 -16246.041849
0.200 0.100 10000 -2324.258373 -18570.300222
0.100 0.100 10000 -2327.929248 -20898.229469
0.000 0.100 10000 nan nan
nan Generalized Stepping Stone method
So this is still giving us nan. Let's step through gss() and see why this is happening:
likes = p['lnL']
priors = p['lnPrior']
working_priors = p['lnRefDens']
lnR = 0.0
nbetas = len(betas)
This is all of the code before the big for-loop that takes us through each of the beta values. I'm going to change the for loop so that it ends just before the iteration that gives us nan by changing the ending point from nbetas to nbetas-1:
import math   # needed for math.exp() and math.log() below

for i in range(1,nbetas-1):
    blarger = betas[i - 1]
    bsmaller = betas[i]
    beta_incr = blarger - bsmaller
    loglikes = likes[bsmaller]
    logpriors = priors[bsmaller]
    logwpriors = working_priors[bsmaller]
    n = len(loglikes)
    etak = max([(lnL + lnp - lnwp) for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors)])
    tmp = 0.0
    for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
        log_term = beta_incr*(lnL + lnp - lnwp - etak)
        tmp += math.exp(log_term)
    tmp /= float(n)
    lnRk = beta_incr*etak + math.log(tmp)
    lnR += lnRk
Now, as I step through the code again using i = nbetas - 1, I see where the error is coming from:
for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
    log_term = beta_incr*(lnL + lnp - lnwp - etak)
    tmp += math.exp(log_term)
During some iterations, log_term is nan. And looking through my work.p file, I see that this is because there are nans and infs in the parameters file. What if I can just get Python to ignore these lines and use everything else? To do this, I imported numpy and used the isnan() function:
import numpy

for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
    log_term = beta_incr*(lnL + lnp - lnwp - etak)
    if numpy.isnan(log_term):
        n -= 1   # skip this sample and shrink the denominator accordingly
    else:
        tmp += math.exp(log_term)
Wednesday, March 13, 2013
for loops vs. sapply
Loops are very useful for doing the same thing (or similar things) multiple times. Unfortunately, in R, loops can be very clunky and slow.
For loops are perhaps more intuitive than sapply because the result you get is the same as if you ran the code within the for loop multiple times. What do I mean?
> x = numeric(10)
> y = numeric(10)
> z = for(i in 1:10) {
+ y[i] = i
+ x[i] = i*2
+ x[i]
+ }
> x
[1] 2 4 6 8 10 12 14 16 18 20
> y
[1] 1 2 3 4 5 6 7 8 9 10
> z
NULL
So the code within the for loop actually changes what is stored in x and y, but it does not return anything itself. Thus, z is NULL.
Let's use very similar code, except using sapply:
> x = numeric(10)
> y = numeric(10)
> z = sapply(1:10, function(i){
+ y[i] = i
+ x[i] = i*2
+ x[i]
+ })
> x
[1] 0 0 0 0 0 0 0 0 0 0
> y
[1] 0 0 0 0 0 0 0 0 0 0
> z
[1] 2 4 6 8 10 12 14 16 18 20
Wait, why are x and y still full of 0s? This happens because assignments made inside the function passed to sapply are local to that function call and do not affect the global environment. So setting y[i] = i within sapply does not change the vector y itself; it stays the vector of 0s it was initialized as.
The trouble with sapply is that, because of this, one iteration cannot depend on another iteration; that is, we cannot calculate x based on what x was in a previous iteration. This is in direct contrast to a for loop, where the changes happen in the global environment, so we can use a previous iteration to determine the current iteration, as in this example:
> x = numeric(10)
> y = numeric(10)
> z = for(i in 2:10) {
+ y[i] = i
+ x[i] = x[i-1]+y[i-1]
+ }
> x
[1] 0 0 2 5 9 14 20 27 35 44
> y
[1] 0 2 3 4 5 6 7 8 9 10
> z
NULL
Tuesday, March 12, 2013
Phycas woes
I have been having very good luck using Phycas to run stepping-stone analyses to estimate the marginal likelihood of different partitioning schemes. I've obtained the marginal likelihood estimates for all of the partitioning schemes I intend to try for one of the phylogenies I am building...except for the last one.
This last one happens to be the most parameter-rich scheme, with every gene partitioned by codon position. Perhaps this is why it is having issues. The stepping stone MCMC ran smoothly, although it took over four days to run. The issue happened at the very end as it was calculating autocorrelations and estimated sample sizes:
Since what I really care about is the marginal likelihood estimate, I wanted to see whether I could bypass the autocorrelation calculations and go straight to that. My lack of knowledge about Python was fairly limiting in this respect, but I managed to comment out the appropriate lines in the SumPImpl.py script within the Phycas package so that only the marginal likelihood calculation runs. Unfortunately, this is what I see then:
Perhaps the fact that the autocorrelation for the log likelihood was NA for beta = 0.0 has something to do with this... I think I will eventually try to run this script again when I have no others running to see if a different MCMC run will work.
Monday, January 28, 2013
Making Phylogenies Pretty
I'd like anyone and everyone to chime in on this. I need to create high-quality figures of phylogenies. I'm currently using Mesquite to get the backbone of the tree, but could use any advice on getting the tree looking good enough to put on a PowerPoint slide:
I'd love to take a look at what people consider to be good phylogeny figures and learn what programs people use to make things look good. But mostly, I just want some more ideas about how to make this look better...even from those of you who have no experience with phylogenies at all!
Edit: I'm collapsing all the edits into one. I've updated it such that all of the taxa are putatively monophyletic.