Thursday, March 21, 2013

Phycas woes (semi-)resolved

Just a little while ago, I wrote about getting nan for the number I need: the marginal likelihood of a partitioning scheme. Since then, I have managed to dissect the Phycas package contents enough to obtain that number.

One of the first things I figured out was how to change the working directory in Python:

import os
os.chdir("path")
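A quick way to confirm the change took effect is os.getcwd(), which returns the current working directory. A minimal sketch (using tempfile only so the example has a directory guaranteed to exist):

```python
import os
import tempfile

# Remember where we started, then hop into a directory
# that exists on any system.
start = os.getcwd()
target = tempfile.gettempdir()
os.chdir(target)
print(os.getcwd())  # now inside the temp directory

# Hop back so the rest of the session is unaffected.
os.chdir(start)
```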

Simple enough. Next, I needed to find the inner workings of the function sump(). From my rummaging last time, I knew that the files SumP.py and SumPImpl.py would be important in my search.

In SumPImpl.py, I found the function that produces the marginal likelihood estimate: gss(). gss() is a method of the class ParamSummarizer, and ParamSummarizer is instantiated inside a method of the class SumP, found in SumP.py. Here is the relevant code from SumP.py:

from phycas.Phycas.SumPImpl import ParamSummarizer

class SumP(PhycasCommand):

......

    def __call__(self, **kwargs):
        self.set(**kwargs)
        c = copy.deepcopy(self)
        param_summarizer = ParamSummarizer(c)
        param_summarizer.run()

So it seems that what I want is to create my own ParamSummarizer object. But what do I put in the parentheses? There is this mysterious entity called self, which appears throughout both scripts. While I still don't fully understand how it works, self refers to the instance itself and is what lets a method reach the data and other functions within the same class. What seemed to work for me was this:

a = ParamSummarizer(sump)

Now, if I run a.run() or any other function defined within ParamSummarizer, it acts the way I expect it to, as long as I have previously specified the file to work with:

sump.file = "work.p"
a.run()
Autocorrelations (lag 1):

        beta ->     1.00000      0.90000      0.80000      0.70000      0.60000      0.50000      0.40000      0.30000      0.20000      0.10000      0.00000
            lnL     0.96409      0.36369      0.38353      0.34894      0.40534      0.30339      0.30469      0.32123      0.29938      0.23546          nan
        lnPrior     0.86502      0.77219      0.74660      0.69551      0.65932      0.59464      0.55650      0.50884      0.45802      0.35500      0.01570
Error: (34, 'Result too large')

Well, this was what was happening earlier. How might I get the exact value I want? The function that produces the marginal likelihood estimate in our case is gss(). It takes three arguments: self, betas, and p. self is supplied automatically when the method is called on an instance, so let's focus on betas and p.
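The way self gets filled in can be seen with a toy class (nothing Phycas-specific, just an illustration): calling a method on an instance is the same as calling it through the class with the instance passed explicitly as the first argument.

```python
class Greeter:
    def __init__(self, name):
        self.name = name  # self is the instance being constructed

    def greet(self, suffix):
        # self lets the method reach the instance's own attributes
        return "hello " + self.name + suffix

g = Greeter("phycas")
# The usual call: Python passes g as self automatically.
print(g.greet("!"))            # hello phycas!
# Equivalent explicit form: self supplied by hand.
print(Greeter.greet(g, "!"))   # hello phycas!
```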

I found out what betas and p are by looking at the function that calls gss(): marginal_likelihood(). But of course, marginal_likelihood() also requires some arguments, so I had to look at the function that calls it as well: handleFile(). Finally, a function where all we need to know is the file name:

def handleFile(self, fn):
    burnin = self.opts.burnin
    lines = open(fn, 'r').readlines()
    if len(lines) < 3 + burnin:
        self.output("File '%s' does not look like a parameter file (too few lines)")
    else:
        headers = lines[1].split()
        if headers[1] == 'beta':
            try:
                self.marginal_likelihood(headers, lines, burnin)
            except InvalidNumberOfColumnsError, e:
                print e
        else:
            self.std_summary(headers, lines, burnin)

I decided to go step by step to make sure that I got the right variables in the right places. First, the code in handleFile():

lines = open("work.p", 'r').readlines()
headers = lines[1].split()
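For a hypothetical header line in the layout handleFile() expects (whitespace-separated column names, with 'beta' in position 1), split() with no arguments handles tabs and the trailing newline:

```python
# Made-up header line; the real work.p file has its own columns,
# but handleFile() only checks that column 1 is named 'beta'.
header_line = "Gen\tbeta\tlnL\tlnPrior\tlnRefDens\n"
headers = header_line.split()
print(headers)  # ['Gen', 'beta', 'lnL', 'lnPrior', 'lnRefDens']
```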

Followed by code in marginal_likelihood():

betas = []
params = {}
for h in headers:
     params[h] = {}

row_start = 2
curr_beta = None
for i,line in enumerate(lines[row_start:]):
     parts = line.split()
     beta = float(parts[1])
     if curr_beta is None or curr_beta != beta:
         curr_beta = beta
         betas.append(beta)
         for h,x in zip(headers,parts):
             params[h][beta] = [float(x)]
     else:
         for h,x in zip(headers,parts):
             params[h][beta].append(float(x))
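To convince myself the grouping works, here is the same loop run on a few synthetic lines (made-up values; a real work.p file has many more rows and columns). Each new beta value starts fresh lists in params; repeated betas append to them.

```python
# Synthetic parameter file contents: Gen, beta, lnL, two samples per beta.
headers = ['Gen', 'beta', 'lnL']
lines = ['comment line\n',
         'Gen\tbeta\tlnL\n',
         '1\t1.0\t-10.0\n',
         '2\t1.0\t-11.0\n',
         '3\t0.9\t-12.0\n',
         '4\t0.9\t-13.0\n']

betas = []
params = {}
for h in headers:
    params[h] = {}

row_start = 2
curr_beta = None
for i, line in enumerate(lines[row_start:]):
    parts = line.split()
    beta = float(parts[1])
    if curr_beta is None or curr_beta != beta:
        # First sample at a new beta: record it and start fresh lists.
        curr_beta = beta
        betas.append(beta)
        for h, x in zip(headers, parts):
            params[h][beta] = [float(x)]
    else:
        # Subsequent samples at the same beta: append.
        for h, x in zip(headers, parts):
            params[h][beta].append(float(x))

print(betas)               # [1.0, 0.9]
print(params['lnL'][1.0])  # [-10.0, -11.0]
```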

This gives us reasonable values for betas and params, which are going to be our betas and p going into gss():

a.gss(betas, params)

    b_(k-1)  beta_incr          n            lnRk        lnR(cum)
      0.900      0.100      10000    -2320.508943    -2320.508943
      0.800      0.100      10000    -2321.152357    -4641.661300
      0.700      0.100      10000    -2320.351828    -6962.013128
      0.600      0.100      10000    -2320.291323    -9282.304451
      0.500      0.100      10000    -2320.221322   -11602.525773
      0.400      0.100      10000    -2321.118544   -13923.644317
      0.300      0.100      10000    -2322.397532   -16246.041849
      0.200      0.100      10000    -2324.258373   -18570.300222
      0.100      0.100      10000    -2327.929248   -20898.229469
      0.000      0.100      10000             nan             nan
 nan Generalized Stepping Stone method

So this is still giving us nan. Let's step through gss() and see why this is happening:

likes = p['lnL']
priors = p['lnPrior']
working_priors = p['lnRefDens']
lnR = 0.0
nbetas = len(betas)

This is all of the code before the big for-loop that takes us through each of the beta values. I'm going to change the for loop so that it ends just before the iteration that gives us nan by changing the ending point from nbetas to nbetas-1:

for i in range(1,nbetas-1):
    blarger = betas[i - 1]
    bsmaller = betas[i]
    beta_incr = blarger - bsmaller
    loglikes = likes[bsmaller]
    logpriors = priors[bsmaller]
    logwpriors = working_priors[bsmaller]
    n = len(loglikes)
    etak = max([(lnL + lnp - lnwp) for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors)])
    tmp = 0.0
    for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
        log_term = beta_incr*(lnL + lnp - lnwp - etak)
        tmp += math.exp(log_term)
    tmp /= float(n)
    lnRk = beta_incr*etak + math.log(tmp)
    lnR += lnRk
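The etak subtraction in this loop is the standard log-sum-exp trick: shifting every term by the maximum before exponentiating keeps math.exp() from overflowing (which is exactly error 34, 'Result too large', from earlier). A sketch with made-up log terms shows the shifted computation agrees with the naive one whenever the naive one is representable:

```python
import math

beta_incr = 0.1
# Hypothetical per-sample values of (lnL + lnp - lnwp).
terms = [-2300.0, -2310.0, -2305.0]

# Naive: exponentiate beta_incr*term directly, then take the log mean.
naive = math.log(sum(math.exp(beta_incr * t) for t in terms) / len(terms))

# Max-shifted, as gss() does with etak: subtract the max inside exp,
# add it back outside the log.
etak = max(terms)
tmp = sum(math.exp(beta_incr * (t - etak)) for t in terms) / len(terms)
shifted = beta_incr * etak + math.log(tmp)

print(naive, shifted)  # the two values are nearly identical
```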

Now, as I step through the code again using i = nbetas - 1, I see where the error is coming from:

    for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
        log_term = beta_incr*(lnL + lnp - lnwp - etak)
        tmp += math.exp(log_term)

During some iterations, log_term is nan, and looking through my work.p file, I see that this is because there are nans and infs in the parameter file. What if I could just get Python to ignore these samples and use everything else? To do this, I imported numpy and used its isnan() function:

import numpy

for lnL,lnp,lnwp in zip(loglikes,logpriors,logwpriors):
    log_term = beta_incr*(lnL + lnp - lnwp - etak)
    if numpy.isnan(log_term):
        n -= 1
    else:
        tmp += math.exp(log_term)
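The same nan-skipping idea in isolation (math.isnan works too, if numpy isn't handy): nan terms are dropped from both the running sum and the count n, so the subsequent division still produces a mean over only the clean samples.

```python
import math

# Made-up log terms, two of them poisoned by nan.
log_terms = [-1.0, float('nan'), -0.5, float('nan'), 0.0]

tmp = 0.0
n = len(log_terms)
for log_term in log_terms:
    if math.isnan(log_term):
        n -= 1  # drop the bad sample from the count...
    else:
        tmp += math.exp(log_term)  # ...and keep it out of the sum

tmp /= float(n)
print(n)    # 3 clean samples survive
print(tmp)  # mean of exp(-1.0), exp(-0.5), exp(0.0)
```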
