Tuesday, May 1, 2012

Geiger Bug

Big thanks to Luke Harmon for supplying me with updated code that fixes a small bug in the geiger package!  For those of you who have had issues with fitDiscrete in the past, you can email Luke for a fix.

The issue that arises is when you use either the symmetrical model (model = "SYM") or the all rates different model (model = "ARD").  You will get an error message looking like this after waiting for some time for the likelihood optimization to finish:

Finding the maximum likelihood solution
[0        50      100]
[....................]
Error in getQ(exp(out$par), nRateCats, model) :
  You must supply the correct number of rate categories.

But you don't actually control what goes into getQ through the arguments you give to fitDiscrete.  So if you encounter this issue using either of the above models, be sure to email Luke to get the updated code.  A revamped version of geiger is on the way, so hopefully this issue won't be around for much longer!

The one thing Luke cautioned me about using these two models is that they can both quickly become parameter-rich.  The number of rate parameters for the symmetric model is n * (n-1) / 2, while for the all rates different model, the number of rate parameters is n * (n-1) where n is the number of discrete states.  For example, if I have five discrete states, I would have ten parameters for the symmetric model and twenty parameters for the all rates different model.  I need a lot of data to be estimating so many parameters!

I did try out his updated code, and it works perfectly fine.  It still does take a while (I'm sure the speed depends on the size of your data set, the shape of your phylogeny, your computer's specs...), but I hear that there is already a faster version of the code if you ask Luke for it.  It would definitely be interesting to test out how much of a difference there is and what they changed to make it faster.

If all you are trying to estimate is a single rate model (model = "ER", which stands for equal rates), then there is no need to use an updated version of the function.  The old version will work just fine.  The problem with the original was that the calculation of the rate categories occurred twice:  once to get the number of rate categories, and another within getQ to test whether the number of rate categories was correct.  So for example, with five discrete states nRateCats in the above will equal ten for the symmetric model.  But you will get an error to supply the correct number of rate categories because within getQ, it compares the number of parameters to nRateCats * (nRateCats-1) / 2, which would compare ten to forty-five.

No comments:

Post a Comment