RColorBrewer is another great package that I was first introduced to in my Statistical Computing class. It comes with built-in palettes that let you choose colors with enough contrast for making plots.
The palettes come in three basic types: sequential, diverging, and qualitative. Sequential and diverging palettes are great for plots where you want the colors to show an order. The difference between them is a little subtle: sequential palettes show a smooth gradient from low to high, while diverging palettes emphasize both extremes around a neutral midpoint.
The qualitative palettes are best for categorical data with no ordering among categories. There are several sets, differing in their maximum number of colors, from 8 to 12. One interesting one is the Paired palette, which consists of six hues, each in a light and a dark version. I used it recently in a plot of different species, with males and females distinguished by lightness.
The same effect can be produced, perhaps to better effect, with different plotting symbols, but the Paired palette does a pretty good job.
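A minimal sketch of how I use it (assuming RColorBrewer is installed; the plotting variables in the last line are hypothetical):

```r
library(RColorBrewer)

# Show every palette at once, grouped into sequential, qualitative, and diverging
display.brewer.all()

# Pull all 12 colors of the Paired palette: six hues, each in light and dark
cols = brewer.pal(12, "Paired")

# Hypothetical use: one light/dark pair per species, lightness marking sex
# plot(x, y, col = cols[2 * as.numeric(species) - (sex == "female")], pch = 16)
```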
Friday, November 30, 2012
Monday, September 24, 2012
Classes and R
I heard about this Coursera course through the Davis R Users' Group (DRUG) mailing list, which has just started up. I first heard about DRUG less than a week ago and immediately signed up. The first meeting is this Friday.
It is apparently not a workshop but simply two hours of the week set aside to work with R in the company of other people working with R. At least, that seems to be the plan so far. Hopefully this will mean that I will get a lot of analysis done, and I might even come across some interesting problems I can talk about on this blog, both in my own work and in other people's. And I believe it will be a good environment to work with any sort of phylogenetics software.
And of course, it can be time set aside for me to work on my online course material for "Computing for Data Analysis." I enrolled in the class just today, so I haven't had a chance yet to check out the lectures, but I plan to do so soon. They say it is designed for first-year biostatistics graduate students, so we will see how it fits into what I already know. Regardless, I am sure it will be a blast! I can always use more practice.
There are lectures, quizzes, and longer programming assignments associated with the class. Coursera seems to have quite a variety of classes available. I hope to take a more thorough look at what relevant courses are available later . . . I would like to take any classes involving R and maybe some other languages as well.
I am also interested in checking out some of the Udacity courses, which were previously recommended to me by a friend who works for Google. I think none of their classes are about R (mostly Python, I believe), but I am sure many are relevant to those interested in programming languages and data analysis. They offer classes for those with no programming background as well.
Both Coursera and Udacity are free. While some Coursera courses don't give out certificates, this one does, as long as you get a passing grade (70/100). I believe all of the Udacity courses have an option for certification, so you can have some record that you put in the effort to learn the material.
This will be my first time taking an online course, but from my experience taking a statistical computing class, I think the assignments will be very useful in retaining information and building good habits, even without the certification accompanying it. Even still, it could be a nice addition to a C.V., and Udacity also offers to send resumes to companies. That's not necessarily relevant to those wishing to stay in academia, but it's nice to know that we're developing transferable skills!
Tuesday, August 28, 2012
likelihood reconstruction of ancestral states
In a previous post, I talked about how summarizing the state at each internal node over many make.simmap mappings did not correspond exactly with the ace reconstructions. I contacted Liam Revell about this, and he informed me that this may be because ace does not compute the scaled marginal likelihoods but the conditional likelihoods of the subtrees descending from each node. He suggested I try rerooting the tree at each internal node and using the ace reconstruction at the root to find the marginal likelihoods. Here is my attempt:
nodes = (exampleTree$Nnode+2):(exampleTree$Nnode*2+1) # the node numbers of the internal nodes (assuming a fully binary tree)
reRootAnc = t(sapply(nodes, function(x) {
    tr = reroot(exampleTree, node = x, position = 0) # reroot the tree at each internal node
    reconst = ace(x = discrete, phy = tr, type = "discrete", model = "SYM") # maximum likelihood ancestral state estimates
    reconst$lik.anc[1, ] # keep only the values for the root
}))
This is the result I got for an example tree:
Seems a little strange to me...I wonder what might be going on.
Thursday, August 9, 2012
I knew this shouldn't be so complicated...
I was trying to analyze my incomplete dataset, and I needed to remove data for species where I have measured fewer than four individuals. Since my dataset was small enough, it was easy to just remove them by hand, doing something like this:
> df
Species meas
1 a 9
2 a 1
3 a 7
4 a 5
5 b 7
6 c 0
7 c 5
8 c 2
9 c 9
10 c 1
11 d 3
12 d 2
> dfremoved = df[-which(df$Species=="b"),]
> dfremoved = dfremoved[-which(dfremoved$Species=="d"),]
> dfremoved
Species meas
1 a 9
2 a 1
3 a 7
4 a 5
6 c 0
7 c 5
8 c 2
9 c 9
10 c 1
But in order to do this systematically, I used a couple of steps.
> toofew = names(which(table(df$Species) < 4))
> toofew
[1] "b" "d"
First, I found the names of the species with fewer than four individuals. With this, I can remove all rows where df$Species matches any of these names.
> dfremoved = df[!(df$Species %in% toofew),]
> dfremoved
Species meas
1 a 9
2 a 1
3 a 7
4 a 5
6 c 0
7 c 5
8 c 2
9 c 9
10 c 1
Tuesday, July 31, 2012
Avoiding Repetition
If there is anything I learned in STA 141, it's the importance of the DRY principle: don't repeat yourself. Anything repetitive was heavily penalized in our grades, but repetitive code is also more error-prone and often takes much longer to write. Even so, it's easy to be lazy and fall into the "copy/paste then change one word" routine, especially when just exploring a data set. That's how I started out, but it turns out that doing it the 'right' way is even easier!
For example, I want a plot of every morphological variable against size (standard length), with the points colored by species and each point's plotting symbol showing the individual's number.
plot(dat$standard.length, dat$head.length, col = as.factor(dat$Species), pch = as.character(dat$Number))
Now I can copy/paste this line and replace "head.length" with all of my other variables. Simple enough.
But it turns out I have 24 variables. So doing this will take much longer than this simple loop:
meas = names(dat)[9:length(dat)] # all morphological variables except standard length
sapply(meas, function(x) {
    png(file = paste(x, ".png", sep = ""))
    plot(dat$standard.length, dat[, x], col = as.factor(dat$Species), pch = as.character(dat$Number))
    dev.off()
})
I can do better by putting down axis labels and such, but now I have .png files of each of my morphological variables that I can browse through with my favorite image viewer.
Sunday, July 29, 2012
Control Flow
I use if/else and for quite often, but rarely while or repeat. Even so, a couple of things about if/else give me trouble if I haven't used them in a while.
If everything is on one line, all is good:
> x = 1
> y = if(x>0) 1 else 0
> y
[1] 1
But if I put the curly braces in the wrong place:
> if(x>0) {
+ x = -x
+ y = 1
+ }
> else {
Error: unexpected 'else' in "else"
> x = x
> y = 0
> }
Error: unexpected '}' in "}"
So I have to always make sure to do this:
> if(x>0) {
+ x = -x
+ y = 1
+ } else {
+ x = x
+ y = 0
+ }
> x
[1] -1
> y
[1] 1
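For simple assignments like this, base R's vectorized ifelse sidesteps the brace placement entirely (and works on whole vectors at once):

```r
x = c(-2, 1, 3)
y = ifelse(x > 0, 1, 0)
y
# [1] 0 1 1
```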
Thursday, July 5, 2012
No polytomies allowed?
I have recently been in a position to want an efficient way to obtain a binary Newick-format tree from a Newick-format tree that may or may not have polytomies. Using a few functions from ape, this was fairly simple to obtain:
makebinary = function(newick) {
    tree = read.tree(text = newick)
    if(is.binary.tree(tree)) {
        return(newick)
    } else {
        return(write.tree(multi2di(tree)))
    }
}
This contains four very useful functions from ape: read.tree, write.tree, is.binary.tree, and multi2di. The first two are used to read/write Newick-format trees. The functions read.nexus and write.nexus can be used for Nexus-format trees. read.tree can be used for files as well. is.binary.tree checks whether a tree has any polytomies, and multi2di converts a tree with multichotomies to a fully dichotomous tree with some branches of length 0. There is also a function di2multi that will collapse any branches less than a tolerance level to a polytomy.
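A quick round trip through these functions (assuming ape is loaded; the tree string is made up):

```r
library(ape)

poly = read.tree(text = "((a:1,b:1,c:1):1,d:2);") # one trichotomy
is.binary.tree(poly) # FALSE

bin = multi2di(poly) # resolve it, adding a zero-length branch
is.binary.tree(bin) # TRUE

back = di2multi(bin, tol = 1e-8) # collapse branches shorter than tol
is.binary.tree(back) # FALSE again
```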
Saturday, June 30, 2012
Animating SIMMAP trees
It took me quite a while to get this working, but I finally did:
The tree topology is from Rüber et al. (2004). This is what I used to generate the image:
animateSimmap = function(phy, interval = .02, numb = length(phy), name = "animation.gif", ...) {
    if(class(phy) != "multiPhylo") stop("object must be multiPhylo")
    saveGIF(for(i in 1:numb) {
        dev.hold()
        plotSimmap(phy[[i]], ...)
        Sys.sleep(interval)
    }, movie.name = name)
}
So this function takes a multiPhylo object of stochastic character mappings, the time interval between frames (I'm not sure this does anything, because saveGIF may have a fixed interval already), the number of mappings to animate, the name of the file to create, and any arguments you want to pass on to plotSimmap. The mappings shown above are themselves meaningless, because the branch lengths aren't proportional to anything, but they show how the function works.
plotSimmap is from the package phytools, and saveGIF is from the package animation. To use saveGIF, you also need either ImageMagick (which is what I used), GraphicsMagick, or LyX.
Wednesday, May 16, 2012
SIMMAP Trees
Thanks to Liam Revell, we can now produce simulated discrete character mappings within R. The function make.simmap uses ace, from the package ape, to fit the model used for the simulations. Thus, the proportion of simulations in which a character state appears at a given node should be roughly equal to the likelihood of that state as estimated by ace. I have heard this from many people, but I wanted to summarize the actual simulated states myself. To do that, I took a look at the components of a SIMMAP tree. I'll illustrate this with a mock example:
We can take a look at what components this object has:
> names(exampleSimmap)
[1] "edge" "edge.length" "tip.label" "Nnode" "maps"
[6] "mapped.edge"
The first components are the same as any phylo object. So maps and mapped.edge are what make a SIMMAP tree special. Let's take a look (the middle elements removed to save space):
> exampleSimmap$maps
[[1]]
1 0
0.0958356 0.3197055
[[2]]
0 1
0.2371619 0.1659281
[[3]]
1
0.01380231
...
[[17]]
1
0.1164989
[[18]]
1 0
0.1161601 0.1769559
This is precisely what we need! Each element of exampleSimmap$maps represents a single branch, and its values represent the length of time the branch spends in each state, in this case 0 or 1. That means we can simply take each branch's starting state (the name of the first element for that branch), and that is the state at the node where the branch begins. Let's see if we can find this.
> exampleSimmap$edge
[,1] [,2]
[1,] 11 12
[2,] 12 19
[3,] 19 1
[4,] 19 2
[5,] 12 13
[6,] 13 17
[7,] 17 3
[8,] 17 4
[9,] 13 18
[10,] 18 5
[11,] 18 6
[12,] 11 14
[13,] 14 15
[14,] 15 7
[15,] 15 16
[16,] 16 8
[17,] 16 9
[18,] 14 10
The element named edge gives us the starting and ending node for each of the 18 edges in our tree. That means we can use it to figure out which node corresponds to which state. Here I've written a function that takes a SIMMAP tree and returns a named vector where the values are the node states and the names are the node numbers.
mappedNode = function(phy) {
    # phy must be a SIMMAP tree
    nodes = phy$edge[, 1] # the starting node of every edge
    map = sapply(phy$maps, function(x) attr(x, "names")[1]) # the starting state of every branch
    df = unique(data.frame(nodes = nodes, map = map)) # remove repeats, since interior nodes begin multiple branches
    mapping = df$map
    names(mapping) = df$nodes # name the vector with the node numbers
    mapping
}
Now we can run this function over all of the simulated mappings:
mappings = sapply(simmapTrees, function(x) mappedNode(x))
# change the 0 and 1 to numeric
nummaps = as.data.frame(sapply(1:length(simmapTrees), function(x) as.numeric(mappings[,x])))
# make sure the row names correspond to node names
rownames(nummaps) = rownames(mappings)
# get the number of simulated trees with node state of 1
sums = sapply(rownames(nummaps), function(x) sum(nummaps[x,]))
# change that to a frequency
freq = sums/length(simmapTrees)
# plot
plot(exampleTree, label.offset = .05)
nodelabels(pie = freq, cex = .65, node = as.numeric(names(freq)))
tiplabels(pie = discTrait, cex = .65)
Looks fairly reasonable. Now let's estimate the ancestral character using ace:
MLACE = ace(discTrait, exampleTree, type = "discrete", model = "SYM")
plot(exampleTree, label.offset = .05)
nodelabels(pie = 1 - MLACE$lik.anc, cex = .65)
tiplabels(pie = discTrait, cex = .65)
Pretty close, but not identical. Maybe if we do more simulated mappings (I did 1000 here), they will start to look more similar?
Wednesday, May 2, 2012
Regular Expressions
My first experience with regular expressions came from Python for Dummies. It looked particularly relevant to the specific task I was working on, scraping specific bits of information from fishbase. When my advisor, Peter Wainwright, first approached me about this, I didn't know where to begin, so I went to two people with experience in these sorts of tasks: Bob Thomson and Carl Boettiger.
Bob's suggestion was to download each individual fish's HTML file using a short bash script, then use something like Python or Perl to extract the relevant bits. With over 30,000 species, just downloading the HTML files took quite a long time. But with no background (at the time) in Python or Perl, I turned to Carl, who suggested using R. He quickly wrote a package, rfishbase, that allows you to access information from the XML files on fishbase through R. Although the XML files don't have all of the information available on the HTML files, they still have quite a lot.
My reason for this post, though, is a task my lab mate, Patrick, wanted to accomplish using data he accessed with rfishbase. Given a character vector containing information of interest, he wanted to extract all of the reference numbers within it. An example of an element might be something like this:
"Occurs mainly over rocky and muddy bottoms. Uncommon around coral reefs. Usually rests on the bottom (Ref. 9710). Juveniles may be found in shallow water, but adults are usually taken from depths of 70-330 m (Ref. 13442). Reptant and natant decapods were the main food items throughout the year (Ref 59311). Feeds on a wide variety of fishes and invertebrates."
Given this, he would want the numbers 9710, 13442, and 59311. Even in this one example, you can see that they are not always consistent: the first two have a period while the third doesn't. And there are even things like this:
"Common species. Free-living. Assumed to feed on small invertebrates and fish (Ref. 4741, 34024). Feed on small bottom animals (Ref. 35388)."
Notice the extra spaces before the first reference, and the two numbers inside a single set of parentheses. Or this:
"Occurs in various inshore habitats (Ref. 9800). Feeds on benthic invertebrates and fish (Ref. 11889). Also Ref. 43081."
This one doesn't even have parentheses around the last one. So the first thing I did was to find every instance of "Ref" followed by an optional period, any number of spaces, and a run of any number of numbers, commas, and spaces.
ref = regmatches(matches, gregexpr("Ref\\.? *[0-9 ,]*", matches))
where matches is the character vector with all of the information we're looking at (the pattern is modified from an example I found online). Next, I removed all characters other than digits and commas and used strsplit to separate the individual reference numbers.
refs = sapply(ref, function(x) unlist(strsplit(gsub("[^0-9,]", "", x), ",")))
You end up with a list the same length as the original character vector, and every element is a character vector of all of the reference numbers. From here, you can go in and find all of the unique values to find all of the references you need.
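To make the two steps concrete, here they are run on shortened versions of the sentences above (base R only; matches stands in for the real fishbase vector):

```r
matches = c(
    "Usually rests on the bottom (Ref. 9710). Main food items (Ref 59311).",
    "Assumed to feed on small invertebrates and fish (Ref.  4741, 34024)."
)

# Find "Ref", an optional period, spaces, then a run of digits, spaces, and commas
ref = regmatches(matches, gregexpr("Ref\\.? *[0-9 ,]*", matches))

# Keep only digits and commas, then split on commas
refs = sapply(ref, function(x) unlist(strsplit(gsub("[^0-9,]", "", x), ",")))

unique(unlist(refs))
# [1] "9710"  "59311" "4741"  "34024"
```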
Tuesday, May 1, 2012
Geiger Bug
Big thanks to Luke Harmon for supplying me with updated code that fixes a small bug in the geiger package! For those of you who have had issues with fitDiscrete in the past, you can email Luke for a fix.
The issue that arises is when you use either the symmetrical model (model = "SYM") or the all rates different model (model = "ARD"). You will get an error message looking like this after waiting for some time for the likelihood optimization to finish:
Finding the maximum likelihood solution
[0 50 100]
[....................]
Error in getQ(exp(out$par), nRateCats, model) :
You must supply the correct number of rate categories.
But you don't actually control what goes into getQ through the arguments you give to fitDiscrete. So if you encounter this issue using either of the above models, be sure to email Luke to get the updated code. A revamped version of geiger is on the way, so hopefully this issue won't be around for much longer!
The one thing Luke cautioned me about using these two models is that they can both quickly become parameter-rich. The number of rate parameters for the symmetric model is n * (n-1) / 2, while for the all rates different model, the number of rate parameters is n * (n-1) where n is the number of discrete states. For example, if I have five discrete states, I would have ten parameters for the symmetric model and twenty parameters for the all rates different model. I need a lot of data to be estimating so many parameters!
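Those counts are easy to tabulate (base R, no packages):

```r
# Free rate parameters for k discrete states under each model
nrates = function(k) c(SYM = k * (k - 1) / 2, ARD = k * (k - 1))
nrates(5)
# SYM ARD
#  10  20
```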
I did try out his updated code, and it works perfectly fine. It still does take a while (I'm sure the speed depends on the size of your data set, the shape of your phylogeny, your computer's specs...), but I hear that there is already a faster version of the code if you ask Luke for it. It would definitely be interesting to test out how much of a difference there is and what they changed to make it faster.
If all you are trying to estimate is a single rate model (model = "ER", which stands for equal rates), then there is no need to use an updated version of the function. The old version will work just fine. The problem with the original was that the calculation of the rate categories occurred twice: once to get the number of rate categories, and another within getQ to test whether the number of rate categories was correct. So for example, with five discrete states nRateCats in the above will equal ten for the symmetric model. But you will get an error to supply the correct number of rate categories because within getQ, it compares the number of parameters to nRateCats * (nRateCats-1) / 2, which would compare ten to forty-five.
The issue that arises is when you use either the symmetrical model (model = "SYM") or the all rates different model (model = "ARD"). You will get an error message looking like this after waiting for some time for the likelihood optimization to finish:
Finding the maximum likelihood solution
[0 50 100]
[....................]
Error in getQ(exp(out$par), nRateCats, model) :
You must supply the correct number of rate categories.
But you don't actually control what goes into getQ through the arguments you give to fitDiscrete. So if you encounter this issue using either of the above models, be sure to email Luke to get the updated code. A revamped version of geiger is on the way, so hopefully this issue won't be around for much longer!
Thursday, April 19, 2012
Creating Trees
I told Luke Mahler about how I'm interested in practicing analyses in R with my particular group of interest, even though I haven't collected data on more than a few species and I don't have a tree. He told me creating a tree from previously published studies is easy: just import the tree into R in newick format! The ape package in R has a function, read.tree, that does just that.
The tricky part about doing this is creating the newick-format tree itself. It involves a lot of parentheses, commas, and colons:
exampleTree = read.tree(text = "(((A:1, B:1):1, C:2):1, D:3);")
will give you
Since I was typing out the newick-format tree directly within read.tree, I needed to make sure it went into the text argument instead of the file argument. As for the newick format itself, the tip names are followed by the immediately subtending branch length: hence A:1 or D:3. The nodes also need to be provided with branch lengths, which is the :1 following each clade designated in parentheses. Finally, you can't forget the semi-colon at the end. It is very easy to get lost in a sea of parentheses, especially for large trees, so inevitably I needed to go back to make changes. Testing small clades at a time makes this a little easier.
In my particular tree, I had polytomies, which I designated with 0-length internal branches. So if in the above tree, I actually don't have any information about the interrelationships of A, B, and C, I can use this:
polytomyTree = read.tree(text = "(((A:1, B:1):0, C:1):1, D:2);")
to get
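Zero-length branches are only an implicit polytomy; if you want the tree object to contain a true multichotomy, ape's di2multi function collapses branches at or below a tolerance (a sketch, not something I needed for my own tree):

```r
library(ape)

polytomyTree = read.tree(text = "(((A:1, B:1):0, C:1):1, D:2);")
# collapse any branch of length <= tol into a polytomy
multiTree = di2multi(polytomyTree, tol = 1e-08)
polytomyTree$Nnode  # 3 internal nodes in the binary version
multiTree$Nnode     # 2 after the zero-length branch is collapsed
```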
Wednesday, April 11, 2012
RStudio
I was first introduced to RStudio when I took STA141 with Duncan Temple Lang. At the time, I was using a PC, and it made things much simpler. A few of the features that I love about it:
1) Color scheme for scripts. I can see at a glance what's a comment, and the parentheses pop out at you. This was something I envied about Macs when I used to use my PC exclusively. I do wish you could personalize it a little more, but the schemes they have are great. I personally like Cobalt.
2) Balancing parentheses and quotations. Yes, this can get obnoxious if you are tweaking existing code rather than starting from scratch, since if you try to type an end quote, it will be interpreted as another beginning quote. But I find being able to see what's matching your current end parenthesis incredibly helpful. Besides, if you don't like it, you can easily turn it off in the preferences.
3) Everything in one window. This wasn't a step up from my PC version since the normal R console keeps everything in one window, but it's so nice to have everything (script, console, history, help pages, graphics) in one window.
Friday, March 16, 2012
Color-coding
I've been helping some fellow students here at the workshop with R. One of the things I did was create a vector of colors for the tip labels of a tree. The data was a data frame with an ID that roughly corresponds to the tip labels in the first column and the actual species name in the second column. Here is a toy example:
> species
ID species
1 A123 speca
2 SA123 speca
3 c123 specb
However, the tip labels of the tree don't exactly correspond to the IDs:
> tree$tip.label
[1] "SA123" "a123" "c123"
Time for some regular expressions!
> species[,1] = gsub('^A(.*)', 'a\\1', species[,1])
> species[,1]
[1] "a123" "SA123" "c123"
I'll break down the gsub call a little bit. The first argument to gsub is the pattern. The ^ anchors the match at the beginning of the string, and it is followed by a literal A. The parentheses capture a group, and since this is the first set of parentheses, it is group 1. Inside group 1 is .*: the . matches any single character, and the * means zero or more of them. So this pattern is looking for a capital A at the beginning of a string, followed by anything (or nothing).
The second argument is the replacement; the sub in gsub stands for substitution. gsub takes everything the pattern matched and replaces it with the replacement, and strings that don't match the pattern are left unchanged. The \\1 refers to captured group 1. The double backslash is the escape: it tells gsub to substitute whatever group 1 captured rather than the literal character 1. So any string starting with a capital A followed by anything (group 1) is replaced by a lowercase a followed by the contents of group 1. The third argument is simply the vector you want to perform this on.
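A quick demonstration of the escaping point and the no-match behavior, using a throwaway string:

```r
# escaped: \\1 substitutes whatever group 1 captured
gsub('^A(.*)', 'a\\1', "A123")  # "a123"
# unescaped: the replacement is just the literal character 1
gsub('^A(.*)', 'a1', "A123")    # "a1"
# strings that don't match the pattern are returned untouched
gsub('^A(.*)', 'a\\1', "c123")  # "c123"
```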
Now that I've gotten the ID to match the tip labels, I'm going to reorder the data frame to match the order of the tip labels.
> rownames(species) = species$ID
> species = species[tree$tip.label,]
> species
ID species
SA123 SA123 speca
a123 a123 speca
c123 c123 specb
Now, all I have to do is change species$species into a factor and convert it to a numeric!
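That last step looks something like this — the factor-to-numeric conversion is base R, and the tiplabels call at the end is ape's (shown commented out since it needs the tree loaded):

```r
species = data.frame(ID = c("SA123", "a123", "c123"),
                     species = c("speca", "speca", "specb"))
# each species level becomes an integer, which plot() treats as a color
tipColors = as.numeric(as.factor(species$species))
tipColors  # 1 1 2 -- one color per tip, in tip-label order
# with the tree loaded: plot(tree); tiplabels(pch = 19, col = tipColors)
```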
Rerooting Trees
One of the things I did for today's presentation was to obtain a consensus tree from MrBayes. Since MrBayes estimates unrooted trees, I get to choose the proper outgroup with which to root my tree. You can do this in programs like FigTree, but this is an R blog, after all. The ape package contains the necessary functions to do this.
> tree = read.nexus("Trees.nex")
I typically like to include multiple outgroups, and the easiest way to specify the outgroup might be by specifying the node number instead of the tip labels. If you plot the tree and the node labels like so
> plot(tree)
> nodelabels()
you can find the number of the node that subtends the outgroup clade. To root, you can use ape's root function.
> rerooted = root(tree, node = 99)
Nice and simple.
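root will also take the outgroup tip labels directly through its outgroup argument, which skips the node-number lookup entirely (a sketch with made-up tip names):

```r
library(ape)

tree = read.tree(text = "((A:1, B:1):1, (C:1, (D:1, E:1):1);")
tree = read.tree(text = "((A:1, B:1):1, (C:1, (D:1, E:1):1):1);")
# reroot on the clade containing the outgroup tips, by name
rerooted = root(tree, outgroup = c("D", "E"), resolve.root = TRUE)
is.rooted(rerooted)  # TRUE
```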
Playing with OUwie
After all, I am an evolutionary biologist. And I just finished my joint presentation with Katie Staab from George Washington University at the Bodega Phylogenetics Workshop. We ran some fun analyses, one of which was playing with a very new package, OUwie, brought to us by Jeremy M. Beaulieu and Brian O'Meara. You can use OUwie to fit models of character evolution under multiple selective regimes.
Possible models:
model=BM1 # a Brownian motion model applied over the entire tree
model=BMS # a different Brownian rate parameter (sigma2) for each regime
model=OU1 # a single peak Ornstein-Uhlenbeck model across the entire tree
model=OUM # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) for each regime
model=OUMV # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) and different Brownian rate parameter (sigma2) for each regime
model=OUMA # a multi-peak Ornstein-Uhlenbeck model with different optima (theta) and different strength of selection parameter (alpha) for each regime
model=OUMVA # a multi-peak Ornstein-Uhlenbeck model with different optima (theta), different Brownian rate parameter (sigma2), and different strength of selection parameter (alpha) for each regime
A great step forward for all of us using comparative methods. To start playing, all we need to specify are a tree, a data frame, and a model. Katie and I had already been playing around with make.simmap from the package phytools, so we had trees with stochastically mapped discrete traits. The data frame needs to be in the format where the first column is the species names, the second column is the discrete character, and the third column is the continuous character. Specifying that can be as easy as this:
> df1 = data.frame(speciesNames, discrete, PC1)
We had multiple mapped trees (as we should), so we took one to go for a spin, starting with the simplest model first:
> bm1Output = OUwie(mouthPsimmap[[1]], df1, model = "BM1", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Finished. Summarizing results.
Finished. Performing diagnostic tests.
Tada! This will give us our log likelihood, AIC score, AICc score, parameter estimates, root estimate, number of iterations executed, parameter standard errors, and eigenvalues. Now we can try it with different models:
> bmsOutput = OUwie(mouthPsimmap[[1]], df1, model = "BMS", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Finished. Summarizing results.
Finished. Performing diagnostic tests.
For this model, though, we need to be sure to iterate over the different mappings, since the fit depends on the selective regimes mapped onto the tree. We should get different results from different trees, as we can see here:
> bmsOutput2 = OUwie(mouthPsimmap[[2]], df1, model = "BMS", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Finished. Summarizing results.
Finished. Performing diagnostic tests.
> bmsOutput$loglik
[1] -30.24609
> bmsOutput2$loglik
[1] -29.44005
We have a multiPhylo of 500 mapped trees, so let's just sample 20 of them to see what's going on.
> sampledTrees = sample(mouthPsimmap, 20)
Now, we can just throw this into an apply to get a list of output for each of these twenty trees.
> bmsSampledOutput = lapply(sampledTrees, function(x) {
+ OUwie(x, df1, model = "BMS", simmap.tree = TRUE)
+ })
This might take a while, but once it's done, everything seems great! We can use some more apply functions to get out particular bits of the data we want.
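For instance, pulling the log likelihood out of every fit is one sapply away. Here's the shape of it with a stand-in list that mimics OUwie's output ($loglik is a real element of the output; the numbers here are invented):

```r
# stand-in for bmsSampledOutput: each element has a $loglik, like an OUwie fit
bmsSampledOutput = list(list(loglik = -30.2),
                        list(loglik = -29.4),
                        list(loglik = -31.0))
logliks = sapply(bmsSampledOutput, function(x) x$loglik)
mean(logliks)   # average fit across the sampled mappings
range(logliks)  # and the spread
```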
Unfortunately, as soon as I add varying alpha into my model, I run into some problems:
> oumaOutput1 = OUwie(mouthPsimmap[[1]], df1, model = "OUMA", simmap.tree = TRUE)
Begin subplex optimization routine -- Starting value: 1
Error in solve.default(V, res) :
system is computationally singular: reciprocal condition number = 1.27328e-103
I've tried a variety of debugging strategies, including changing the optimization algorithm in the OUwie call, but I can't get any model that varies alpha to work. This is probably a consequence of having very little data, and thus no power to estimate so many parameters. A shame, but I would love to try this again on a larger dataset!
Thursday, March 15, 2012
Location Data
For my first post, I'd like to highlight what prompted me to start this blog. I was approached by a fellow student at the Bodega Phylogenetics Workshop 2012 about transforming her data frame into a format she could use. Her data frame consists of two columns, with species names in the first column and location codes in the second. There can be repeats in each of the two columns. Here is a toy example:
> df
species location
1 a 1A
2 b 1A
3 c 1B
4 a 2A
5 c 2B
She wants to turn this data frame into one with each species as a row and each location as a column, where 0 indicates absence and 1 indicates presence. Here is a function that does just that! It takes two vectors, species and location, where the elements of the two vectors correspond to each other in order. Hopefully it's commented well enough to be easy to follow:
transformPresAbs = function(species, location) {
# find the unique species and locations
ulocations = unique(location)
uspecies = unique(species)
# initialize the matrix with the proper dimensions
# first start with a matrix to make specifying the dimensions easier
presAbs = matrix(NA, ncol = length(ulocations), nrow = length(uspecies))
# then name the columns and rows with the locations and species, respectively
colnames(presAbs) = ulocations
rownames(presAbs) = uspecies
for(i in 1:nrow(presAbs)) { # here we're iterating over the rows (species)
presAbs[i, ] = sapply(colnames(presAbs), function(x) {
# here we're iterating over the columns (locations)
if(any(location[species == rownames(presAbs)[i]] == x)) 1 # assign 1 if present
else 0
})
}
# show us what we got!
return(presAbs)
}
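For what it's worth, base R's table can build the same presence/absence matrix in one line; here it is on the toy data:

```r
df = data.frame(species = c("a", "b", "c", "a", "c"),
                location = c("1A", "1A", "1B", "2A", "2B"))
# cross-tabulate species by location, then turn counts into 0/1
presAbs = (table(df$species, df$location) > 0) * 1
presAbs
#   1A 1B 2A 2B
# a  1  0  1  0
# b  1  0  0  0
# c  0  1  0  1
```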