Computing Adventures and Phylogenetics: Linear Discriminant Analysis

This follows fairly naturally from the PCA I did on the data in my previous post. I have a dataset of several quantitative variables that can be grouped by a categorical variable. This time, I am going to maximize the separation between the two groups to see what traits are important in determining group inclusion.

While looking for information about linear discriminant analyses (LDA), I came across a very informative site by Dr. Avril Coghlan. I used a couple of the functions available from that website, including groupStandardise and calcWithinGroupsVariance, to obtain meaningful coefficients to determine what traits are more informative about group membership.

The lda() function is in the MASS package. Since I only have two groups, I only have one discriminant axis, as the number of discriminant axes is equal to the number of groups minus one. Thus I won't get a nice scatterplot the way I did with my PCA. Instead, if I try to plot the output to lda(), I get this:

I wanted a stacked histogram instead, so I had to do a little more work.

cats = levels(dat$Type) # the categories I'm using

histData = sapply(cats, function(x) {

hist(scores$x[which(dat$Type == x)],

breaks = seq(-3, 2.2, .2))$counts

# bad me, I hard-coded the breaks. My next project can be to use the same breaks as the above plot

})

# this gives me a matrix of the counts for each interval for each type

allhistData = do.call(rbind, list(histData))

barplot(t(allhistData),

space = 0,

ylab = "number of sequences",

xlab = "LD")

axis(side = 1,

at = c(-1, 4, 9, 14, 19, 24),

labels = c(-3, -2, -1, 0, 1, 2)) # more bad hard-coding

legend("topleft", legend = cats, text.col = c("black", "gray"))

And here is what I get:

It doesn't look great, but when did stacked histograms ever look good? Probably better if I have a lot more observations.

Computing Adventures and Phylogenetics

Thursday, January 10, 2013

Linear Discriminant Analysis

No comments:

Post a Comment