Exploring the properties of contrasts regression


1a. OLS has elevated type I error for phylogenetic data

The independent activity was as follows:

  1. Show that OLS for data simulated on the tree results in elevated type I error. How is the type I error rate affected by tree shape? (Hint, use pbtree(...,d=0), pbtree(...,d>0), and rcoal.)

  2. Show that phylogenetic independent contrasts results in the correct type I error.

  3. Show that OLS is unbiased.

1a. OLS has elevated type I error for phylogenetic data

The first task is to demonstrate that OLS has type I error above the nominal level (say, 0.05) when the data arose on a phylogeny.

The way I will do this is first generate 200 random, pure-birth phylogenetic trees. Next, I will simulate the uncorrelated Brownian evolution of two traits (x and y) on each tree. Finally, I will fit a standard OLS bivariate regression model to each dataset. I will estimate the type I error rate as the fraction of fitted models that are significant at the α = 0.05 level.

First, load phytools & dependencies:

require(phytools)
## Loading required package: phytools
## Loading required package: ape
## Loading required package: maps

Now, let's simulate 200 pure-birth trees using pbtree:

N<-200 # number of simulations
trees<-pbtree(n=100,scale=1,nsim=N)
x<-lapply(trees,fastBM)
y<-lapply(trees,fastBM)

Next, we fit each model:

foo<-function(x,y) lm(y~x)
fits<-mapply(foo,x=x,y=y,SIMPLIFY=FALSE)

Finally, let's count the number that are significant at the α = 0.05 level:

p<-sapply(fits,function(x) anova(x)$"Pr(>F)"[1])
## this should be uniform
hist(p,breaks=seq(0,1,by=0.05))

plot of chunk unnamed-chunk-4

## this is our realized type I error rate
mean(p<=0.05)
## [1] 0.48

From this we should see that type I error can be substantially increased if the phylogeny is not taken into account.

1b. Type I error & tree shape

Now that it is clear that ignoring phylogeny can induce substantial type I error what we want to find out is how that increased error rate is related to the shape of the tree.

There are multiple ways to go about doing this. We can either take trees generated under a particular model (say, pure-birth) & induce changes (such as by lengthening terminal edges or shortening internal edges). Alternatively, we could generate trees under different models under which different shapes should result, and then examine the outcome. Let's start with the latter. I generated trees under four different models: constant-rate pure-birth; birth-death with d/b = 0.5; the coalescent; and one final class that I'm going to call the 'slow-down' model. Since I don't have a ready method to generate trees that actually exhibit a slow-down in the rate of speciation through time, I'm going to simulate trees that exhibit a 'slow-down like' shape by generating pure-birth trees, and then dropping a large fraction of tips at random. In all cases I stopped the simulation when the number of extant lineages was 100 and then rescaled the total length of the (reconstructed) phylogeny to be 1.0 unit.

N<-200 # number of simulations
## simulate pure-birth trees
t1<-pbtree(n=100,nsim=N,scale=1)
## function to simulate birth-death trees
foo<-function(b,d){
    t<-NULL
    while(is.null(t))
        t<-pbtree(n=100,b=b,d=d,extant.only=TRUE,quiet=TRUE)
    t
}
t2<-replicate(N,foo(1,0.5),simplify=FALSE) # simulate trees
## function to rescale reconstructed phylogenies
foo<-function(x){
    x$edge.length<-x$edge.length/max(nodeHeights(x))
    x
}
t2<-lapply(t2,foo) # rescale trees
class(t2)<-"multiPhylo"
## simulate coalescent trees
t3<-replicate(N,rcoal(n=100),simplify=FALSE)
t3<-lapply(t3,foo) # rescale trees
class(t3)<-"multiPhylo"
## simulate 'slow-down' trees
t4<-pbtree(n=300,nsim=N)
t4<-lapply(t4,function(x) x<-drop.tip(x,sample(x$tip.label,200)))
t4<-lapply(t4,foo)
class(t4)<-"multiPhylo"

Having simulated trees in this way, I next quantified the difference in shape among these three classes of simulation. I did this using Harvey & Pybus's (2000) γ-statistic, as follows:

## compute gamma for each tree
g1<-sapply(t1,function(x) ltt(x,plot=FALSE)$gamma)  
g2<-sapply(t2,function(x) ltt(x,plot=FALSE)$gamma)
g3<-sapply(t3,function(x) ltt(x,plot=FALSE)$gamma)
g4<-sapply(t4,function(x) ltt(x,plot=FALSE)$gamma)
## estimate the probability density of gamma
d1<-density(g1,bw=0.5)
d2<-density(g2,bw=0.5)
d3<-density(g3,bw=0.5)
d4<-density(g4,bw=0.5)
## plot all three densities
plot(d1$x,d1$y,type="l",xlim=range(c(d1$x,d2$x,d3$x,d4$x)),
    ylim=c(0,1.15*max(c(d1$y,d2$y,d3$y,d4$y))),xlab="gamma",ylab="density",
    main="H&P's gamma as a function of the generating process")
lines(d2$x,d2$y)
lines(d3$x,d3$y)
lines(d4$x,d4$y)
text(x=d1$x[which(d1$y==max(d1$y))],y=max(d1$y),"pure-birth",pos=4,srt=40)
text(x=d2$x[which(d2$y==max(d2$y))],y=max(d2$y),"birth-death",pos=4,srt=40)
text(x=d3$x[which(d3$y==max(d3$y))],y=max(d3$y),"coalescent",pos=4,srt=40)
text(x=d4$x[which(d4$y==max(d4$y))],y=max(d4$y),"slow-down",pos=4,srt=40)

plot of chunk unnamed-chunk-6

OK, what this shows us is that we now have four different sets of trees that differ characteristically in their shape. In general, coalescent trees have more, shorter branches towards the present (positive γ), slow-down trees tends to have the longest terminal branches (negative γ), and pure-birth and birth-death trees are somewhere in the middle.

The question asks if these trees with different characteristic should result in different type I error rates if phylogeny is not incorporated in bivariate regression. To investigate this, I simulated data for x & y on each tree in each dataset. I next fit linear regression models using lm, then I obtained p-values for each fitted model using anova.lm. Finally I compared the type I error rate (the fraction of tests that were significant at the α = 0.05 level) for each set of trees.

## here's a custom function that conducts a set of simulations
## & returns the p-values
typeI<-function(trees){
    x<-lapply(trees,fastBM)
    y<-lapply(trees,fastBM)
    foo<-function(x,y) lm(y~x)
    fits<-mapply(foo,x=x,y=y,SIMPLIFY=FALSE)
    p<-sapply(fits,function(x) anova(x)$"Pr(>F)"[1])
    p
}
p1<-typeI(t1)
p2<-typeI(t2)
p3<-typeI(t3)
p4<-typeI(t4)
par(mfrow=c(2,2))
breaks<-seq(0,1,0.05)
hist(p1,xlab="p",main="pure-birth",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p2,xlab="p",main="birth-death",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p3,xlab="p",main="coalescent",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p4,xlab="p",main="slow-down",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")

plot of chunk unnamed-chunk-7

This shows us that the type I error is high; however we can also compute it directly:

mean(p1<=0.05) # pure-birth tree
## [1] 0.395
mean(p2<=0.05) # birth-death tree
## [1] 0.53
mean(p3<=0.05) # coalescent
## [1] 0.725
mean(p4<=0.05) # slow-down tree
## [1] 0.39

This result should show (on average) that slow-downs (i.e., very long terminal edges) should result in the lowest type I error rate; whereas the coalescent trees (with very tippy shapes & thus high correlation among species) should result in the highest type I error when phylogeny is ignored.

2. Phylogenetic independent contrasts regression results in the correct type I error.

In this part, I'll show that our type I errors go to the nominal level (α = 0.05) when we first compute phylogenetically independent contrasts following Felsenstein (1985), and then fit our regression model without an intercept term to the contrasts rather than to the original data. To do this, I will use the trees generated in the previous exercise. Inconveniently, I didn't keep the simulated datasets, so I will have to conduct new simulations under the null hypothesis of no relationship between x and y.

First, here is an updated version of the custom function for type I error rate estimation, but this one uses the contrasts transformation before analysis:

typeI.pic<-function(trees){
    x<-lapply(trees,fastBM)
    y<-lapply(trees,fastBM)
    pic.x<-mapply(pic,x,trees,SIMPLIFY=FALSE)
    pic.y<-mapply(pic,y,trees,SIMPLIFY=FALSE)
    foo<-function(x,y) lm(y~x-1)
    fits<-mapply(foo,x=pic.x,y=pic.y,SIMPLIFY=FALSE)
    p<-sapply(fits,function(x) anova(x)$"Pr(>F)"[1])
    p
}
p1<-typeI.pic(t1)
p2<-typeI.pic(t2)
p3<-typeI.pic(t3)
p4<-typeI.pic(t4)
par(mfrow=c(2,2))
breaks<-seq(0,1,0.05)
hist(p1,xlab="p",main="pure-birth",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p2,xlab="p",main="birth-death",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p3,xlab="p",main="coalescent",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")
hist(p4,xlab="p",main="slow-down",breaks=breaks)
lines(breaks,rep(0.05*N,length(breaks)),lty="dashed")

plot of chunk unnamed-chunk-9

mean(p1<=0.05) # pure-birth tree
## [1] 0.065
mean(p2<=0.05) # birth-death tree
## [1] 0.07
mean(p3<=0.05) # coalescent
## [1] 0.05
mean(p4<=0.05) # slow-down tree
## [1] 0.04

We see the following in this exercise: (1) PIC gives a uniform distribution for our p-values under the null; and (2) PIC restores the type I error to at or about the nominal level (0.05).

3. OLS is unbiased even for phylogenetic data.

Here, I will show that ordinary least-squares (OLS), in other words - ignoring the phylogeny - will actually give us an unbiased estimate of the regression coefficients even for data that arose on the phylogeny.

If this is true, then that would me that the long-run expectation of our fitted regression coefficient from OLS should converge to the generating value, β. Note that this does not justify using OLS for phylogenetic data, since we already have shown that the type I error rates when phylogeny is ignored can be very high.

To show this, I will first generate 200 pure-birth phylogenies; then, for each of 11 evely spaced values of β between -1 and 1 I will generate 200 datasets and for each one fit the regression model using OLS. I will then compute the mean fitted regression coefficients under OLS to demonstrate that it is an unbiased. For fun, I'll also compute the p-values under OLS, the fitted coefficient using PIC, and the p-values using PIC.

N<-200 # number of simulations
## simulate trees
trees<-pbtree(n=100,nsim=N,scale=1)
beta<-seq(-1,1,by=0.2)
b<-p<-b.pic<-p.pic<-list()
for(i in 1:length(beta)){
    x<-lapply(trees,fastBM)
    foo<-function(tree,x,beta) y<-beta*x+fastBM(tree)
    y<-mapply(foo,tree=trees,x=x,MoreArgs=list(beta=beta[i]),SIMPLIFY=FALSE)
    foo<-function(x,y) lm(y~x)
    fit.ols<-mapply(foo,x,y,SIMPLIFY=FALSE)
    foo<-function(x,y,tree) lm(pic(y,tree)~pic(x,tree)-1)
    fit.pic<-mapply(foo,x,y,trees,SIMPLIFY=FALSE)
    b[[i]]<-sapply(fit.ols,function(x) setNames(coefficients(x)[2],NULL))
    p[[i]]<-sapply(fit.ols,function(x) anova(x)$"Pr(>F)"[1])
    b.pic[[i]]<-sapply(fit.pic,function(x) setNames(coefficients(x)[1],NULL))
    p.pic[[i]]<-sapply(fit.pic,function(x) anova(x)$"Pr(>F)"[1])
}
## compute the mean fitted beta
B<-sapply(b,mean)
B.pic<-sapply(b.pic,mean)
## plot the results
plot(beta,B,pch=22,bg="grey",ylab="mean(fitted beta)")
points(beta,B.pic,pch=21,bg="grey")
legend(-1,0.95,legend=c("OLS","PIC"),pch=c(22,21),pt.bg="grey")
lines(c(-1,1),c(-1,1),lty="dashed")

plot of chunk unnamed-chunk-10

## estimate the power/type I error
power<-sapply(p,function(x) mean(x<=0.05))
power.pic<-sapply(p.pic,function(x) mean(x<=0.05))
## plot the results
plot(beta,power.pic,type="b",lty="dashed",ylab="power/type I error",pch=21,bg="grey")
lines(beta,power,type="b",pch=22,bg="grey")
legend(-1,0.3,legend=c("OLS","PIC"),pch=c(22,21),pt.bg="grey")

plot of chunk unnamed-chunk-10

## compute the SD in beta among simulations
SD<-sapply(b,sd)
SD.pic<-sapply(b.pic,sd)
plot(beta,SD,type="b",ylab="SD(fitted beta)",pch=22,bg="grey",ylim=c(0,max(SD)))
points(beta,SD.pic,type="b",pch=21,bg="grey")
legend(-1,0.05,legend=c("OLS","PIC"),pch=c(22,21),pt.bg="grey")

plot of chunk unnamed-chunk-10

From this analysis we can see: (1) that both OLS & PIC are unbiased; (2) OLS nonetheless results in substantially elevated type I error under the null, and decreased power. This is due to (3) much greater variance in the estimator if phylogeny is ignored.

Written by Liam J. Revell. Last updated July 2, 2015