Phylogenetically independent contrasts

In this tutorial we will learn to apply the independent contrasts method of Felsenstein (1985) for estimating the evolutionary correlation between characters.

Felsenstein (1985) identified (and, to a large extent, solved) a problem that had previously recognized, but that was underappreciated: and that is, that the data for species cannot be treated as independent data points from the point of view of statistical analysis.

To learn the contrasts method, we'll first need to learn some basics in linear model fitting in the R language.

## linear regression model is y = b0 + b*x + e
b0<-5.3
b<-0.75
x<-rnorm(n=100,sd=1)
y<-b0+b*x+rnorm(n=100,sd=0.2)
plot(x,y)
fit<-lm(y~x)
abline(fit)

plot of chunk unnamed-chunk-1

fit

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      5.3321       0.7377

summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52591 -0.12877 -0.01271  0.12295  0.57436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.33205    0.02129  250.39   <2e-16 ***
## x            0.73766    0.02103   35.08   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2129 on 98 degrees of freedom
## Multiple R-squared:  0.9263, Adjusted R-squared:  0.9255 
## F-statistic:  1231 on 1 and 98 DF,  p-value: < 2.2e-16

anova(fit)

## Analysis of Variance Table
## 
## Response: y
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## x          1 55.804  55.804  1230.9 < 2.2e-16 ***
## Residuals 98  4.443   0.045                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## fit is an object of class "lm"
class(fit)

## [1] "lm"

## coefficients of the fitted model
fit$coefficients

## (Intercept)           x 
##   5.3320548   0.7376578

## p-value of the fitted model
anova(fit)[["Pr(>F)"]][1]

## [1] 2.758908e-57

We can see that when the residual error in y given x are normal & independent, hypothesis tests about our fitted model using ANOVA are unbiased & have the correct type I error rate:

foo<-function(){
    x<-rnorm(n=100)
    y<-rnorm(n=100)
    fit<-lm(y~x)
    setNames(c(fit$coefficients[2],anova(fit)[["Pr(>F)"]][1]),c("beta","p"))
}
X<-t(replicate(1000,foo()))
hist(X[,1],xlab="fitted beta",main="fitted beta")

plot of chunk unnamed-chunk-2

hist(X[,2],xlab="p-value",main="p-value")

plot of chunk unnamed-chunk-2

## expected value is the nominal type I error rate, 0.05
mean(X[,2]<=0.05)

## [1] 0.056

This simulation shows very simply that when standard statistical assumptions are satisfied, our parameter estimates are unbiased and type I error is at or around the nominal level.

The difficulty with phylogenetic data is that the observations for y are no longer independent & identically distributed. Let's consider one case:

## load phytools
library(phytools)
## simulate a coalescent shaped tree
tree<-rcoal(n=100)
plotTree(tree,ftype="off")

plot of chunk unnamed-chunk-3

## simulate uncorrelated Brownian evolution
x<-fastBM(tree)
y<-fastBM(tree)
par(mar=c(5.1,4.1,2.1,2.1))
plot(x,y)
fit<-lm(y~x)
abline(fit)

plot of chunk unnamed-chunk-3

fit

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      0.7647       0.5534

summary(fit)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8952 -0.3452 -0.1002  0.2604  1.2855 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.76471    0.08144    9.39 2.56e-15 ***
## x            0.55339    0.02494   22.19  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4964 on 98 degrees of freedom
## Multiple R-squared:  0.834,  Adjusted R-squared:  0.8323 
## F-statistic: 492.2 on 1 and 98 DF,  p-value: < 2.2e-16

anova(fit)

## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x          1 121.269 121.269  492.23 < 2.2e-16 ***
## Residuals 98  24.144   0.246                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see from this example, that it is not difficult for phylogeny to induce a type I error.

Here it is because clusters of closely related taxa have highly similar phenotypes. In other words, they are not independent data points about the evolutionary process for x and y on the tree:

## this is a projection of the tree into morphospace
phylomorphospace(tree,cbind(x,y),label="off",node.size=c(0.5,0.7))
abline(fit)

plot of chunk unnamed-chunk-4

Felsenstein's (1985) algorithm substitutes the contrasts between species (& nodes) for the original values. Let's see if it resolves our type I error in this case:

ix<-pic(x,tree)
iy<-pic(y,tree)
fit<-lm(iy~ix-1) ## we have to fit the model without an intercept term
fit

## 
## Call:
## lm(formula = iy ~ ix - 1)
## 
## Coefficients:
##     ix  
## 0.1785

summary(fit)

## 
## Call:
## lm(formula = iy ~ ix - 1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.79992 -0.78645  0.03339  0.53096  2.27406 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)  
## ix  0.17852    0.08959   1.993   0.0491 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9858 on 98 degrees of freedom
## Multiple R-squared:  0.03894,    Adjusted R-squared:  0.02913 
## F-statistic: 3.971 on 1 and 98 DF,  p-value: 0.04908

anova(fit)

## Analysis of Variance Table
## 
## Response: iy
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## ix         1  3.859  3.8585  3.9706 0.04908 *
## Residuals 98 95.233  0.9718                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Investigators sometimes get the naive impression that the contrasts algorithm 'weakens' the power of their statistical analysis. What we're seeing here is actually a spurious result being re-identified as such - not a loss of power.

Finally, IC is unbiased when we do have an effect:

foo<-function(beta,tree){
    tree<-pbtree(n=100,scale=1)
    x<-fastBM(tree)
    y<-beta*x+fastBM(tree,sig2=0.2)
    ix<-pic(x,tree)
    iy<-pic(y,tree)
    fit<-lm(iy~ix-1)
    fit$coefficients[1]
}
bhat<-replicate(1000,foo(0.75,tree))
hist(bhat,xlab="fitted beta",main="fitted beta")

plot of chunk unnamed-chunk-6

Work on your own:

Do as much as you can of the following exercise on your own. We are here to help!

Show that OLS for data simulated on the tree results in elevated type I error. How is the type I error rate affected by tree shape. (Hint, use pbtree(...,d=0), pbtree(...,d>0), and rcoal.)
Show that phylogenetic independent contrasts results in the correct type I error.
Show that OLS is unbiased.

Written by Liam J. Revell. Last updated July 2, 2015