Reconstruction ancestral characters II, Discrete characters

This tutorial focuses on the estimation of ancestral character states for discretely valued traits using a continuous-time Markov chain model commonly known as the Mk model.

"phytools" has a function for ancestral character estimation under a couple of different models that uses the re-rooting method of Yang (1995). For other models of evolutionary change not covered by this function, users should try Rich Fitzjohn's diversitree package, or the ace function in ape.

OK - let's pull our data from the Anolis ecomorph tree to use in these analyses:

require(phytools)
data(anoletree)
## this is just to pull out the tip states from the
## data object - normally we would read this from file
x<-getStates(anoletree,"tips")
tree<-anoletree
rm(anoletree)
tree

##
## Phylogenetic tree with 82 tips and 81 internal nodes.
##
## Tip labels:
##  Anolis_ahli, Anolis_allogus, Anolis_rubribarbus, Anolis_imias, Anolis_sagrei, Anolis_bremeri, ...
##
## Rooted; includes branch lengths.

##            Anolis_ahli         Anolis_allogus     Anolis_rubribarbus
##                   "TG"                   "TG"                   "TG"
##           Anolis_imias          Anolis_sagrei         Anolis_bremeri
##                   "TG"                   "TG"                   "TG"
## Anolis_quadriocellifer      Anolis_ophiolepis         Anolis_mestrei
##                   "TG"                   "GB"                   "TG"
##           Anolis_jubar      Anolis_homolechis        Anolis_confusus
##                   "TG"                   "TG"                   "TG"
##           Anolis_guafe         Anolis_garmani        Anolis_opalinus
##                   "TG"                   "CG"                   "TC"
##         Anolis_grahami     Anolis_valencienni      Anolis_lineatopus
##                   "TC"                   "Tw"                   "TG"
##       Anolis_evermanni       Anolis_stratulus           Anolis_krugi
##                   "TC"                   "TC"                   "GB"
##      Anolis_pulchellus       Anolis_gundlachi       Anolis_poncensis
##                   "GB"                   "TG"                   "GB"
##           Anolis_cooki    Anolis_cristatellus    Anolis_brevirostris
##                   "TG"                   "TG"                   "Tr"
##        Anolis_caudalis          Anolis_marron        Anolis_websteri
##                   "Tr"                   "Tr"                   "Tr"
##       Anolis_distichus         Anolis_alumina    Anolis_semilineatus
##                   "Tr"                   "GB"                   "GB"
##         Anolis_olssoni       Anolis_insolitus       Anolis_whitemani
##                   "GB"                   "Tw"                   "TG"
##       Anolis_haetianus        Anolis_breslini         Anolis_armouri
##                   "TG"                   "TG"                   "TG"
##         Anolis_cybotes         Anolis_shrevei   Anolis_longitibialis
##                   "TG"                   "TG"                   "TG"
##         Anolis_strahmi        Anolis_marcanoi        Anolis_baleatus
##                   "TG"                   "TG"                   "CG"
##       Anolis_barahonae        Anolis_ricordii         Anolis_cuvieri
##                   "CG"                   "CG"                   "CG"
##   Anolis_altitudinalis        Anolis_oporinus        Anolis_isolepis
##                   "TC"                   "TC"                   "TC"
##        Anolis_allisoni        Anolis_porcatus        Anolis_loysiana
##                   "TC"                   "TC"                   "Tr"
##         Anolis_guazuma        Anolis_placidus        Anolis_sheplani
##                   "Tw"                   "Tw"                   "Tw"
##         Anolis_alayoni     Anolis_angusticeps        Anolis_paternus
##                   "Tw"                   "Tw"                   "Tw"
##       Anolis_alutaceus    Anolis_inexpectatus       Anolis_clivicola
##                   "GB"                   "GB"                   "GB"
##    Anolis_cupeyalensis    Anolis_cyanopleurus         Anolis_alfaroi
##                   "GB"                   "GB"                   "GB"
##      Anolis_macilentus       Anolis_vanidicus        Anolis_baracoae
##                   "GB"                   "GB"                   "CG"
##          Anolis_noblei      Anolis_smallwoodi    Anolis_luteogularis
##                   "CG"                   "CG"                   "CG"
##       Anolis_equestris   Anolis_bahorucoensis Anolis_dolichocephalus
##                   "CG"                   "GB"                   "GB"
##      Anolis_hendersoni     Anolis_darlingtoni        Anolis_aliniger
##                   "GB"                   "Tw"                   "TC"
##      Anolis_singularis    Anolis_chlorocyanus     Anolis_coelestinus
##                   "TC"                   "TC"                   "TC"
##        Anolis_occultus
##                   "Tw"

In visual format, these are the data that we have:

plotTree(tree,type="fan",fsize=0.8,ftype="i")

## setEnv=TRUE for this type is experimental. please be patient with bugs

cols<-setNames(palette()[1:length(unique(x))],sort(unique(x)))
tiplabels(pie=to.matrix(x,sort(unique(x))),piecol=cols,cex=0.3)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
    y=-max(nodeHeights(tree)),fsize=0.8)

plot of chunk unnamed-chunk-2

Next, let's fit a single-rate model & reconstruct ancestral states at internal nodes in the tree.

## estimate ancestral states under a ER model
fitER<-ace(x,tree,model="ER",type="discrete")
fitER

##
##     Ancestral Character Estimation
##
## Call: ace(x = x, phy = tree, type = "discrete", model = "ER")
##
##     Log-likelihood: -78.04604
##
## Rate index matrix:
##    CG GB TC TG Tr Tw
## CG  .  1  1  1  1  1
## GB  1  .  1  1  1  1
## TC  1  1  .  1  1  1
## TG  1  1  1  .  1  1
## Tr  1  1  1  1  .  1
## Tw  1  1  1  1  1  .
##
## Parameter estimates:
##  rate index estimate std-err
##           1   0.0231   0.004
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##          CG          GB          TC          TG          Tr          Tw
## 0.018197565 0.202238628 0.042841575 0.428609607 0.004383532 0.303729093

round(fitER$lik.anc,3)

##          CG    GB    TC    TG    Tr    Tw
##  [1,] 0.018 0.202 0.043 0.429 0.004 0.304
##  [2,] 0.008 0.197 0.020 0.501 0.001 0.272
##  [3,] 0.003 0.134 0.020 0.716 0.005 0.122
##  [4,] 0.002 0.042 0.019 0.858 0.002 0.077
##  [5,] 0.000 0.001 0.001 0.995 0.000 0.002
##  [6,] 0.000 0.000 0.000 0.998 0.000 0.001
##  [7,] 0.000 0.000 0.000 1.000 0.000 0.000
##  [8,] 0.000 0.000 0.000 1.000 0.000 0.000
##  [9,] 0.000 0.000 0.000 1.000 0.000 0.000
## [10,] 0.000 0.000 0.000 1.000 0.000 0.000
## [11,] 0.000 0.003 0.000 0.997 0.000 0.000
## [12,] 0.000 0.000 0.000 1.000 0.000 0.000
## [13,] 0.000 0.000 0.000 1.000 0.000 0.000
## [14,] 0.000 0.000 0.000 1.000 0.000 0.000
## [15,] 0.000 0.000 0.000 1.000 0.000 0.000
## [16,] 0.000 0.000 0.000 1.000 0.000 0.000
## [17,] 0.007 0.019 0.084 0.786 0.005 0.099
## [18,] 0.020 0.018 0.314 0.428 0.011 0.210
## [19,] 0.030 0.001 0.945 0.014 0.001 0.007
## [20,] 0.035 0.001 0.945 0.012 0.001 0.006
## [21,] 0.004 0.148 0.036 0.692 0.071 0.049
## [22,] 0.003 0.179 0.057 0.721 0.024 0.017
## [23,] 0.001 0.004 0.979 0.013 0.001 0.001
## [24,] 0.001 0.214 0.010 0.767 0.005 0.003
## [25,] 0.000 0.353 0.003 0.640 0.002 0.001
## [26,] 0.001 0.921 0.002 0.073 0.002 0.002
## [27,] 0.001 0.356 0.003 0.637 0.002 0.001
## [28,] 0.001 0.046 0.003 0.947 0.002 0.001
## [29,] 0.001 0.003 0.001 0.013 0.980 0.002
## [30,] 0.000 0.000 0.000 0.000 1.000 0.000
## [31,] 0.000 0.000 0.000 0.000 1.000 0.000
## [32,] 0.000 0.000 0.000 0.000 1.000 0.000
## [33,] 0.009 0.203 0.018 0.492 0.001 0.276
## [34,] 0.022 0.213 0.013 0.495 0.002 0.255
## [35,] 0.020 0.359 0.014 0.311 0.008 0.287
## [36,] 0.003 0.954 0.002 0.020 0.002 0.019
## [37,] 0.000 0.997 0.000 0.001 0.000 0.001
## [38,] 0.092 0.141 0.012 0.582 0.005 0.168
## [39,] 0.003 0.005 0.001 0.985 0.001 0.005
## [40,] 0.000 0.000 0.000 0.999 0.000 0.000
## [41,] 0.000 0.000 0.000 1.000 0.000 0.000
## [42,] 0.000 0.000 0.000 1.000 0.000 0.000
## [43,] 0.000 0.000 0.000 1.000 0.000 0.000
## [44,] 0.000 0.000 0.000 1.000 0.000 0.000
## [45,] 0.000 0.000 0.000 1.000 0.000 0.000
## [46,] 0.000 0.000 0.000 1.000 0.000 0.000
## [47,] 0.911 0.015 0.005 0.049 0.004 0.017
## [48,] 1.000 0.000 0.000 0.000 0.000 0.000
## [49,] 1.000 0.000 0.000 0.000 0.000 0.000
## [50,] 0.006 0.291 0.042 0.136 0.006 0.519
## [51,] 0.002 0.055 0.071 0.026 0.006 0.840
## [52,] 0.002 0.039 0.118 0.019 0.009 0.812
## [53,] 0.004 0.016 0.653 0.010 0.047 0.270
## [54,] 0.001 0.004 0.927 0.002 0.010 0.056
## [55,] 0.000 0.000 0.999 0.000 0.000 0.001
## [56,] 0.000 0.000 1.000 0.000 0.000 0.000
## [57,] 0.000 0.000 0.995 0.000 0.001 0.003
## [58,] 0.000 0.005 0.006 0.002 0.001 0.985
## [59,] 0.000 0.000 0.000 0.000 0.000 1.000
## [60,] 0.000 0.000 0.000 0.000 0.000 0.999
## [61,] 0.000 0.000 0.000 0.000 0.000 1.000
## [62,] 0.000 0.988 0.001 0.002 0.000 0.008
## [63,] 0.000 1.000 0.000 0.000 0.000 0.000
## [64,] 0.000 1.000 0.000 0.000 0.000 0.000
## [65,] 0.000 1.000 0.000 0.000 0.000 0.000
## [66,] 0.000 1.000 0.000 0.000 0.000 0.000
## [67,] 0.000 1.000 0.000 0.000 0.000 0.000
## [68,] 0.000 1.000 0.000 0.000 0.000 0.000
## [69,] 0.039 0.212 0.096 0.264 0.007 0.381
## [70,] 0.079 0.253 0.212 0.118 0.008 0.329
## [71,] 1.000 0.000 0.000 0.000 0.000 0.000
## [72,] 1.000 0.000 0.000 0.000 0.000 0.000
## [73,] 1.000 0.000 0.000 0.000 0.000 0.000
## [74,] 1.000 0.000 0.000 0.000 0.000 0.000
## [75,] 0.048 0.280 0.278 0.071 0.006 0.317
## [76,] 0.043 0.322 0.238 0.062 0.007 0.328
## [77,] 0.003 0.959 0.013 0.004 0.002 0.018
## [78,] 0.000 0.999 0.000 0.000 0.000 0.000
## [79,] 0.007 0.034 0.907 0.010 0.002 0.039
## [80,] 0.000 0.001 0.997 0.000 0.000 0.001
## [81,] 0.000 0.000 1.000 0.000 0.000 0.000

The element lik.anc gives us the marginal ancestral states, also known as the 'empirical Bayesian posterior probabilities.'

It is fairly straightforward to overlay these posterior probabilities on the tree:

plotTree(tree,type="fan",fsize=0.8,ftype="i")

## setEnv=TRUE for this type is experimental. please be patient with bugs

nodelabels(node=1:tree$Nnode+Ntip(tree),
    pie=fitER$lik.anc,piecol=cols,cex=0.5)
tiplabels(pie=to.matrix(x,sort(unique(x))),piecol=cols,cex=0.3)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
    y=-max(nodeHeights(tree)),fsize=0.8)

plot of chunk unnamed-chunk-4

An alternative approach to the one outline above is to use an MCMC approach to sample character histories from their posterior probability distribution. This is called stochastic character mapping (Huelsenbeck et al. 2003). The model is the same but in this case we get a sample of unambiguous histories for our discrete character's evolution on the tree - rather than a probability distribution for the character at nodes.

For instance, given the data simulated above - we can generate the stochastic character map as follows:

## simulate single stochastic character map using empirical Bayes method
mtree<-make.simmap(tree,x,model="ER")

## make.simmap is sampling character histories conditioned on the transition matrix
## Q =

##             CG          GB          TC          TG          Tr          Tw
## CG -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145
## GB  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145
## TC  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145
## TG  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145
## Tr  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145
## Tw  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723

## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =

##        CG        GB        TC        TG        Tr        Tw
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

## Done.

plotSimmap(mtree,cols,type="fan",fsize=0.8,ftype="i")

## setEnv=TRUE for this type is experimental. please be patient with bugs

add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
    y=-max(nodeHeights(tree)),fsize=0.8)

plot of chunk unnamed-chunk-5

A single stochastic character map does not mean a whole lot in isolation - we need to look at the whole distribution from a sample of stochastic maps. This can be a bit overwhelming. For instance, the following code generates 100 stochastic character maps from our dataset and plots them in a grid:

mtrees<-make.simmap(tree,x,model="ER",nsim=100)

## make.simmap is sampling character histories conditioned on the transition matrix
## Q =

##             CG          GB          TC          TG          Tr          Tw
## CG -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145
## GB  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145  0.02314145
## TC  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145  0.02314145
## TG  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145  0.02314145
## Tr  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723  0.02314145
## Tw  0.02314145  0.02314145  0.02314145  0.02314145  0.02314145 -0.11570723

## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =

##        CG        GB        TC        TG        Tr        Tw
## 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667

## Done.

par(mfrow=c(10,10))
null<-sapply(mtrees,plotSimmap,colors=cols,lwd=1,ftype="off")

plot of chunk unnamed-chunk-6

It's possible to summarize a set of stochastic maps in a much more meaningful way. For instance, we can estimate the number of changes of each type, the proportion of time spent in each state, and the posterior probabilities that each internal node is in each state, under our model. For example:

pd<-describe.simmap(mtrees,plot=FALSE)
pd

## 100 trees with a mapped discrete character with states:
##  CG, GB, TC, TG, Tr, Tw
##
## trees have 23.84 changes between states on average
##
## changes are of the following types:
##      CG,GB CG,TC CG,TG CG,Tr CG,Tw GB,CG GB,TC GB,TG GB,Tr GB,Tw TC,CG
## x->y  0.31  0.29  0.25  0.16  0.32  0.56  0.95  1.29  0.37  1.31  1.27
##      TC,GB TC,TG TC,Tr TC,Tw TG,CG TG,GB TG,TC TG,Tr TG,Tw Tr,CG Tr,GB
## x->y  0.48  0.34  0.65  1.01   0.9  3.11  1.75  0.96  1.57  0.14  0.21
##      Tr,TC Tr,TG Tr,Tw Tw,CG Tw,GB Tw,TC Tw,TG Tw,Tr
## x->y  0.23  0.17  0.19  0.74  1.49  1.48  0.87  0.47
##
## mean total time spent in each state is:
##               CG         GB         TC         TG          Tr         Tw
## raw  13.59430464 46.3032646 32.4980763 68.0843311 12.88762444 32.2993595
## prop  0.06609863  0.2251371  0.1580131  0.3310417  0.06266259  0.1570469
##        total
## raw  205.667
## prop   1.000

plot(pd,fsize=0.6,ftype="i")

plot of chunk unnamed-chunk-7

## now let's plot a random map, and overlay the posterior probabilities
plotSimmap(mtrees[[1]],cols,type="fan",fsize=0.8,ftype="i")

## setEnv=TRUE for this type is experimental. please be patient with bugs

nodelabels(pie=pd$ace,piecol=cols,cex=0.5)
add.simmap.legend(colors=cols,prompt=FALSE,x=0.9*par()$usr[1],
    y=-max(nodeHeights(tree)),fsize=0.8)

plot of chunk unnamed-chunk-8

Finally, since we obtained these inferences under exactly the same model, let's compare the posterior probabilities from stochastic mapping with our marginal ancestral states. In the former case, our probabilities were obtained by sampling from the joint (rather than marginal) probability distribution for the ancestral states.

plot(fitER$lik.anc,pd$ace,xlab="marginal ancestral states",
    ylab="posterior probabilities from stochastic mapping")
lines(c(0,1),c(0,1),lty="dashed",col="red",lwd=2)

plot of chunk unnamed-chunk-9

This tells us that although joint & marginal reconstruction are not the same, the marginal probabilities from joint stochastic sampling and the marginal ancestral states are quite highly correlated - at least in this case study.

Written by Liam J. Revell. Last updated July 3, 2015