Incomplete taxon sampling is not a problem for phylogenetic inference

Peer-reviewed Article

Abstract

A major issue in all data collection for molecular phylogenetics is taxon sampling, which refers to the use of data from only a small representative set of species for inferring higher-level evolutionary history. Insufficient taxon sampling is often cited as a significant source of error in phylogenetic studies, and consequently, acquisition of large data sets is advocated. To test this assertion, we have conducted computer simulation studies by using natural collections of evolutionary parameters (rates of evolution, species sampling, and gene lengths) determined from data available in genomic databases. A comparison of the true tree with trees constructed by using taxa subsamples and trees constructed by using all taxa shows that the amount of phylogenetic error per internal branch is similar, a result that holds true for the neighbor-joining, minimum evolution, maximum parsimony, and maximum likelihood methods. Furthermore, our results show that even though trees inferred by using progressively larger taxa subsamples of a real data set become increasingly similar to trees inferred by using the full sample, all inferred trees are equidistant from the true tree in terms of phylogenetic error per internal branch. Our results suggest that longer sequences, rather than extensive sampling, will better improve the accuracy of phylogenetic inference.

Taxon sampling refers to the process of selecting representative taxa for a phylogenetic analysis. Nonexhaustive taxon sampling occurs for a number of reasons. Data may not be available from every extant species because of constraints of time, money, or rarity. In most cases, the number of potential species increases quickly if one is interested in phylogenetic relationships above the level of genus or family. Therefore, it is impractical, if not impossible, to sample every species from clades of interest. Rather, representative species from each clade are chosen and the reconstructed phylogenetic relationships of these species are taken to represent the evolutionary history of their respective clades.

Insufficient taxon sampling is often cited as a major source of error in phylogenetic analysis (e.g., refs. 1–10). However, as expected, the value of increasing the number of sequences (species) in a data set depends on the scope of sampling (11–14). Sampling within a fully framed monophyletic group may improve phylogenetic accuracy, but sampling outside of the group pushes the most recent common ancestor of the new set of taxa back in time and may decrease accuracy (13). Random sampling of additional taxa is thought to decrease, rather than increase, phylogenetic accuracy (12–14).

One reason why increased taxon sampling is thought to improve phylogenetic resolution is that it may counteract the “long branch attraction” problem, where long, unrelated branches may group together erroneously (15, 16). Increased taxon sampling may break long branches and help reduce the average branch length throughout the tree (13, 17–19). However, computer simulation results have been equivocal about the benefit of increased taxon sampling for reducing the long branch problem (11, 12, 19–21). The importance of extensive taxon sampling is already well established for estimating evolutionary parameters (4, 22, 23) and in independent contrasts (24).

There have also been a number of empirical studies on the effect of taxon sampling on phylogenetic inference (2–10). These studies typically begin with a large number of species and then examine the results of analyzing subsamples; most have concluded that phylogenetic trees reconstructed with more taxa are more accurate than those inferred from fewer taxa. These conclusions assume that the phylogeny inferred by using the largest data set available is closest to the true tree, an assumption that is not well established because the “true tree” is not known in empirical studies. At present, these studies appear to have simply demonstrated that topologies reconstructed by using larger subsamples show higher congruence with the full tree. Therefore, this problem is most readily studied by computer simulation because the “true tree” is known. However, previous simulation and theoretical studies (11, 19–21) were often not conducted by subsampling from a large tree, as mentioned above, but rather began with a small number of species and progressively added additional species to long branches in the starting cluster, keeping the subsample tree fixed.

We conducted a simulation study motivated by issues an evolutionary biologist would encounter with real data. We began with a large predetermined phylogeny (as is the case with all empirical studies, the true tree of life having been fixed via evolution) and generated data sets consisting of sampled taxa from the “known” full phylogeny. In our simulations, we examined the problem of taxon sampling by using evolutionary rates, species representations, and gene length parameters for DNA and amino acid sequences derived from molecular sequence databases. In addition, we used model trees based on actual trees published in the literature, rather than an artificial tree created from a theoretical branching process or an artificial clustering scheme, in order to make our simulations an accurate representation of the topologies and distributions of branch lengths found in real data.
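The subsampling design described above can be illustrated with a small sketch. It is not the code used in the study. It assumes trees are written as nested Python tuples of taxon names, and it measures error as the fraction of internal branches (bipartitions) of the pruned true tree that are absent from an inferred tree, which is one reasonable reading of "phylogenetic error per internal branch". The six-taxon tree, the taxon names, and the "inferred" tree are hypothetical.

```python
def leaves(tree):
    """Return the set of taxon names at the tips of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def prune(tree, keep):
    """Restrict a tree to the taxa in `keep`, collapsing single-child nodes."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def splits(tree):
    """Return the nontrivial bipartitions of the tree, treated as unrooted.

    Each bipartition is stored as the side that excludes a fixed reference
    taxon, so the two descriptions of one branch compare as equal.
    """
    all_taxa = frozenset(leaves(tree))
    reference = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_taxa - clade
        if len(clade) > 1 and len(other) > 1:
            found.add(other if reference in clade else clade)
        return clade

    visit(tree)
    return found


def error_per_internal_branch(true_tree, inferred_tree):
    """Fraction of true internal branches missing from the inferred tree."""
    true_splits = splits(true_tree)
    if not true_splits:
        return 0.0
    return len(true_splits - splits(inferred_tree)) / len(true_splits)


# Hypothetical "known" full phylogeny with six taxa.
TRUE_TREE = ((("A", "B"), "C"), (("D", "E"), "F"))

# Sample five of the six taxa and prune the true tree to match.
sampled = {"A", "B", "D", "E", "F"}
true_subtree = prune(TRUE_TREE, sampled)

# Hypothetical tree inferred from the sampled taxa; it misplaces taxon E.
inferred_subtree = (("A", "B"), (("D", "F"), "E"))

error = error_per_internal_branch(true_subtree, inferred_subtree)
print(true_subtree)
print(error)
```

Because the error is divided by the number of internal branches in the pruned true tree, values from subsamples of different sizes are on a common scale and can be compared with the value obtained for the full set of taxa.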

Full Citation

Rosenberg, M.S., and S. Kumar (2001) Incomplete taxon sampling is not a problem for phylogenetic inference. Proceedings of the National Academy of Sciences USA 98(19):10751–10756.

353 citations as of 2024-02-19

Evolutionary Biological Data Science Laboratory