Our objective is to use a sequence simulator to produce large, highly divergent sequence sets whose evolutionary histories are known, and to use these sets to test MSA and phylogenetic tree reconstruction methods that can handle large datasets. The level of sequence divergence we target is an average normalized Hamming distance (ANHD) between sequences that approaches saturation, i.e., ANHD = 0.75 for nucleotide sequences, with varying levels of gappiness in the true MSAs.
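As an illustration of the two quantities named above, the following minimal Python sketch computes ANHD and gappiness for a true MSA held as a list of equal-length strings. The function names and the gap convention are assumptions made for this example, not part of the original pipeline: a column is skipped for a given pair when either sequence has a gap there, each pairwise distance is normalized by the number of sites actually compared, and gappiness is taken as the fraction of alignment cells that are gaps.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def normalized_hamming(a, b):
    """Fraction of mismatches among columns where neither sequence has a gap.

    Returns None when the two rows share no ungapped column.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # assumption: gapped sites are ignored for this pair
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        return None
    return mismatches / compared


def anhd(alignment):
    """Mean normalized Hamming distance over all pairs that share a site."""
    distances = []
    for a, b in combinations(alignment, 2):
        d = normalized_hamming(a, b)
        if d is not None:
            distances.append(d)
    if not distances:
        raise ValueError("no pair of sequences shares an ungapped column")
    return sum(distances) / len(distances)


def gappiness(alignment):
    """Fraction of alignment cells that are gap characters."""
    cells = sum(len(row) for row in alignment)
    gaps = sum(row.count(GAP) for row in alignment)
    return gaps / cells


if __name__ == "__main__":
    toy_msa = ["ACGT-A", "ACGA-A", "TC-TGA"]  # hypothetical toy alignment
    print(f"ANHD      = {anhd(toy_msa):.4f}")
    print(f"gappiness = {gappiness(toy_msa):.4f}")
```

The all-pairs loop is quadratic in the number of sequences, so for datasets with thousands of taxa one would likely estimate ANHD from a random sample of pairs instead.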
In these benchmarks, we simulated non-coding DNA sequences using the same nucleotide frequencies, indel length distributions, and GTR+Gamma parameters as Liu et al. (2009). The datasets were generated with indel-Seq-Gen v.2.1.0, with the guide tree generated by r8s. They comprise 20 replicates of 125 distinct model conditions; each model condition is defined by a gap length distribution (short, medium, or long), a probability of indel occurrence, an average root-to-tip tree length, and a number of taxa (5000, 10000, or 25000 [beginning to generate]). Trees on the true MSA were reconstructed using FastTree2.