Simulated Nucleotide Datasets for Alignment and Phylogeny Estimation

iSGv2.1.0 Benchmark Datasets
Hosted at: UT-Austin Computational Phylogenetics Datasets

corystrope (at) gmail.com
Manter 403
University of Nebraska
Lincoln, NE

Objective
Our objective is to use a sequence simulator to produce large, highly divergent sequence sets whose evolutionary histories are known, and to use them to test multiple sequence alignment (MSA) and phylogenetic tree reconstruction methods that can handle large datasets. The target level of divergence is an average normalized Hamming distance (ANHD) between sequences that approaches saturation (ANHD = 0.75 for nucleotide sequences), with varying levels of gappiness in the true MSAs.
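As an illustration of the two quantities named above, the following Python sketch computes ANHD and gappiness for a small aligned set. It is not the code used to build the benchmarks. The page does not say how gaps enter the ANHD calculation, so the sketch assumes that a column is skipped for a pair when either sequence has a gap there, and that gappiness is the fraction of gap characters in the alignment. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap character in the aligned sequences


def normalized_hamming(a, b):
    """Fraction of mismatching sites among columns where neither row has a gap.

    Returns None when the two rows share no ungapped column.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # assumption: gapped columns are ignored for this pair
        compared += 1
        if x != y:
            mismatches += 1
    if compared == 0:
        return None
    return mismatches / compared


def anhd(rows):
    """Average normalized Hamming distance over all pairs of aligned rows."""
    distances = []
    for a, b in combinations(rows, 2):
        d = normalized_hamming(a, b)
        if d is not None:
            distances.append(d)
    if not distances:
        raise ValueError("no pair of rows shares an ungapped column")
    return sum(distances) / len(distances)


def gappiness(rows):
    """Fraction of all alignment cells that are gap characters."""
    total_cells = sum(len(r) for r in rows)
    gap_cells = sum(r.count(GAP) for r in rows)
    return gap_cells / total_cells


if __name__ == "__main__":
    toy_alignment = ["ACGT-A", "ACGA-A", "TCGTTA"]  # hypothetical example
    print(round(anhd(toy_alignment), 4))
    print(round(gappiness(toy_alignment), 4))
```

The all-pairs loop is quadratic in the number of sequences, so for alignments with thousands of taxa a random sample of pairs would probably be a more practical estimate. The value 0.75 is the mismatch rate expected between two unrelated sequences when the four bases are equally frequent.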
Methods
In these benchmarks, we simulated non-coding DNA sequences using the same nucleotide frequencies, indel length distributions, and GTR+Gamma parameters as Liu et al. (2009). We simulated the datasets using indel-Seq-Gen v.2.1.0, with the guide tree generated by r8s. These datasets include 20 replicates of 125 distinct model conditions; each model condition is defined by a distribution of gap lengths (short, medium, or long), a probability of indel occurrence, an average root-to-tip tree length, and the number of taxa (5000, 10000, or 25000 [beginning to generate]). Trees on the true MSA were reconstructed using FastTree2.
Dataset Statistics

5000

10000

25000 (TBA)