Simulated Protein Benchmark Database
Large scale protein alignments - Home Page
Generated and maintained by Catherine Anderson
Computer Science, University of Nebraska, Lincoln
anderson@cse.unl.edu
last update: 06-25-2012
  Objective:
To create a large-scale protein alignment benchmark database.

The data are organized into 5000-taxon and 10000-taxon simulations.

  Procedure:
The database was built using the following workflow:
  • Generation of guide trees:
    • r8s was used to generate the topology (an ultrametric tree).
    • Branch lengths were equalized, then randomly adjusted by factors ranging from 1/4 to 4, and normalized to a maximum tree depth of 1.0.
    • Appropriate parameters were added to create guide tree files for iSGv2.
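
As a rough illustration of the branch-length step above, the following sketch equalizes all branches, scales each by a random factor between 1/4 and 4, and rescales the tree so that the deepest tip sits at depth 1.0. The `Node` class, the function names, and the log-uniform choice of random factor are assumptions made for illustration; the page does not say which distribution was used, and this is not the code that produced the database.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the branch leading to it."""
    name: str = ""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def iter_nodes(root):
    """Yield every node of the tree (iterative, so deep trees are safe)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def max_depth(root):
    """Largest root-to-tip path length; the root's own branch is ignored."""
    best = 0.0
    stack = [(child, child.length) for child in root.children]
    while stack:
        node, depth = stack.pop()
        if not node.children:
            best = max(best, depth)
        for child in node.children:
            stack.append((child, depth + child.length))
    return best


def randomize_and_normalize(root, rng, low=0.25, high=4.0):
    """Equalize branches, scale each randomly, then rescale to depth 1.0."""
    for node in iter_nodes(root):
        if node is root:
            continue
        # Assumption: log-uniform factor, so shrinking and stretching
        # by the same ratio are equally likely.
        factor = math.exp(rng.uniform(math.log(low), math.log(high)))
        node.length = 1.0 * factor  # 1.0 is the equalized branch length
    depth = max_depth(root)
    if depth > 0:
        for node in iter_nodes(root):
            if node is not root:
                node.length /= depth
    return root


# Toy four-taxon tree; real guide trees here have 5000 or 10000 tips.
tree = Node(children=[
    Node("A"),
    Node(children=[Node("B"), Node(children=[Node("C"), Node("D")])]),
])
randomize_and_normalize(tree, random.Random(42))
print(round(max_depth(tree), 6))
```

Because every branch is divided by the same depth, the ratio between the longest and shortest branch stays within the 16-fold span implied by the 1/4 to 4 range.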

  • iSGv2 simulation:
    • 30 replicates were performed for each parameter set.
    • Parameters include:
      • Indel rate (0.0001 - 0.05)
      • Branch length scale (1x - 5x) [corresponding to substitution rates]
      • Gamma distribution (none or alpha = 0.1 - 0.5)
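
To make the simulation design concrete, the sketch below enumerates parameter sets and their 30 replicates. Only the ranges and the replicate count come from this page; the individual grid values, the full factorial combination of parameters, and all names in the code are hypothetical placeholders, not the settings actually used.

```python
from itertools import product

# Hypothetical grid points: the page gives only the ranges, not the values.
INDEL_RATES = [0.0001, 0.001, 0.01, 0.05]   # within 0.0001 - 0.05
BRANCH_SCALES = [1, 2, 5]                    # within 1x - 5x
GAMMA_ALPHAS = [None, 0.1, 0.5]              # None = no gamma distribution
N_REPLICATES = 30                            # stated on this page


def parameter_sets():
    """Yield one dict per parameter combination (assumed full factorial)."""
    for indel, scale, alpha in product(INDEL_RATES, BRANCH_SCALES, GAMMA_ALPHAS):
        yield {"indel_rate": indel, "branch_scale": scale, "gamma_alpha": alpha}


def simulation_runs():
    """Yield one dict per replicate of every parameter set."""
    for params in parameter_sets():
        for replicate in range(1, N_REPLICATES + 1):
            yield {**params, "replicate": replicate}


all_runs = list(simulation_runs())
print(len(all_runs), "runs in this illustrative grid")
```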

  • iSGv2 simulation outputs:
    • MSA file
    • Sequence file
    • Tree used for simulation (including the scaled branch lengths)

  • Analysis of results:
    • Maximum and minimum sequence lengths as well as the average alignment length were calculated for each of the 30 replicates.
    • The average percentage of gaps was also calculated for each of the 30 replicates.
    • Maximum likelihood phylogenies were reconstructed from the true MSAs using FastTree 2.1.4, and missing branch rates between the guide tree and the FastTree tree were calculated using Spruce 1.1.
    • MAFFT v.6.864 MSAs were also generated for the first 6 replicates, and missing branch rates were calculated between the guide tree and the FastTree tree built from these MAFFT MSAs.
    • The average, maximum, and minimum p-distances (the number of differing sites divided by the number of alignment positions compared, excluding gap sites) were calculated from all pairwise comparisons.
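
The gap and p-distance statistics above can be sketched as follows. This is a minimal illustration on a toy alignment, assuming that "excluding gap sites" means skipping any column where either sequence of the pair has a gap (pairwise deletion); the function names and the toy data are hypothetical and are not taken from the database.

```python
from itertools import combinations

GAP = "-"


def percent_gaps(msa):
    """Percentage of gap characters over all cells of the alignment."""
    total = sum(len(seq) for seq in msa.values())
    gaps = sum(seq.count(GAP) for seq in msa.values())
    return 100.0 * gaps / total


def p_distance(a, b):
    """Mismatches divided by the columns where neither sequence has a gap."""
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # assumption: pairwise deletion of gap sites
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else float("nan")


def p_distance_summary(msa):
    """Return (average, maximum, minimum) p-distance over all pairs."""
    dists = [p_distance(msa[i], msa[j]) for i, j in combinations(msa, 2)]
    dists = [d for d in dists if d == d]  # drop NaN from pairs with no overlap
    return sum(dists) / len(dists), max(dists), min(dists)


# Toy alignment of three sequences; the real MSAs hold 5000 or 10000 taxa.
toy_msa = {
    "t1": "MKV-LA",
    "t2": "MRV-LA",
    "t3": "MK-ALG",
}
print(percent_gaps(toy_msa))
print(p_distance_summary(toy_msa))
```

With 10000 taxa there are roughly 50 million sequence pairs, so a pure Python loop like this one is meant only to pin down the definition, not to be run at that scale.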


Downloads:
    5000 taxa
    10000 taxa
    r8s
    iSGv2
    FastTree
    Spruce