Simulated Protein DataBase

Procedure:

The database was generated and analyzed as follows:

Generation of guide trees:
- r8s was used to generate the topology (an ultrametric tree).
- Branch lengths were equalized, then each was randomly adjusted by a factor between 1/4 and 4, and the tree was normalized to a maximum depth of 1.0.
- Appropriate parameters were added to create guide tree files for iSGv2.
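
The branch-length step above can be sketched in Python as follows. This is an illustration only, not the script that was actually used: the `Node` class and function names are hypothetical, and the log-uniform draw of the random factor is an assumption, since the distribution of the factor is not stated here.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Hypothetical minimal tree node; `length` is the branch leading to it."""
    name: str = ""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def branches(node: Node) -> Iterator[Node]:
    """Yield every non-root node, i.e. every branch of the tree."""
    for child in node.children:
        yield child
        yield from branches(child)


def max_depth(node: Node) -> float:
    """Longest root-to-tip path length below `node`."""
    if not node.children:
        return 0.0
    return max(c.length + max_depth(c) for c in node.children)


def randomize_branch_lengths(root: Node, rng: random.Random,
                             low: float = 0.25, high: float = 4.0) -> Node:
    """Equalize, randomly rescale, then normalize branch lengths in place."""
    all_branches = list(branches(root))
    for b in all_branches:
        # Equalize to 1.0, then apply a random factor in [low, high].
        # ASSUMPTION: log-uniform draw, so shrinking and stretching
        # by the same ratio are equally likely.
        factor = math.exp(rng.uniform(math.log(low), math.log(high)))
        b.length = 1.0 * factor
    depth = max_depth(root)
    for b in all_branches:
        b.length /= depth  # the deepest tip now sits at distance 1.0
    return root


# Example: a four-taxon topology ((A,B),(C,D)).
tree = Node("root", children=[
    Node("AB", children=[Node("A"), Node("B")]),
    Node("CD", children=[Node("C"), Node("D")]),
])
randomize_branch_lengths(tree, random.Random(42))
```

Note that after the random adjustment the tree is in general no longer ultrametric; only the deepest root-to-tip path has length exactly 1.0.
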
iSGv2 simulation:
- 30 replicates were performed for each parameter set.
- Parameters include:
  - Indel rate (0.0001 - 0.05)
  - Branch length scale (1x - 5x) [corresponding to substitution rates]
  - Gamma-distributed rate heterogeneity (none, or alpha = 0.1 - 0.5)
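
As a rough sketch of how the parameter sets and replicates combine, the enumeration below crosses the three parameters and attaches 30 replicates to each combination. Only the endpoints of each range are given above, so the lists contain just those endpoints as placeholders; the intermediate values actually simulated are not stated here, and all names are hypothetical. No iSGv2 command line is shown because its options are not described in this document.

```python
from itertools import product

# PLACEHOLDERS: only the range endpoints stated above are listed.
INDEL_RATES = [0.0001, 0.05]
BRANCH_SCALES = [1, 5]
GAMMA_ALPHAS = [None, 0.1, 0.5]  # None means no gamma rate heterogeneity
N_REPLICATES = 30


def parameter_sets():
    """Yield one dict per combination of the three simulation parameters."""
    for indel, scale, alpha in product(INDEL_RATES, BRANCH_SCALES, GAMMA_ALPHAS):
        yield {"indel_rate": indel, "branch_scale": scale, "gamma_alpha": alpha}


def runs():
    """Yield one dict per (parameter set, replicate) simulation run."""
    for params in parameter_sets():
        for replicate in range(1, N_REPLICATES + 1):
            yield {**params, "replicate": replicate}
```
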
iSGv2 simulation outputs:
- MSA file
- Sequence file
- Tree used for simulation (including the scaled branch lengths)
Analysis of results:
- Maximum and minimum sequence lengths as well as the average alignment length were calculated for each of the 30 replicates.
- The average percentage of gaps was also calculated for each of the 30 replicates.
- Maximum likelihood phylogenies were reconstructed using FastTree 2.1.4 based on the true MSAs, and missing branch rates were calculated between the guide tree and the FastTree tree using spruce 1.1.
- MSAs were also generated with MAFFT v6.864 for the first 6 replicates; FastTree trees were reconstructed from these MAFFT MSAs, and missing branch rates were calculated between the guide tree and these trees.
- The average, maximum, and minimum p-distances were calculated from all pairwise comparisons. The p-distance is the uncorrected number of differences between two aligned sequences divided by the number of alignment positions compared, excluding gap sites.
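
The per-alignment statistics above (sequence lengths, percent gaps, and p-distances) can be sketched for a single MSA as follows. This is a minimal illustration under stated assumptions, not the original analysis code: the function names are hypothetical, "-" is assumed to be the gap character, and "excluding gap sites" is assumed to mean pairwise deletion (a column is skipped for a pair when either sequence has a gap there).

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Optional

GAP = "-"  # ASSUMPTION: gap character used in the MSA files


def p_distance(a: str, b: str) -> Optional[float]:
    """Uncorrected differences divided by the number of compared sites."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = differences = 0
    for x, y in zip(a, b):
        if x == GAP or y == GAP:
            continue  # ASSUMPTION: pairwise deletion of gap sites
        compared += 1
        differences += x != y
    return differences / compared if compared else None


def summarize_msa(msa: Dict[str, str]) -> Dict[str, float]:
    """Length, gap, and p-distance statistics for one alignment."""
    rows = list(msa.values())
    alignment_length = len(rows[0])
    ungapped = [len(r) - r.count(GAP) for r in rows]
    gap_cells = sum(r.count(GAP) for r in rows)
    distances = [p_distance(a, b) for a, b in combinations(rows, 2)]
    distances = [d for d in distances if d is not None]
    return {
        "alignment_length": alignment_length,
        "min_seq_length": min(ungapped),
        "max_seq_length": max(ungapped),
        "percent_gaps": 100.0 * gap_cells / (alignment_length * len(rows)),
        "mean_p": mean(distances),
        "min_p": min(distances),
        "max_p": max(distances),
    }


# Toy alignment, for illustration only (not data from this database).
toy = {"a": "MKT-LV", "b": "MKSALV", "c": "M-TALI"}
stats = summarize_msa(toy)
```

Averaging these per-alignment values over the 30 replicates of a parameter set would give the per-set summaries described above.
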