indel-Seq-Gen
Developer: Cory L. Strope
corystrope (at) gmail.com
Ricks 318
North Carolina State University
Raleigh, NC


Introduction

indel-Seq-Gen (iSG) is a biological sequence simulation program that simulates highly divergent DNA sequences and protein superfamilies. This is accomplished through the addition of subsequence length constraints and lineage- and site-specific evolution. iSG tracks insertion and deletion processes that occur during the simulation run. iSG records all evolutionary events and outputs the "true" multiple alignment of the sequences, and can generate a larger simulated sequence space by allowing the use of multiple related root sequences. iSG can be used to test the accuracy of multiple alignment methods, evolutionary hypotheses, ancestral protein reconstruction methods, and protein superfamily classification methods. iSG utilizes a highly modified version of the substitution engine from Seq-Gen v1.3.2.

Software

indel-Seq-Gen v2.1.03 source download

Obsolete versions: NOTE (June 8, 2011): Obsolete versions offer only a subset of the functionality of the current release. When testing new sequence simulation software, please compare only against the newest version of iSG!

Version 2.1.0
Version 2.0.8
Version 2.0.7
Version 2.0.6
Version 2.0.5
Version 2.0.4
Version 2.0.3
Version 2.0
Version 1.0.3
Version 1.0.2
Version 1.0.1
Version 1.0

indel-Seq-Gen is distributed as a gzipped tar archive. To extract a .tar.gz file, type:

gunzip [filename].tar.gz

tar -xvf [filename].tar

indel-Seq-Gen v2.1.03 manual (pdf)

Publications

Anderson, C.L., Strope, C.L., and Moriyama, E.N. "Assessing Multiple Sequence Alignments Using Visual Tools", Bioinformatics - Trends and Methodologies, Editor Mahmood A. Mahdavi, InTech, 2011. ISBN: 978-953-307-282-1.

Anderson, C.L., Strope, C.L., and Moriyama, E.N. (2011) SuiteMSA: visual tools for multiple sequence alignment comparison and molecular sequence simulation. BMC Bioinformatics, 12:184 PMID: 21600033

Strope, C.L., Abel, K., Scott, S.D., and Moriyama, E.N. (2009) Biological sequence simulation for testing complex evolutionary hypotheses: indel-Seq-Gen version 2.0. Mol. Biol. Evol. 26: 2581-2593 (Advance Access published in 08/2009; doi: 10.1093/molbev/msp174) (PMCID: 2760465)

Strope, C.L., Scott, S.D., and Moriyama, E.N. (2007) indel-Seq-Gen: a new protein family simulator incorporating domains, motifs, and indels. Mol. Biol. Evol. 24: 640-649 (Advance Access published in 12/2006; doi: 10.1093/molbev/msl195) PMID: 17158778, Preprint PDF, Supplementary files

Version History

iSG version 2.1.03 (12-23-2010)

Fixed an issue under the Gillespie model in which only subsets of sites were tested for substitutions. This was especially a problem when stationary frequencies changed between lineages.

Small error checking change to inTree for lineage file input.

Added checks so that an input root sequence with too few characters to satisfy the template or motif constraints is padded with random sequence until the minimum requirements are met.

Bug fix: Fixed issues with template constraints not being upheld.

Bug fix: Out-of-bounds exception fixed for root sequence inputs that begin with a '-'.

Bug fix: Fixed issue with motif sites that were not inheriting the substitution constraints from their ancestral sequences.

Bug fix: Deferred the creation of the Sequence structure (and corresponding Site structures) until after all '.' characters in the template are removed from the input root sequence. This fixes a segmentation fault.

Added new functionality to the branch scaling option (-b). If the value is negative, iSG scales the tree so that the average root-to-tip path length equals the absolute value of the input. For example, -b -5 averages the lengths of all root-to-tip paths and scales every branch so that the average root-to-tip path length is 5 substitutions per site.
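The rescaling described for a negative -b value can be sketched in a few lines. This is only an illustration of the arithmetic, not iSG's C++ code: the Node class and the function names are hypothetical, and the tree is built by hand rather than parsed from a guide tree file.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node; `length` is the branch leading to this node."""
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)


def root_to_tip_lengths(node, so_far=0.0):
    """Return the path length from the root to every tip at or below `node`."""
    total = so_far + node.length
    if not node.children:
        return [total]
    lengths = []
    for child in node.children:
        lengths.extend(root_to_tip_lengths(child, total))
    return lengths


def scale_branches(node, factor):
    """Multiply every branch length in the subtree by `factor`."""
    node.length *= factor
    for child in node.children:
        scale_branches(child, factor)


def scale_to_mean_path(root, target):
    """Scale all branches so the mean root-to-tip path length equals `target`."""
    paths = root_to_tip_lengths(root)
    mean = sum(paths) / len(paths)
    if mean <= 0:
        raise ValueError("tree has no positive root-to-tip path length")
    scale_branches(root, target / mean)


# Toy tree ((A:1,B:3):1,C:2): root-to-tip paths are 2, 4 and 2 (mean 8/3).
tree = Node(0.0, [Node(1.0, [Node(1.0), Node(3.0)]), Node(2.0)])
scale_to_mean_path(tree, 5.0)  # analogous to "-b -5"
```

After the call, every branch has been multiplied by the same factor (5 divided by the original mean), so relative branch lengths are preserved while the mean root-to-tip path length becomes 5.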

iSG version 2.1.0 (10-8-2010)

Fixed issue with the output in the .anc_tree file when branch lengths are scaled (both globally and on partitions).

Fixed problem in which an input MSA caused iSG to spin.

Added more informative message when frequency file contains a trailing comma (iSG thinks that there are 21 frequencies, even if there are not).

Added warning when a simulation produces all empty columns. If the trace file is not being created, iSG removes the empty columns from the output.

Changed the separator for indel length distribution files in the guide tree file from '/' to ':' to allow for the indel length distribution files to be absolute paths.

iSG version 2.0.8 (5-5-2010)

Added error-checking for indel specifications in tree- and lineage-files.

Added an error check for unrooted trees. If an unrooted tree is entered, iSG exits with an error message indicating the location of the trifurcation found.

Added an option (-T --perturbTree {double}) that randomly rescales each branch length by a factor in the range [1/{double}, {double}], where {double} is the input value. In multi-set simulations, each simulation run receives a different perturbation.
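A minimal sketch of this kind of branch perturbation follows. The distribution iSG draws the factor from is not stated here, so the sketch assumes a log-uniform draw, which makes shrinking and stretching by the same ratio equally likely. The function name and the flat list of branch lengths are hypothetical.

```python
import math
import random


def perturb_branch_lengths(branch_lengths, bound, rng=None):
    """Rescale each branch by an independent factor in [1/bound, bound].

    Assumption: the factor is log-uniform; iSG's actual choice may differ.
    """
    if bound < 1.0:
        raise ValueError("bound must be >= 1")
    rng = rng or random.Random()
    log_bound = math.log(bound)
    return [
        length * math.exp(rng.uniform(-log_bound, log_bound))
        for length in branch_lengths
    ]


# Each call draws new factors, so repeated runs perturb the tree differently.
original = [0.10, 0.25, 0.05]
perturbed = perturb_branch_lengths(original, 2.0)
```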

Fixed a bug that prevented the first position in a partition from being invariable when specified in the quaternary invariable array (array values '1' and '3').

Fixed an error in which lineages were adjusted only when a motif with the same subtree name existed.

Fixed a segmentation fault-causing error in runs where both (i) random sequences without PROSITE motifs and (ii) input root sequences with motifs were specified.

iSG version 2.0.7 (4-27-2010):

Added option "ran" for the indel fill model (option -u).

Corrected the MSA length for phylip output.

iSG version 2.0.6 (4-16-2010):

Fixed a segmentation fault that appeared when the insertion and deletion probabilities were specified, but the length distribution was left empty (specifies that the Chang & Benner length dist files should be used).

iSG version 2.0.5 (4-6-2010):

Made example files and manual consistent.

Fixed Phylip and Nexus output formats.

Added support for both big endian and little endian architectures.

Added checks for the existence of files specified in the -k option (lineage files).

indel-Seq-Gen version 2.0.4:

Fixed a problem in which relative times greater than 1 were reported for indel events in the Gillespie implementation.

Added command-line variable that allows users to flag which post-simulation files to output. (.root, .seq, .ma, .anc_tree, .trace, .verb)

indel-Seq-Gen version 2.0.3:

Fixed bug that did not suppress stop codons from appearing in coding sequences. (Thanks, WS)

Output the original guide tree to the end of the .trace file. Needed for GUI.

The Chang and Benner model of indel creation has no meaningful representation in the Gillespie formulation of indel creation. Set the C&B model using Gillespie to be the same as iSGv1.0. Since the indel rate is 50/50, this should not change much in the end result. Fixed a problem with Chang and Benner model for des and trs that caused under-gapping. (Thanks, WS)

Removed support for the .verb file.

Extensive updates to the user manual.

Hyper-substitution issue when using the Chang and Benner indel model fixed. (Thanks, WS)

Fixed a segmentation fault that occurred when an x(0,y) placed a dash at the end of the sequence.

Added a PROSITE motif library for use with the random sequence option. Option '-1' with a floating point number in (0,1] activates motif placement, where the number entered is the maximum proportion of the root sequence that may be covered by motifs. iSG chooses motifs from the library at random, with replacement, and places them randomly on the sequence, as long as the total proportion of motif-covered sites does not exceed the desired proportion. iSG continues to place motifs until the desired proportion is met or until 20 randomly chosen motifs cannot be placed because of the root sequence motif proportion restriction.
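The placement loop can be sketched roughly as follows. Several details are assumptions made for illustration rather than iSG's documented behavior: motifs are represented only by their lengths, placed motifs may not overlap, any failed attempt counts toward the limit, and the limit of 20 is read as 20 consecutive failures. The function name is hypothetical.

```python
import random


def place_motifs(seq_length, motif_lengths, max_proportion,
                 max_failures=20, rng=None):
    """Randomly place motifs until the coverage budget or failure limit is hit.

    Returns a list of (start, length) placements. Overlap rejection and the
    "consecutive failures" reading are assumptions of this sketch.
    """
    if not motif_lengths or min(motif_lengths) <= 0:
        raise ValueError("motif lengths must be positive")
    rng = rng or random.Random()
    budget = int(max_proportion * seq_length)  # max number of covered sites
    occupied = [False] * seq_length
    placements, covered, failures = [], 0, 0
    while covered < budget and failures < max_failures:
        length = rng.choice(motif_lengths)  # sampling with replacement
        if length > seq_length or covered + length > budget:
            failures += 1  # would exceed the allowed proportion
            continue
        start = rng.randrange(seq_length - length + 1)
        if any(occupied[start:start + length]):
            failures += 1  # collides with an earlier motif
            continue
        occupied[start:start + length] = [True] * length
        placements.append((start, length))
        covered += length
        failures = 0
    return placements


# Cover at most 30% of a 100-residue root sequence with motifs.
placements = place_motifs(100, [5, 8, 12], 0.3)
```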

Added support for PROSITE motif positions of the form "[...](#)" and "{...}(#)".

Added Gillespie representation of indel formation, using option "-j gil". Sequence runs using the Gillespie algorithm are analogous to the TRS simulation scheme, but are much faster.
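For readers unfamiliar with the Gillespie algorithm, the sketch below shows the general idea applied to indels on a single branch: draw an exponential waiting time from the total event rate, then choose the event type in proportion to its rate. It is not iSG's implementation. The rate parameterization (insertions possible at L+1 positions, deletions at L sites), the single-residue indel size, and the function name are all simplifying assumptions.

```python
import random


def gillespie_indel_events(branch_length, seq_length, ins_rate, del_rate,
                           rng=None):
    """Simulate indel events along one branch with the Gillespie algorithm.

    Returns (relative_time, kind) tuples, where relative_time lies in [0, 1).
    Every event changes the sequence length by exactly one residue.
    """
    rng = rng or random.Random()
    time = 0.0
    events = []
    while True:
        total_ins = ins_rate * (seq_length + 1)
        total_del = del_rate * seq_length
        total_rate = total_ins + total_del
        if total_rate <= 0:
            break  # no event can occur
        time += rng.expovariate(total_rate)  # waiting time to the next event
        if time >= branch_length:
            break  # the next event would fall beyond the end of the branch
        if rng.random() < total_ins / total_rate:
            kind = "insertion"
            seq_length += 1
        else:
            kind = "deletion"
            seq_length -= 1
        events.append((time / branch_length, kind))
    return events


events = gillespie_indel_events(1.0, 100, 0.05, 0.05)
```

Because the waiting time is drawn once per event rather than once per small time step, the number of random draws scales with the number of events instead of the number of steps, which is the usual reason this formulation runs faster.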

indel-Seq-Gen version 2.0:

Includes protein, coding DNA, and non-coding DNA simulation

Incorporates PROSITE-like regular expressions for motif conservation
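As a rough illustration of how PROSITE-style patterns relate to ordinary regular expressions, the sketch below converts a subset of the PROSITE syntax (x, [...], {...}, and the (n) and (n,m) repeat counts) into a Python regex. It is not iSG's parser; the function name is hypothetical, and anchors and other less common PROSITE features are ignored.

```python
import re


def prosite_to_regex(pattern):
    """Convert a subset of PROSITE pattern syntax into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split the element into its core and an optional (n) or (n,m) count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError("empty pattern element")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        elif not (core.startswith("[") and core.endswith("]")):
            core = re.escape(core)          # a single fixed residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)                                   # [AC].V.{4}[^ED]
print(bool(re.fullmatch(regex, "AGVKLMNQ")))   # True
```

A repeat count with two values, such as x(17,24), becomes the bounded quantifier .{17,24}, which is how a minimum and maximum length can be expressed in this notation.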

Allows the user to impose minimum and maximum lengths on subsequences (e.g., GPCR transmembrane subsequence can be a minimum of 17 and a maximum of 24 amino acids)

Approximates continuous simulation by breaking branch lengths into small, discrete steps.

Logs insertion and deletion events, outputting (i) type of event, (ii) branch of occurrence, (iii) length of event, (iv) relative time of occurrence, and (v) position of the inserted characters (for insertions) or gaps (for deletions) in the true multiple sequence alignment.

Introduced a novel representation of event probability (insertions, deletions, and substitutions) for a constrained sequence.

Fixed a flaw in the modeling of indels (see publication for details).

The Perl script from iSGv1.0 has been converted into ANSI C++, and iSGv2.0 is now packaged using GNU autotools for compatibility.

Full version history
