Cory L. Strope
E-mail: cstrope AT cse DOT unl DOT edu
Office: Bioinformatics lab, N169 Beadle Center
Curriculum Vitae
I am currently a Computer Science Ph.D. candidate specializing in Bioinformatics at the
University of Nebraska-Lincoln. I work in the Bioinformatics Lab.
- Sequence Simulation:
Strope CL, Scott SD, Moriyama EN. 2007.
indel-Seq-Gen: a new protein
family simulator incorporating domains, motifs, and indels. Mol. Biol. Evol. 24:640-649. (preprint PDF file,
Supplementary Files)
Abstract: Reconstructing the evolutionary history of protein sequences will provide a
better understanding of divergence mechanisms of protein superfamilies and their functions.
Long-term protein evolution often includes dynamic changes such as insertion, deletion, and
domain-shuffling. Such dynamic changes make reconstructing protein sequence evolution
difficult and affect the accuracy of molecular evolutionary methods, such as multiple
alignments and phylogenetic methods. Unfortunately, currently available simulation methods are
not sufficiently flexible, and do not allow biologically realistic dynamic protein sequence
evolution. We introduce a new method, indel-Seq-Gen (iSG), that can simulate realistic
evolutionary processes of protein sequences with insertions and deletions (indels). Unlike
other simulation methods, iSG allows the user to simulate multiple subsequences according to
different evolutionary parameters, which is necessary for generating realistic protein
families with multiple domains. iSG tracks all evolutionary events including indels and
outputs the "true" multiple alignment of the simulated sequences. iSG can also generate a
larger sequence space by allowing the use of multiple related root sequences. With all these
functions, iSG can be used to test the accuracy of, e.g., multiple alignment methods,
phylogenetic methods, evolutionary hypotheses, ancestral protein reconstruction methods, and
protein family classification methods. We empirically evaluated the performance of iSG against
currently available methods by simulating the evolution of the G protein-coupled receptor and
lipocalin protein families. We examined their "true" multiple alignments, reconstruction of
the transmembrane regions and beta-strands, and the results of similarity search against a
protein database using the simulated sequences. We also presented an example of using iSG for
examining how phylogenetic reconstruction is affected by high indel rates.
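To illustrate what tracking the "true" alignment means, here is a minimal Python sketch that evolves one sequence along a single branch and records every indel as it happens. It is not the iSG algorithm: the function name `evolve`, the uniform per-site rates, the single-residue indels, and the uniform choice of replacement residues are all simplifying assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def evolve(ancestor, sub_rate, ins_rate, del_rate, rng):
    """Evolve `ancestor` along one branch (hypothetical toy model).

    Returns the two rows of the true pairwise alignment,
    (aligned_ancestor, aligned_descendant), with '-' marking indels.
    """
    anc_row, desc_row = [], []
    for residue in ancestor:
        # Insertion before this site: the ancestor row gets a gap.
        if rng.random() < ins_rate:
            anc_row.append("-")
            desc_row.append(rng.choice(AMINO_ACIDS))
        draw = rng.random()
        anc_row.append(residue)
        if draw < del_rate:
            # Deletion: the descendant row gets a gap.
            desc_row.append("-")
        elif draw < del_rate + sub_rate:
            # Substitution: the new residue is drawn uniformly, so it
            # may occasionally equal the old one.
            desc_row.append(rng.choice(AMINO_ACIDS))
        else:
            desc_row.append(residue)
    return "".join(anc_row), "".join(desc_row)


if __name__ == "__main__":
    rng = random.Random(42)
    anc, desc = evolve("MKTAYIAKQRQISFVK", 0.10, 0.05, 0.05, rng)
    print(anc)
    print(desc)
```

Because the simulator itself writes the gaps, the two returned rows are the correct alignment by construction, which is what makes simulated data useful for benchmarking alignment and phylogenetic methods.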
- Indel Informativeness: Insertions and deletions are believed to be rare
occurrences in the evolutionary history of protein sequences, yet they are ignored in nearly all
protein sequence analyses. Much has been made of indels as "missing" information in protein
sequence sets [9], but with the large genomic and proteomic experiments under way
today, protein datasets are now large enough to pinpoint some indel
events [5]. There has been some research in this area
(see [4,6,7,8,10]), but mainstream applications
continue to ignore indel information. Specifically:
- Objective functions used to optimize multiple alignments are meant to be a mathematical
model of a biological system. By maximizing the mathematical function, a biologically optimal
alignment should be obtained. Indels in objective functions are treated homogeneously
regardless of their position in an alignment. Incorporation of a more meaningful indel
characterization will make for more meaningful multiple alignment objective functions.
- Evolutionary analyses, particularly of large sets of sequences, can use indel
information to resolve events occurring deep in phylogenetic trees.
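The point about homogeneous indel treatment can be made concrete with a small sum-of-pairs sketch in Python. The function `sum_of_pairs`, the match and mismatch scores, and the example penalty values are hypothetical choices for illustration, not a published objective function. The gap penalty is passed in as a function of the column index, so a uniform penalty and a position-dependent one can be compared on the same alignment.

```python
from itertools import combinations


def sum_of_pairs(alignment, gap_penalty, match=1.0, mismatch=-1.0):
    """Score an alignment column by column over all pairs of rows.

    `gap_penalty` is a function of the column index, so the cost of a
    residue-versus-gap pair can depend on where it occurs.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows must have the same length")
    total = 0.0
    for col in range(length):
        column = [row[col] for row in alignment]
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # gap-gap pairs contribute nothing
            if a == "-" or b == "-":
                total -= gap_penalty(col)
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


alignment = ["AC-T", "ACGT", "A-GT"]

# Homogeneous treatment: every gap costs the same.
uniform = sum_of_pairs(alignment, lambda col: 2.0)

# Position-dependent treatment: gaps in column 2 (assumed here to lie
# in a loop-like region) are cheaper than gaps elsewhere.
positional = sum_of_pairs(alignment, lambda col: 0.5 if col == 2 else 2.0)

print(uniform, positional)
```

With the uniform penalty both gapped columns are penalized equally; with the position-dependent penalty the alignment scores higher because the gap in the tolerant column costs less, which is the kind of distinction a more meaningful indel characterization would supply.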
- Subcellular Localization: An important step in the post-genomics era is the
functional characterization of protein sequences. Characterization of protein sequences
provides hints about the function of the protein in the cell, as well as about other sequences
with which the protein may interact. One method of characterizing protein sequences is to
determine the subcellular localization of each protein. Eukaryotic cells have many organelles
that perform certain functions to assist in the overall function of the cell. Because most
cellular functions are carried out by proteins,
predicting the localization of protein sequences within these subcellular compartments will
greatly assist in the functional annotation of the protein sequences. As such, prediction of
the subcellular localization of protein sequences by computational means has been a hot area
of research.
I have been keeping track of many interesting journals related to biology, computer science,
and bioinformatics. One item of note is that many of these journals are for subscribers only.
For these journals, using a computer under the .unl.edu domain is often a good idea. Finally,
it is always a very good idea to perform a search using HighWire Press (the first link). The
advanced search feature allows you to search any subject you are interested in (with a feature
allowing you to sort the hits by date); there is also a very nice feature called the TopicMap,
which allows you to navigate a tree of subjects to find journals that have relevant
information on your preferred subject. I highly recommend that you play around on their site
to find all of the interesting things you can do!
- [1] Grassly, N., Adachi, J., Rambaut, A. (1997) PSeq-Gen: an application for the Monte Carlo simulation of
protein sequence evolution along phylogenetic trees, Bioinformatics, 13, 559-560.
- [2] Benner, S., Cohen, M., Gonnet, G. (1993) Empirical and structural models for insertions and deletions
in the divergent evolution of proteins, J. Mol. Biol., 229, 1065-1082.
- [3] Chang, M.S.S., Benner, S.A. (2004) Empirical analysis of protein insertions and deletions determining
parameters for the correct placement of gaps in protein sequence alignments, J. Mol. Biol., 341, 617-631.
- [4] Giribet, G., Wheeler, W.C. (1999) On gaps, Mol. Phyl. Evol., 13, 132-143.
- [5] Gupta, R.S. (2006) Molecular signatures (unique proteins and conserved
indels) that are specific for the epsilon proteobacteria (Campylobacterales), BMC
Genomics, 7, 167.
- [6] Mitchison, G.J. (1999) A probabilistic treatment of phylogeny and sequence alignment, J.
Mol. Evol., 49, 11-22.
- [7] Simmons, M.P., Ochoterena, H. (2000) Gaps as characters in sequence-based phylogenetic analyses,
Syst. Biol., 49, 369-381.
- [8] Simmons, M.P., Ochoterena, H., Carr, T.G. (2001) Incorporation, relative homoplasy, and effect of
gap characters in sequence-based phylogenetic analyses, Syst. Biol., 50, 454-462.
- [9] Waddell, P.J. (2005) Measuring the fit of sequence data to phylogenetic
model: allowing for missing data, Mol. Biol. Evol., 22, 395-401.
- [10] Young, N.D., Healy, J. (2003) GapCoder automates the use of indel characters in phylogenetic
analysis, BMC Bioinformatics, 4.
cory strope
2007-03-04