Hidden Markov models (HMMs)
A profile HMM is a full probabilistic
representation of a sequence profile. A non-probabilistic profile was first described by
Gribskov et al. (1987). Based on a multiple alignment of related protein sequences,
a profile represents amino acid conservation and insertion/deletion information in
a position-specific score matrix (PSSM). A profile includes
more flexible information on a given protein family than a single sequence. Therefore,
database search methods using profiles are more sensitive to remote similarities than
those based on pairwise alignments (e.g., regular BLAST).
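As a minimal illustration of how a PSSM can be built from alignment columns and used for scoring (the pseudocount and uniform background are illustrative choices, not the Gribskov et al. scheme, and position-specific gap information is omitted):

```python
import math

ALPHABET = 'ACDEFGHIKLMNPQRSTVWY'
BACKGROUND = 1.0 / len(ALPHABET)   # uniform background frequency (illustrative)

def pssm_from_alignment(columns, pseudocount=1.0):
    """Build a position-specific log-odds matrix from alignment columns.

    columns: one string per alignment position, listing the residues
    observed in that column of the multiple alignment.
    """
    pssm = []
    for col in columns:
        residues = [c for c in col if c in ALPHABET]
        total = len(residues) + pseudocount * len(ALPHABET)
        pssm.append({a: math.log(((residues.count(a) + pseudocount) / total)
                                 / BACKGROUND)
                     for a in ALPHABET})
    return pssm

def pssm_score(seq, pssm):
    """Sum the position-specific scores over an ungapped sequence."""
    return sum(col[a] for col, a in zip(pssm, seq))
```

A sequence matching the conserved residues at each position receives a positive log-odds score, while an unrelated sequence scores near or below zero.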
Profiles are used, for example, in PROSITE.
Profiles do not have an underlying probabilistic model, and profile search methods
rely on ad hoc scoring schemes. Profile HMMs provide probabilistic models for protein
families (e.g., Eddy 1998).
Information in a multiple alignment is captured by a linear sequence of 'match', 'insert', and 'delete' states.
Each 'match' and 'insert' state is associated with emission probabilities for
amino acids, and states are linked by transition probabilities. Different
profile HMM architectures are adopted by different implementations. The two most popular
are HMMER and SAM.
HMMER is used in the PFAM database of
protein families and HMMs and in
SMART (Simple Modular Architecture Research Tool). SAM is used in
the HMM library and genome assignments server.
We use SAM
(Sequence Alignment and Modeling System) for the profile HMM analysis.
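The match/insert/delete architecture described above can be sketched with a toy Viterbi scorer over a three-position model. All probabilities below are made up for illustration; this is not the SAM or HMMER architecture, and a real model has twenty-letter emission tables and trained parameters:

```python
import math

# Toy 3-position profile HMM over a two-letter alphabet; all numbers
# are illustrative, not from any trained model.
match_emit = [{'A': 0.8, 'G': 0.2},      # P(residue | match state k)
              {'A': 0.1, 'G': 0.9},
              {'A': 0.7, 'G': 0.3}]
insert_emit = {'A': 0.5, 'G': 0.5}       # background emission for insert states
t = {'MM': 0.85, 'MI': 0.10, 'MD': 0.05, # transition probabilities
     'IM': 0.60, 'II': 0.40,
     'DM': 0.70, 'DD': 0.30}

def lg(p):
    return math.log(p) if p > 0 else float('-inf')

def viterbi(seq, match_emit, insert_emit, t):
    """Log-probability of the best state path, ending in the last match state."""
    L, n = len(match_emit), len(seq)
    NEG = float('-inf')
    VM = [[NEG] * (n + 1) for _ in range(L + 1)]   # match scores
    VI = [[NEG] * (n + 1) for _ in range(L + 1)]   # insert scores
    VD = [[NEG] * (n + 1) for _ in range(L + 1)]   # delete scores (silent states)
    VM[0][0] = 0.0                                 # begin state
    for k in range(L + 1):
        for i in range(n + 1):
            if k > 0 and i > 0:   # match state k emits residue i
                VM[k][i] = lg(match_emit[k - 1][seq[i - 1]]) + max(
                    VM[k - 1][i - 1] + lg(t['MM']),
                    VI[k - 1][i - 1] + lg(t['IM']),
                    VD[k - 1][i - 1] + lg(t['DM']))
            if i > 0:             # insert state k emits residue i
                VI[k][i] = lg(insert_emit[seq[i - 1]]) + max(
                    VM[k][i - 1] + lg(t['MI']),
                    VI[k][i - 1] + lg(t['II']))
            if k > 0:             # delete state k skips model position k
                VD[k][i] = max(VM[k - 1][i] + lg(t['MD']),
                               VD[k - 1][i] + lg(t['DD']))
    return VM[L][n]
```

A sequence agreeing with the preferred residues at each match state scores higher than one that does not, and the delete and insert states let sequences shorter or longer than the model still receive finite scores.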
All E-values for SAM are calculated with the same sample size (30,000), so that E-values can be directly
compared between databases. Three E-value thresholds are chosen to provide different
levels of identification: E=0.05 (SAM; the most stringent), E=4.23 (SAM1; based on the highest E-value of
Arabidopsis MLO, MLO3), and E=6.52 (SAM2; the least stringent, based on the minimum error points).
We also include another HMM classifier, GPCRHMM.
GPCRHMM was developed by
Wistrand et al. (2006). The authors constructed a compartmentalized HMM incorporating distinct loop length patterns
and differences in amino acid composition between cytosolic loops, extracellular loops, and membrane regions based
on a diverse set of GPCR sequences. Note that their training set included eleven of the 13 PFAM GPCR protein families;
the two excluded as outliers are the Drosophila odorant receptor family 7tm_6 (PF02949)
and the plant family Mlo (PF03094).
Discriminant function analysis and related methods
In Kim et al. (2000), we described a protein classification method that relies on
neither multiple alignments nor pattern/profile database search. The method uses concise
statistical variables that characterize the quasi-periodic physico-chemical properties of
multi-transmembrane proteins. Using both positive samples (proteins of interest) and negative samples (other proteins),
a nonparametric linear discriminant function was successfully trained with these variables
to discriminate G-protein coupled receptors (GPCRs) from non-GPCRs.
In Moriyama and Kim (2005), we further examined several parametric and nonparametric discrimination
methods, including linear (LDA), quadratic (QDA), and logistic (LOG) discriminant analysis, as well as
the nonparametric K-nearest neighbor method (KNN). All of these discriminant function analysis
methods performed better than PFAM, PROSITE, and PRINTS, especially when applied to
short partial sequences.
We use S-PLUS (Insightful Corporation) with
the MASS library for
the discriminant function analyses: LDA, QDA, LOG, and KNN. For the KNN method, the number of neighbors (K) is chosen
from 5, 10, 15, or 20.
These methods are used for the analysis of complete genomes, and users can explore all of these results.
They are not currently available for user-submitted data analysis because of the S-PLUS license.
We are considering porting them to R (free software) in the future to make them publicly available.
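As a rough sketch of the KNN step (this is not our S-PLUS/MASS code, and amino acid composition is used here only as a stand-in for the quasi-periodic physico-chemical descriptors described above):

```python
from collections import Counter

AA = 'ACDEFGHIKLMNPQRSTVWY'

def aa_composition(seq):
    """20-dimensional vector of amino acid fractions."""
    return [seq.count(a) / len(seq) for a in AA]

def knn_classify(query, training, k=5):
    """Majority vote among the k training sequences nearest to the
    query in composition space (squared Euclidean distance).

    training: list of (sequence, label) pairs, with both positive and
    negative examples, as in the discriminant analyses above.
    """
    q = aa_composition(query)
    neighbors = sorted(
        (sum((x - y) ** 2 for x, y in zip(q, aa_composition(seq))), label)
        for seq, label in training)
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```

The choice of K (5, 10, 15, or 20 in our analyses) trades off sensitivity to local structure against robustness to noisy training points.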
Support Vector Machines (SVMs)
SVMs are learning machines that make binary classifications based on a hyperplane
separating a remapped instance space (e.g.,
Burges 1998). Kernel functions are chosen so that the remapped instances in
a multidimensional space are linearly separable. Similar to discriminant analysis, multiple variables
extracted from protein sequences can be used, and training is done using both positive and negative samples.
In Strope and Moriyama (2007), we compared the performance of various SVM classifiers and
profile HMMs. We showed that SVMs using simple amino acid composition as descriptors can identify
remotely similar GPCRs better than profile HMMs.
We use SVM-light developed by
Thorsten Joachims for the SVM analysis with amino acid composition (SVM-AA) and dipeptide frequencies
(SVM-di). The radial basis kernel function is used in this application.
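The SVM-AA and SVM-di descriptors themselves are straightforward to compute; a sketch is below (the resulting vectors would then be passed to SVM-light for training and classification, which is not shown):

```python
from itertools import product

AA = 'ACDEFGHIKLMNPQRSTVWY'
DIPEPTIDES = [a + b for a, b in product(AA, repeat=2)]   # 400 ordered pairs

def aa_composition(seq):
    """20-dimensional amino acid composition vector (SVM-AA descriptors)."""
    return [seq.count(a) / len(seq) for a in AA]

def dipeptide_frequencies(seq):
    """400-dimensional dipeptide frequency vector (SVM-di descriptors)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(d) / len(pairs) for d in DIPEPTIDES]
```

Both descriptors have a fixed length regardless of sequence length, which is what allows unaligned sequences of different lengths to be compared in one feature space.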
Partial least squares regression (PLS)
PLS is a projection method similar to principal component analysis but takes into account
correlations between independent and dependent variables.
Lapinsh et al. (2002) used PLS for a GPCR discrimination problem. Each protein sequence was
first transformed into a vector of five principal components based on physico-chemical properties of
amino acids. The auto/cross-covariance (ACC) transformation was then applied to obtain a uniform matrix
from unaligned sequences. In Opiyo and Moriyama (2007), we used the same strategy but obtained
our own five principal components and selected a 30-amino-acid lag size for ACC for optimal
performance for GPCR vs. non-GPCR classification. We found that the performance of PLS with ACC
descriptors is robust even when only a small number of positive samples are available for training.
The performance of profile HMMs suffered when positive sample size was small.
We use the R package
pls, developed by Ron Wehrens and Bjørn-Helge Mevik, for the PLS analysis.
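A sketch of the ACC transformation, assuming each residue has already been replaced by its vector of principal-property components (the function name and the normalization by n − lag are illustrative choices for this sketch):

```python
def acc_transform(zmat, max_lag=30):
    """Auto/cross-covariance (ACC) terms for one sequence.

    zmat: one row per residue, each row a vector of property components
    (e.g., the five principal components described above).  Sequences must
    be longer than max_lag.  The result has fixed length d * d * max_lag
    regardless of sequence length, which is what makes unaligned sequences
    directly comparable.
    """
    n, d = len(zmat), len(zmat[0])
    means = [sum(row[j] for row in zmat) / n for j in range(d)]
    acc = []
    for j in range(d):            # property of the leading residue
        for k in range(d):        # property of the lagged residue
            for lag in range(1, max_lag + 1):
                s = sum((zmat[i][j] - means[j]) * (zmat[i + lag][k] - means[k])
                        for i in range(n - lag))
                acc.append(s / (n - lag))
    return acc
```

With five components and a lag of 30, each protein becomes a 5 × 5 × 30 = 750-term vector, suitable as input to PLS regression.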
Transmembrane prediction (TM)
In our analysis, TM prediction methods can be used as protein classifiers. For example, we can consider proteins that
are predicted to have seven TM regions with an internal N-terminus as the "positives", and all other proteins as "negatives".
With our tool, users can set up various grouping rules using prediction results by HMMTOP and Phobius.
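For example, a grouping rule over parsed prediction results might look like the following sketch (the dictionary field names are illustrative, not HMMTOP's or Phobius's actual output formats, and the parsing itself is not shown):

```python
def seven_tm_internal_n(pred):
    """True for a predicted topology with 7 TM regions and an internal
    N-terminus.  `pred` is a dict such as {'num_tm': 7, 'n_term': 'in'};
    these field names are illustrative only.
    """
    return pred['num_tm'] == 7 and pred['n_term'] == 'in'

def classify(hmmtop_pred, phobius_pred):
    """Example grouping rule: call a protein positive only when the
    HMMTOP and Phobius predictions both satisfy the topology criterion."""
    if seven_tm_internal_n(hmmtop_pred) and seven_tm_internal_n(phobius_pred):
        return 'positive'
    return 'negative'
```

Looser rules (e.g., 6-8 TM regions, or agreement by either predictor) can be expressed by changing the predicate.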
HMMTOP is a hidden Markov model method for topology prediction of helical transmembrane
proteins developed by Tusnády and Simon (1998, 2001).
It is one of the best TM prediction methods (e.g.,
Cuthbertson et al. 2005;
Chen et al. 2002). HMMTOP predicts 97% or more of the protein sequences included in
GPCRDB (the information system for G protein-coupled
receptors) to have 6-8 TM regions (Moriyama et al. 2006).
Our server includes HMMTOP 2.1.
Many proteins begin with a short N-terminal signal peptide, which contains a strongly hydrophobic segment.
Many TM prediction methods make errors by misidentifying these signal peptides as TM regions.
Phobius, developed by
Käll et al. (2006),
addresses this problem by combining a signal peptide model,
SignalP-HMM (Bendtsen et al. 2004), with a TM topology model,
TMHMM (Krogh et al. 2001).
It improved overall accuracy in detecting and differentiating proteins with signal peptides from proteins with transmembrane regions.
Training datasets and data preprocessing
Datasets used for training these methods are available below:
The older datasets and trained models used for our original
Arabidopsis 7TMR mining page
are also available from our supplementary material page for Moriyama et al. (2006).
Protein sequences used in our analysis were preprocessed as follows:
- Sequences shorter than 35 amino acids (aa) are excluded. Such short sequences cannot be used with
PLS-ACC (with lag = 30 aa), and transmembrane prediction methods cannot be applied to them either.
- A '*' symbol at the end of a sequence is considered a 'stop' and removed from the sequence.
- All other non-alphabetical characters are changed to 'X'. Irregular characters cause transmembrane prediction programs
to quit; instead of removing them, we simply change them to 'X' to preserve the sequence lengths.
- Sequences in which 'X' makes up more than 30% of the length are also excluded.
- All lowercase letters are changed to uppercase.
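The preprocessing rules above can be summarized in a single sketch function (the name `preprocess` and the exact order of the steps are our illustrative choices):

```python
import re

def preprocess(seq, min_len=35, max_x_frac=0.30):
    """Apply the preprocessing rules listed above.

    Returns the cleaned sequence, or None if the sequence is excluded.
    """
    seq = seq.upper()                    # lowercase -> uppercase
    if seq.endswith('*'):                # trailing '*' is a stop symbol
        seq = seq[:-1]
    seq = re.sub(r'[^A-Z]', 'X', seq)    # mask irregular characters, keep length
    if len(seq) < min_len:               # too short for PLS-ACC / TM prediction
        return None
    if seq.count('X') > max_x_frac * len(seq):
        return None                      # too many ambiguous residues
    return seq
```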