— a hierarchical mining tool for 7TMRs —


Hidden Markov models (HMMs)

A profile HMM is a fully probabilistic representation of a sequence profile. A non-probabilistic profile was first described by Gribskov et al. (1987). Based on a multiple alignment of related protein sequences, a profile represents amino acid conservation and insertion/deletion information in a position-specific score matrix (PSSM). A profile carries more flexible information about a given protein family than a single sequence does. Therefore, database search methods using profiles are more sensitive to remote similarities than those based on pairwise alignments (e.g., regular BLAST). Profiles are used in, for example, PROSITE and PSI-BLAST.

Profiles do not have an underlying probabilistic model, and profile search methods rely on ad hoc scoring schemes. Profile HMMs provide probabilistic models for protein families (e.g., Eddy 1998). Information in multiple alignments is captured by linear sequences of 'match', 'insert', and 'delete' states. Each 'match' and 'insert' state is associated with emission probabilities for amino acids, and states are linked by transition probabilities. Different profile HMM implementations adopt different architectures. The two most popular implementations are HMMER and SAM. HMMER is used in the PFAM database of protein families and HMMs and in SMART: Simple Modular Architecture Research Tool. SAM is used in the Superfamily HMM library and genome assignments server.

We use SAM: Sequence Alignment and Modeling System for the profile HMM analysis. All E-values for SAM are calculated based on the same sample size, 30000, so that the E-values can be directly compared between databases. Three E-value thresholds are chosen to provide different levels of identification: E=0.05 (SAM; the most stringent), E=4.23 (SAM1; based on the highest E-value among the Arabidopsis MLO proteins, that of MLO3), and E=6.52 (SAM2; the least stringent, based on the minimum error points from the training set).
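The three thresholds can be read as nested levels of stringency. The following is a minimal Python sketch of that reading, not code from the server: the function name `sam_level` is hypothetical, and treating each cutoff as inclusive is an assumption.

```python
# Thresholds quoted in the text, ordered from most to least stringent.
THRESHOLDS = [
    ("SAM", 0.05),   # most stringent
    ("SAM1", 4.23),  # highest E-value among the Arabidopsis MLOs (MLO3)
    ("SAM2", 6.52),  # least stringent
]


def sam_level(e_value):
    """Return the most stringent level that the E-value passes, or None.

    Hypothetical helper; assumes cutoffs are inclusive (E <= threshold).
    """
    for name, cutoff in THRESHOLDS:
        if e_value <= cutoff:
            return name
    return None


if __name__ == "__main__":
    for e in (0.001, 1.0, 5.0, 10.0):
        print(e, sam_level(e))
```

A hit reported at the SAM level also passes SAM1 and SAM2, so the returned label is the strictest level reached.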

We also include another HMM classifier, GPCRHMM, developed by Wistrand et al. (2006). The authors constructed a compartmentalized HMM incorporating distinct loop length patterns and differences in amino acid composition between cytosolic loops, extracellular loops, and membrane regions, based on a diverse set of GPCR sequences. Note that their training set included eleven of the 13 PFAM GPCR protein families. The other two families were excluded from their training set as outliers: the Drosophila odorant receptor family 7tm_6 (PF02949) and the plant family Mlo (PF03094).

Discriminant function analysis and related methods

In Kim et al. (2000), we described a protein classification method that relies on neither multiple alignments nor pattern/profile database searches. The method uses concise statistical variables that characterize the quasi-periodic physico-chemical properties of multi-transmembrane proteins. Using both positive samples (proteins of interest) and negative samples (other proteins), a nonparametric linear discriminant function was successfully trained with these variables to discriminate G-protein coupled receptors (GPCRs) from non-GPCRs. In Moriyama and Kim (2005), we further examined several parametric and nonparametric discrimination methods, including linear (LDA), quadratic (QDA), and logistic (LOG) discriminant analysis, as well as the nonparametric K-nearest neighbor method (KNN). All of these discriminant function analysis methods performed better than PFAM, PROSITE, and PRINTS, especially when applied to short partial sequences.

We use S-PLUS (Insightful Corporation) with the MASS library for the discriminant function analyses: LDA, QDA, LOG, and KNN. For the KNN method, the number of neighbors (K) is chosen from 5, 10, 15, or 20.
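To illustrate the role of K, here is a minimal pure-Python sketch of K-nearest neighbor classification. It is not the S-PLUS code used on the server: the function name `knn_predict`, the Euclidean distance, and the toy two-dimensional data are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training points nearest to `query`.

    Hypothetical helper; uses Euclidean distance as an assumption.
    """
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data: two well-separated clusters of made-up feature vectors.
    train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
               (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]
    train_y = ["GPCR"] * 5 + ["non-GPCR"] * 5
    print(knn_predict(train_x, train_y, (0.2, 0.2), k=5))
```

With an even K such as 10 or 20, the vote can tie; this sketch does not resolve ties in any principled way.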

These methods are used for the analysis of complete genomes, and users can explore all of these results. They are not currently available for the analysis of user-submitted data because of the S-PLUS license. We are considering porting them to R (freeware) in the future to make them available for public use.

Support Vector Machines (SVMs)

SVMs are learning machines that make binary classifications based on a hyperplane separating a remapped instance space (e.g., Burges 1998). Kernel functions are chosen so that the remapped instances on a multidimensional space are linearly separable. Similar to discriminant analysis, multiple variables extracted from protein sequences can be used, and training is done using both positive and negative samples. In Strope and Moriyama (2007), we compared the performance of various SVM classifiers and profile HMMs. We showed that SVMs using simple amino acid composition as descriptors can identify remotely similar GPCRs better than profile HMMs.

We use SVM-light developed by Thorsten Joachims for the SVM analysis with amino acid composition (SVM-AA) and dipeptide frequencies (SVM-di). The radial basis kernel function is used in this application.
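The two descriptor sets and the kernel can be sketched in a few lines of Python. This is an illustration under assumptions, not the server's pipeline: the function names are hypothetical, residues outside the 20 standard amino acids (such as 'X') are simply skipped, and `gamma=1.0` is an arbitrary placeholder rather than the value used in the analysis.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aa_composition(seq):
    """Relative frequencies of the 20 standard amino acids (20 values)."""
    counts = {a: 0 for a in AMINO_ACIDS}
    for c in seq:
        if c in counts:  # assumption: non-standard residues are skipped
            counts[c] += 1
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def dipeptide_frequencies(seq):
    """Relative frequencies of overlapping residue pairs (400 values)."""
    counts = {d: 0 for d in DIPEPTIDES}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


def rbf_kernel(u, v, gamma=1.0):
    """Radial basis kernel exp(-gamma * ||u - v||^2); gamma is a placeholder."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    a = aa_composition("MKTAYIAKQR")
    b = aa_composition("MKVLAAGIVG")
    print(len(a), len(dipeptide_frequencies("MKTAYIAKQR")), rbf_kernel(a, b))
```

Both descriptors have a fixed length regardless of sequence length, which is why they can be used without an alignment.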

Partial least squares regression (PLS)

PLS is a projection method similar to principal component analysis, but it takes into account correlations between independent and dependent variables. Lapinsh et al. (2002) used PLS for a GPCR discrimination problem. Each protein sequence was first transformed to a vector of five principal components based on physico-chemical properties of amino acids. Then the auto/cross-covariance (ACC) transformation was applied to obtain a uniform matrix from unaligned sequences. In Opiyo and Moriyama (2007), we used the same strategy but obtained our own five principal components and selected a lag size of 30 amino acids for ACC for optimal performance in GPCR vs. non-GPCR classification. We found that the performance of PLS with ACC descriptors is robust even when only a small number of positive samples are available for training. The performance of profile HMMs, in contrast, suffered when the positive sample size was small.
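The ACC step is what turns sequences of different lengths into vectors of one fixed length. Below is a minimal Python sketch of one common form of the transformation, given as an assumption rather than the exact formula of the cited papers: each lagged product sum is divided by (n - lag), and the per-residue descriptors are taken to be already scaled. The function name `acc_transform` and the toy input are hypothetical.

```python
def acc_transform(descriptors, max_lag=30):
    """Auto/cross-covariance features from per-residue descriptor rows.

    `descriptors` holds one row per residue, each with d values.
    Returns d * d * max_lag values: for every lag and every ordered pair
    (j, k), j == k gives an auto-covariance term and j != k a
    cross-covariance term. Normalizing by (n - lag) is an assumption.
    """
    n = len(descriptors)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    d = len(descriptors[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                total = sum(
                    descriptors[i][j] * descriptors[i + lag][k]
                    for i in range(n - lag)
                )
                features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    # Toy input: 40 residues, 5 made-up descriptor values each.
    rows = [[float((i * (j + 1)) % 7) for j in range(5)] for i in range(40)]
    print(len(acc_transform(rows)))  # 5 * 5 * 30 values in this sketch
```

Because the sum runs over positions i and i + lag, a sequence must be longer than the largest lag, which is consistent with short sequences being unusable with a 30-residue lag.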

We use the R package pls, developed by Ron Wehrens and Bjørn-Helge Mevik, for the PLS analysis.

Transmembrane prediction (TM)

In our analysis, TM methods can be used as protein classifiers. For example, we can consider proteins that have 7 TM regions and a topology with an internal N-terminus as the "positives", and all other proteins as "negatives". With our tool, users can set up various grouping rules using the prediction results from HMMTOP and Phobius.
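A grouping rule of this kind is a simple predicate over a topology prediction. The sketch below is hypothetical: the `TopologyPrediction` record and the `is_positive` function are illustrative names, not the output format of HMMTOP or Phobius and not the tool's interface.

```python
from dataclasses import dataclass


@dataclass
class TopologyPrediction:
    """Hypothetical summary of one TM prediction result."""
    n_tm: int        # number of predicted TM regions
    n_terminus: str  # predicted N-terminus location: "in" or "out"


def is_positive(pred, tm_counts=(7,), n_terminus="in"):
    """Apply a grouping rule; defaults follow the example in the text."""
    return pred.n_tm in tm_counts and pred.n_terminus == n_terminus


if __name__ == "__main__":
    print(is_positive(TopologyPrediction(7, "in")))
    # A looser, made-up rule that accepts 6 to 8 TM regions.
    print(is_positive(TopologyPrediction(6, "in"), tm_counts=(6, 7, 8)))
```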

HMMTOP is a hidden Markov model method for topology prediction of helical transmembrane proteins developed by Tusnády and Simon (1998, 2001). It is one of the best TM prediction methods (e.g., Cuthbertson et al. 2005; Chen et al. 2002). HMMTOP predicts 97% or more of protein sequences included in GPCRDB: Information system for G protein-coupled receptors to have 6-8 TM regions (Moriyama et al. 2006). Our server includes HMMTOP 2.1.

Many proteins include a short N-terminal signal peptide, which contains a strongly hydrophobic segment. Many TM prediction methods make errors by misidentifying these signal peptides as TM regions. Phobius, developed by Käll et al. (2006), addressed this problem by combining a signal peptide model, SignalP-HMM (Bendtsen et al. 2004), with TMHMM (Krogh et al. 2001). It improved overall accuracy in detecting and differentiating proteins with signal peptides and proteins with TM segments.

Training datasets and data preprocessing

Datasets used for training these methods are available below:

The older datasets and trained models used for our original Arabidopsis 7TMR mining page are also available from our supplementary material page for Moriyama et al. (2006).

Protein sequences used in our analysis were preprocessed as follows:

  • Sequences shorter than 35 amino acids (aa) are excluded. Such short sequences cannot be used with PLS-ACC (with lag=30aa). Transmembrane prediction methods cannot be applied, either.
  • A '*' symbol at the end of a sequence is treated as a 'stop' and removed from the sequence.
  • All other non-alphabetical letters are changed to 'X'. Irregular letters cause transmembrane prediction programs to quit. Instead of removing these letters, we simply change them to 'X' to keep the sequence lengths.
  • Sequences in which Xs make up more than 30% of the length are also excluded.
  • All lower case letters are also changed to upper case letters.
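The preprocessing rules above can be sketched as a single Python function. This is a minimal illustration, not the server's code: the function name `preprocess` is hypothetical, and the order of the steps (clean first, then apply the length and X-content filters) and the stripping of surrounding whitespace are assumptions.

```python
def preprocess(seq, min_len=35, max_x_fraction=0.30):
    """Return the cleaned sequence, or None if it is excluded.

    Hypothetical helper; the step order is an assumption.
    """
    seq = seq.strip()          # assumption: drop surrounding whitespace
    if seq.endswith("*"):      # trailing '*' is a stop symbol
        seq = seq[:-1]
    seq = seq.upper()          # lower case to upper case
    # Any character other than A-Z becomes 'X', keeping the length.
    seq = "".join(c if "A" <= c <= "Z" else "X" for c in seq)
    if len(seq) < min_len:     # too short for PLS-ACC and TM prediction
        return None
    if seq.count("X") > max_x_fraction * len(seq):
        return None
    return seq


if __name__ == "__main__":
    print(preprocess("acd" * 12 + "*"))
    print(preprocess("ACD" * 5))
```

Replacing irregular characters instead of deleting them keeps residue positions unchanged, as the list notes.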

[Last updated: 04/28/2009]