A Combined Transmembrane Topology and Signal Peptide Prediction Method

https://doi.org/10.1016/j.jmb.2004.03.016

Abstract

An inherent problem in transmembrane protein topology prediction and signal peptide prediction is the high similarity between the hydrophobic regions of a transmembrane helix and those of a signal peptide, leading to cross-reaction between the two types of predictions. To improve predictions further, it is therefore important to make a predictor that aims to discriminate between the two classes. In addition, topology information can be gained when successfully predicting a signal peptide leading a transmembrane protein, since it dictates that the N terminus of the mature protein must be on the non-cytoplasmic side of the membrane. Here, we present Phobius, a combined transmembrane protein topology and signal peptide predictor. The predictor is based on a hidden Markov model (HMM) that models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states. Training was done on a newly assembled and curated dataset. Compared to TMHMM and SignalP, errors coming from cross-prediction between transmembrane segments and signal peptides were reduced substantially by Phobius. False classifications of signal peptides were reduced from 26.1% to 3.9% and false classifications of transmembrane helices were reduced from 19.0% to 7.7%. Phobius was applied to the proteomes of Homo sapiens and Escherichia coli. Here we also noted a drastic reduction of false classifications compared to TMHMM/SignalP, suggesting that Phobius is well suited for whole-genome annotation of signal peptides and transmembrane regions. The method is available at http://phobius.cgb.ki.se/ as well as at http://phobius.binf.ku.dk/

Introduction

A well-known weakness of currently available transmembrane (TM) helix predictors is the frequent false classification of signal peptides (SPs) as TM helices [1, 2, 3]. Conversely, SP predictors tend to falsely classify TM helices as SPs [4, 5, 6]. These frequent false classifications arise because both types of predictor look primarily for a stretch of hydrophobic residues as the main recognition pattern. It therefore seems natural to try to resolve this lack of discrimination by constructing a joint TM topology and SP predictor.

TM protein topology prediction is a classical problem in bioinformatics. Since the structure of TM proteins is difficult to determine by experimental means, it has been a rewarding task to predict their topologies computationally. It may seem easy to recognize an α-helical TM segment, since it normally consists of a region of 15–30 amino acid residues with an overrepresentation of hydrophobic residues. However, the task is complicated by the fact that many TM helices in multispanning TM proteins are partially or completely shielded by other TM helices. Since they are not entirely exposed to the lipid bilayer, they constitute amphipathic helices. Long stretches of hydrophobic residues also exist in other types of protein moieties, e.g. buried within globular domains or in SPs, which could be falsely predicted as TM helices. The task of making TM topology predictions, i.e. localizing all TM segments as well as determining the location of the loops (inside the cytoplasm or outside), turns out to be far from trivial.

Early TM helix prediction methods were based on experimentally determined hydropathy indices that quantify the hydrophobic properties of each amino acid. For the examined protein, a hydropathy plot was calculated by adding the hydropathy indices over a window of fixed length. A heuristically determined cut-off value was then used to indicate possible TM segments [7, 8]. An important improvement to this strategy was the observation that there is an overrepresentation of positively charged amino acid residues in the cytoplasmic loops of TM proteins [9]. This gave a hint about the location of the loops and led to the development of the first automated full TM topology prediction methods, e.g. TOPPred [10]. That method first scans the sequence for certain and putative TM segments and then selects the most likely topology, including none, some or all of the putative segments, based on the charge of the loops. Instead of calculating hydropathy plots, other methods let a sequence profile (DAS [11]) or an artificial neural network (PHDhtm [12]) detect potential TM segments.
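The window-and-cut-off idea can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any cited method: the Kyte-Doolittle scale, the window length of 19, the cut-off of 1.6 and the function names are all assumptions chosen for the example, and the window score is averaged rather than summed so that the cut-off does not depend on the window length.

```python
# Kyte-Doolittle hydropathy scale (an assumed choice; the text names no scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window of length `window`."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def candidate_segments(seq, window=19, cutoff=1.6):
    """Merge overlapping or adjacent windows scoring >= cutoff.

    Returns half-open (start, end) residue ranges, 0-based.
    """
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score < cutoff:
            continue
        if segments and i <= segments[-1][1]:
            # Window touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], i + window)
        else:
            segments.append((i, i + window))
    return segments


if __name__ == "__main__":
    # Hypothetical sequence: acidic flank, 20 leucines, basic flank.
    toy = "D" * 10 + "L" * 20 + "K" * 10
    print(candidate_segments(toy))
```

Because every window that overlaps the hydrophobic stretch strongly enough passes the cut-off, the merged range is wider than the stretch itself. A sketch like this also has no way to tell an SP h-region from a TM helix, which is the cross-prediction problem discussed above.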

Instead of scanning the sequence for TM segments and then sorting out the topology as a second step, the search for TM segments can be integrated with the evaluation of possible topologies in one step. The amino acid distribution of the investigated sequence is compared to precalculated expected amino acid distributions in each type of topologically distinct region (TM helices and cytoplasmic and non-cytoplasmic loops) of a TM protein. Given the correlation measurements between the amino acid distributions of the examined protein and the expected amino acid distributions in different topological regions, the most likely topology can be predicted. A nice feature of this approach is the ability to model all parts of the protein so that all topogenic signals are weighted properly, which is preferable to giving priority to the hydrophobic signal. This was first done by expectation maximization in the method Memsat [13]. Probabilistic approaches to the problem have been taken as well. A commonly used probabilistic framework for such tasks is the hidden Markov model (HMM) [14]. Some popular HMM-based predictors are TMHMM [1, 15] and HMMTOP [2].
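To make the one-step idea concrete, the following Python sketch decodes a toy four-state HMM with the Viterbi algorithm. It is an illustration under stated assumptions only: the state names, the three-letter residue alphabet and every probability are invented for the example and do not reproduce the architecture or parameters of TMHMM, HMMTOP or Phobius.

```python
import math

# Toy states: cytoplasmic loop, helix (in to out), non-cytoplasmic loop,
# helix (out to in). Two helix states force loop sides to alternate.
STATES = ["i", "M_io", "o", "M_oi"]

START = {"i": 0.5, "o": 0.5}  # a sequence starts in a loop
TRANS = {
    "i":    {"i": 0.9, "M_io": 0.1},
    "M_io": {"M_io": 0.9, "o": 0.1},
    "o":    {"o": 0.9, "M_oi": 0.1},
    "M_oi": {"M_oi": 0.9, "i": 0.1},
}
# Emissions over residue classes: h = hydrophobic, + = positive, p = other.
# Positive residues are more likely in the cytoplasmic loop state.
EMIT = {
    "i":    {"h": 0.2, "+": 0.3, "p": 0.5},
    "M_io": {"h": 0.8, "+": 0.02, "p": 0.18},
    "o":    {"h": 0.2, "+": 0.1, "p": 0.7},
    "M_oi": {"h": 0.8, "+": 0.02, "p": 0.18},
}


def residue_class(aa):
    """Map an amino acid to the reduced alphabet used by the toy model."""
    if aa in "AILMFVWC":
        return "h"
    if aa in "KR":
        return "+"
    return "p"


def log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Most probable state path for `seq` under the toy model."""
    obs = [residue_class(aa) for aa in seq]
    # score[s]: best log-probability of any path ending in state s.
    score = {s: log(START.get(s, 0.0)) + log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda r: score[r] + log(TRANS[r].get(s, 0.0)))
            new_score[s] = (score[prev] + log(TRANS[prev].get(s, 0.0))
                            + log(EMIT[s][symbol]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical sequence: basic flank, 20 leucines, polar flank.
    print(viterbi("KRKRKR" + "L" * 20 + "SSSSSS"))
```

The point of the sketch is that segment detection and orientation are decided jointly: the positively charged N-terminal flank pulls the first loop into the cytoplasmic state, and the helix direction follows from that in the same decoding pass.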

β-Barrel TM proteins seem to be hard to predict with the classical TM prediction methods, since their TM segments are generally shorter and have a different amino acid composition than α-helical TM segments. Lately, some methods to predict such structures have been published [16, 17]. We have chosen not to include β-barrel TM proteins in this study, and we restrict our efforts to modelling α-helical TM segments.

As with the TM segment, one of the strongest indications of an SP is a hydrophobic α-helical region, called the h-region of the SP. However, the hydrophobic region is generally shorter for an SP (approximately 7–15 residues) than for a TM helix. The h-region is near the N terminus of the protein, but it is preceded by a slightly positively charged n-region of highly variable length (approximately 1–12 amino acid residues). Between the h-region and the cleavage site lies a somewhat polar and uncharged c-region of 3–8 amino acid residues. Another clear motif of the SP is the presence of small, neutral residues at positions −3 and −1 relative to the cleavage site [18, 19].
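The (−3, −1) motif lends itself to a short Python sketch. The following is a simplified illustration rather than any published predictor: the set of residues treated as "small, neutral" and the function names are assumptions, and the SP length bounds of 11 and 35 are simply the sums of the minimum and maximum region lengths quoted above (1 + 7 + 3 and 12 + 15 + 8).

```python
# Residues treated as "small, neutral" here (an assumption for illustration).
SMALL_NEUTRAL = set("AGSCT")


def satisfies_minus3_minus1(seq, cleavage_index):
    """Check the (-3, -1) motif for a candidate cleavage site.

    `cleavage_index` is the 0-based index of the first residue of the
    mature protein, so positions -1 and -3 are the residues one and
    three places before it.
    """
    if cleavage_index < 3 or cleavage_index >= len(seq):
        return False
    return (seq[cleavage_index - 3] in SMALL_NEUTRAL
            and seq[cleavage_index - 1] in SMALL_NEUTRAL)


def candidate_cleavage_sites(seq, min_sp_length=11, max_sp_length=35):
    """All cleavage indices within the SP length bounds that fit the motif."""
    last = min(len(seq) - 1, max_sp_length)
    return [i for i in range(min_sp_length, last + 1)
            if satisfies_minus3_minus1(seq, i)]


if __name__ == "__main__":
    # Hypothetical precursor: n-region MKK, h-region of 10 leucines,
    # c-region PGASA, then an acidic mature sequence.
    toy = "MKK" + "L" * 10 + "PGASA" + "DEDEDEDEDE"
    print(candidate_cleavage_sites(toy))
```

On the toy sequence the motif alone matches more than one position, which illustrates why predictors combine it with models of the n-, h- and c-regions instead of using it in isolation.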

Most available SP prediction methods use weight matrices [20], artificial neural networks (e.g. SignalP [4]), HMMs (e.g. SignalP-HMM [5]) or support vector machines [21, 22]. An evaluation [23] showed that the very popular method SignalP V2.0.b2 is more sensitive than the other methods and predicts cleavage sites more accurately, but includes many false positive predictions.

Alongside its SP model, SignalP-HMM [5] uses a model of a signal anchor, i.e. a TM protein with one TM segment near the N terminus of the protein, to help discriminate against false positives. Similarly, LipoP [24] models N-terminal TM helices, SPs and lipoprotein signal peptides in Gram-negative bacteria to improve discrimination between these categories. However, as far as we know, nobody has yet constructed a joint TM topology and SP predictor.

An additional reason for including an SP model when predicting TM topology, apart from improved SP/TM discrimination, is that the presence of an SP indicates that the N terminus of the mature TM protein is on the non-cytoplasmic side of the membrane. In that case, the TM topology prediction problem is reduced to finding the correct TM helices, since the orientation of the protein is given by the SP prediction.
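The following Python sketch spells out why a predicted SP fixes the orientation. It is a minimal illustration with hypothetical function and parameter names: once the side of the first loop is known, the side of every later loop follows by alternation across the TM helices.

```python
def loop_locations(n_tm_helices, has_signal_peptide, n_terminus_inside=None):
    """Side of each loop of the mature protein, from N to C terminus.

    With a signal peptide, the N terminus of the mature protein is
    non-cytoplasmic, so no separate orientation prediction is needed.
    Without one, the caller must supply `n_terminus_inside`.
    """
    if has_signal_peptide:
        first = "non-cytoplasmic"
    elif n_terminus_inside is None:
        raise ValueError("orientation unknown without a signal peptide")
    else:
        first = "cytoplasmic" if n_terminus_inside else "non-cytoplasmic"
    other = "cytoplasmic" if first == "non-cytoplasmic" else "non-cytoplasmic"
    # Each TM helix crosses the membrane once, so loop sides alternate.
    return [first if k % 2 == 0 else other for k in range(n_tm_helices + 1)]


if __name__ == "__main__":
    # Hypothetical protein with a signal peptide and three TM helices.
    print(loop_locations(3, has_signal_peptide=True))
```

A protein with n TM helices has n + 1 loops (counting both termini), so the returned list is always one element longer than the helix count.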

Here we describe a new method, Phobius, based on an HMM, that aims to predict both the TM topology of a protein and the presence of an SP. The choice of the HMM framework as the prediction technique is natural because it has been used successfully for both prediction types separately, and combining the model types is relatively straightforward. The main strength of Phobius lies in its ability to discriminate TM segments from SPs. This makes it more accurate on mixed TM/SP proteins than the best TM-only and SP-only predictors. For SP-only proteins, it is more conservative than SignalP, i.e. it has a lower false positive rate but also a higher false negative rate.

Model architecture

The model architecture of Phobius can be regarded as a combination of the models made in TMHMM and SignalP-HMM, with a transition from the last state of the SP model in SignalP-HMM to the outer loop state in the TMHMM model. However, several modifications were made to both models. Different combinations of these modifications were then compared against each other (data not shown). The final Phobius model, which is the architecture with the best performance, is shown in Figure 1(a). In contrast

Discussion

We have trained and tested a new prediction method, Phobius, that predicts both transmembrane helices and SPs. By handling both types of prediction at the same time, it discriminates better between SPs and TM helices. The method is based on an HMM and works without any post-processing of the HMM decoding.

We have shown that Phobius is able to reduce cross-prediction errors when analyzing the genomes of H. sapiens and E. coli, thereby giving more accurate figures of TM protein and SP content. We

Data sets

We have collected and curated four different datasets:

  • A set with TM proteins with SPs (Both-TM-and-SP set)

  • A set with TM proteins without SPs (TM-only set)

  • A set with SPs and no TM helices (SP-only set)

  • A set without SPs or TM helices (Neither-TM-nor-SP set)

The sets containing TM helices originate from different sources:

  • 146 sequences from the TMHMM “160 dataset” [15]

  • 140 sequences from TMPDB ver 6.2 [35]

  • 2 sequences from the Möller dataset [36]

  • 4 sequences of TM proteins with known 3D structure found in

Acknowledgements

This work was supported by grants from Pfizer Corporation and from the Swedish Knowledge Foundation. A.K. was supported by EU grant no. QLRI-CT-2001-00015.

References (40)

  • H Nielsen et al. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. (1997)
  • H Nielsen et al. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1998)
  • Nielsen, H. (1999). From sequence to sorting: prediction of signal peptides. PhD Thesis, Stockholm...
  • P Argos et al. Structural prediction of membrane-bound proteins. Eur. J. Biochem. (1982)
  • G von Heijne. The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J. (1986)
  • M Cserzo et al. Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng. (1997)
  • B Rost et al. Transmembrane helices predicted at 95% accuracy. Protein Sci. (1995)
  • D.T Jones et al. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry (1994)
  • L.R Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE (1989)
  • E.L Sonnhammer et al. A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. (1998)