Journal of Molecular Biology
A Combined Transmembrane Topology and Signal Peptide Prediction Method
Introduction
A well-known weakness of currently available transmembrane (TM) helix predictors is that they frequently misclassify signal peptides (SPs) as TM helices.1., 2., 3. Conversely, SP predictors tend to misclassify TM helices as SPs.4., 5., 6. These errors arise because both types of predictor look primarily for a stretch of hydrophobic residues as the main recognition pattern. It therefore seems natural to try to resolve this lack of discrimination by constructing a joint TM topology and SP predictor.
TM protein topology prediction is a classical problem in bioinformatics. Since the structure of TM proteins is difficult to determine by experimental means, it has been a rewarding task to predict their topologies computationally. It may seem easy to recognize an α-helical TM segment, since it normally consists of a region of 15–30 amino acid residues with an overrepresentation of hydrophobic residues. However, recognition is complicated by the fact that many TM helices in multispanning TM proteins are partially or completely shielded by other TM helices. Since they are not entirely exposed to the lipid bilayer, they constitute amphipathic helices. Long stretches of hydrophobic residues also exist in other types of protein moieties, e.g. buried within globular domains or in SPs, and these can be falsely predicted as TM helices. The task of TM topology prediction, i.e. localizing all TM segments and determining the location (inside the cytoplasm or outside) of the loops, turns out to be far from trivial.
Early TM helix prediction methods were based on experimentally determined hydropathy indices for each amino acid. For the examined protein, a hydropathy plot was calculated by adding the hydropathy indices over a window of fixed length. A heuristically determined cut-off value was then used to indicate possible TM segments.7., 8. An important improvement to this strategy was the observation that there is an overrepresentation of positively charged amino acid residues in the cytoplasmic loops of TM proteins.9 This gave a hint about the location of the loops and led to the development of the first automated full TM topology prediction methods, e.g. TOPPred.10 That method first scans the sequence for certain and putative TM segments and then selects the most likely topology, including none, some or all of the putative segments, based on the charge of the loops. Other methods, instead of calculating hydropathy plots, let a sequence profile (DAS11) or an artificial neural network (PHDhtm12) detect potential TM segments.
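The sliding-window idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any specific published predictor: it uses the Kyte-Doolittle hydropathy scale, averages rather than sums over the window (equivalent up to a constant factor), and the window length of 19 and cut-off of 1.6 are assumed values chosen for illustration. The function names are hypothetical.

```python
# Kyte-Doolittle hydropathy index for each amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


def candidate_tm_segments(sequence, window=19, cutoff=1.6):
    """Merge overlapping windows whose mean hydropathy reaches the cut-off.

    Returns 0-based, end-exclusive (start, end) residue ranges.
    The default window and cut-off are illustrative assumptions.
    """
    segments = []
    for start, value in enumerate(hydropathy_profile(sequence, window)):
        if value < cutoff:
            continue
        end = start + window
        if segments and start <= segments[-1][1]:
            # Window overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments
```

A sketch like this finds hydrophobic stretches but, as the text notes, cannot by itself tell a TM helix from the hydrophobic core of an SP, and it says nothing about loop orientation.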
Instead of scanning the sequence for TM segments and then sorting out the topology as a second step, the search for TM segments can be integrated with the evaluation of possible topologies in one step. The amino acid distribution of the investigated sequence is compared to precalculated expected amino acid distributions in each type of topologically distinct region (TM helices and cytoplasmic and non-cytoplasmic loops) of a TM protein. Given the correlation measurements between the amino acid distributions of the examined protein and the expected amino acid distributions in different topological regions, the most likely topology can be predicted. A nice feature of this approach is the ability to model all parts of the protein so that all topogenic signals are weighted properly, which is preferable to giving priority to the hydrophobic signal. This was first done by expectation maximization in the method Memsat.13 Probabilistic approaches to the problem have been taken as well. A commonly used probabilistic framework for such tasks is the hidden Markov model (HMM).14 Some popular HMM-based predictors are TMHMM1., 15. and HMMTOP.2
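To make the HMM idea concrete, the following toy sketch decodes a topology with the Viterbi algorithm. Everything in it is invented for illustration: the four states, the three-letter reduced alphabet, and all probabilities are assumptions and are not the parameters or architecture of TMHMM, HMMTOP or Phobius. It only shows how helix detection and loop orientation can be resolved in a single decoding step.

```python
import math

# Reduced alphabet (assumption): hydrophobic, positively charged, other.
HYDROPHOBIC = set("AILMFVWC")
POSITIVE = set("KR")


def residue_class(aa):
    if aa in HYDROPHOBIC:
        return "h"
    if aa in POSITIVE:
        return "+"
    return "x"


# Hypothetical states: cytoplasmic loop, helix in->out, outer loop, helix out->in.
STATES = ["i", "io", "o", "oi"]
LABEL = {"i": "i", "io": "M", "o": "o", "oi": "M"}
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "io": 0.1},
    "io": {"io": 0.9, "o": 0.1},
    "o": {"o": 0.9, "oi": 0.1},
    "oi": {"oi": 0.9, "i": 0.1},
}
EMIT = {
    "i": {"h": 0.2, "+": 0.4, "x": 0.4},    # positive charges favour the inside
    "io": {"h": 0.8, "+": 0.05, "x": 0.15},
    "o": {"h": 0.2, "+": 0.1, "x": 0.7},
    "oi": {"h": 0.8, "+": 0.05, "x": 0.15},
}


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(sequence):
    """Return the most probable label string ('i', 'M', 'o') for the sequence."""
    obs = [residue_class(aa) for aa in sequence.upper()]
    if not obs:
        return ""
    score = {s: _log(START.get(s, 0.0)) + _log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s at this position.
            prev = max(STATES, key=lambda p: score[p] + _log(TRANS[p].get(s, 0.0)))
            new_score[s] = score[prev] + _log(TRANS[prev].get(s, 0.0)) + _log(EMIT[s][symbol])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return "".join(LABEL[s] for s in path)
```

Because the two helix states force loops to alternate between the inside and the outside, the charge bias of the loops and the hydrophobicity of the helices are weighed together rather than in two separate passes.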
β-Barrel TM proteins seem to be hard to predict with the classical TM prediction methods, since their TM segments are generally shorter than α-helical TM segments and have a different amino acid composition. Recently, some methods to predict such structures have been published.16., 17. We have chosen not to include β-barrel TM proteins in this study, and we restrict our efforts to modeling α-helical TM segments.
As with the TM segment, one of the strongest indications of an SP is a hydrophobic α-helical region. This is called the h-region of the SP. However, the hydrophobic region is generally shorter for an SP (approximately 7–15 residues) than for a TM helix. The h-region is near the N terminus of the protein, but it is preceded by a slightly positively charged n-region of highly variable length (approximately 1–12 amino acid residues). Between the h-region and the cleavage site lies a somewhat polar, uncharged c-region of 3–8 amino acid residues. Another clear motif in the SP is the presence of small, neutral residues at positions −3 and −1 relative to the cleavage site.18., 19.
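The region lengths and the −3/−1 motif described above can be turned into a simple filter for candidate cleavage sites. The sketch below is only an illustration under stated assumptions: the set of "small, neutral" residues is taken to be A, G and S, the SP length bounds (11–35) are just the sums of the approximate region lengths quoted in the text, and the function names are hypothetical. It is not the scoring scheme of SignalP or Phobius.

```python
# Assumption: "small, neutral" residues are taken to be Ala, Gly and Ser.
SMALL_NEUTRAL = set("AGS")


def satisfies_minus3_minus1_rule(sequence, cleavage_index):
    """Check the -3/-1 motif for a candidate cleavage site.

    cleavage_index is the 0-based index of the first residue of the mature
    protein (position +1), which also equals the SP length.
    """
    if cleavage_index < 3 or cleavage_index > len(sequence):
        return False
    return (sequence[cleavage_index - 3] in SMALL_NEUTRAL
            and sequence[cleavage_index - 1] in SMALL_NEUTRAL)


def candidate_cleavage_sites(sequence, min_sp_len=11, max_sp_len=35):
    """List SP lengths in the allowed range whose site satisfies the motif.

    The default bounds add up the approximate n- (1-12), h- (7-15) and
    c-region (3-8) lengths given in the text.
    """
    seq = sequence.upper()
    last = min(max_sp_len, len(seq) - 1)  # leave at least one mature residue
    return [n for n in range(min_sp_len, last + 1)
            if satisfies_minus3_minus1_rule(seq, n)]
```

For a made-up sequence built as n-region `MKR`, h-region of ten leucines, c-region `PQASA` and a mature part `DETKNQ`, the only site returned is the one directly after the c-region (SP length 18).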
Most available SP prediction methods use weight matrices,20 artificial neural networks (e.g. SignalP4), HMMs (e.g. SignalP-HMM5) or support vector machines.21., 22. An evaluation23 showed that the very popular method SignalP V2.0.b2 is more sensitive than the other methods and predicts cleavage sites more accurately, but produces many false positive predictions.
Alongside its SP model, SignalP-HMM5 uses a model of a signal anchor, i.e. a TM protein with one TM segment near the N terminus of the protein, to help discriminate against false positives. Similarly, LipoP24 models N-terminal TM helices, SPs and lipoprotein signal peptides in Gram-negative bacteria to improve discrimination between these categories. However, as far as we know, nobody has yet constructed a joint TM topology and SP predictor.
Apart from improved SP/TM discrimination, an additional reason for including an SP model when predicting TM topology is that the presence of an SP indicates that the N terminus of the mature TM protein is on the non-cytoplasmic side of the membrane. In that case, the TM topology prediction problem is reduced to finding the correct TM helices, since the orientation of the protein is given by the SP prediction.
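The reduction described above can be illustrated with a short sketch: once the helix positions and the side of the N terminus are known, the loop labels follow by simple alternation. The label alphabet (`S` for signal peptide, `i` for cytoplasmic, `M` for membrane, `o` for non-cytoplasmic) and the function names are assumptions made for this example.

```python
def assign_topology(length, helices, n_term_side):
    """Label every residue given the TM helices and the N-terminal side.

    helices: sorted, non-overlapping 0-based end-exclusive (start, end) ranges.
    n_term_side: 'i' (cytoplasmic) or 'o' (non-cytoplasmic).
    """
    labels = ""
    side = n_term_side
    position = 0
    for start, end in helices:
        labels += side * (start - position)   # loop before the helix
        labels += "M" * (end - start)         # the helix itself
        side = "o" if side == "i" else "i"    # each helix crosses the membrane
        position = end
    labels += side * (length - position)      # C-terminal loop
    return labels


def topology_with_signal_peptide(length, sp_length, helices):
    """With a predicted SP, the mature N terminus is non-cytoplasmic ('o')."""
    shifted = [(start - sp_length, end - sp_length) for start, end in helices]
    return "S" * sp_length + assign_topology(length - sp_length, shifted, "o")
```

With an SP present, only the helix positions remain to be determined; without one, both orientations have to be considered.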
Here we describe a new method, Phobius, based on an HMM, that aims to predict both the TM topology of a protein and the presence of an SP in the protein. The choice of the HMM framework as prediction technique is natural because it has been used successfully for both prediction types separately, and a combination of the model types is relatively straightforward. The main strength of Phobius lies in its ability to discriminate TM segments from SPs. This makes it more accurate on mixed TM/SP proteins than the best TM-only and SP-only predictors. For SP-only proteins, it is more conservative than SignalP, i.e. it has a lower false positive rate but also a higher false negative rate.
Model architecture
The model architecture of Phobius can be regarded as a combination of the models made in TMHMM and SignalP-HMM, with a transition from the last state of the SP model in SignalP-HMM to the outer loop state in the TMHMM model. However, several modifications were made to both models. Different combinations of these modifications were then compared against each other (data not shown). The final Phobius model, which is the architecture with the best performance, is shown in Figure 1(a). In contrast
Discussion
We have trained and tested a new prediction method, Phobius, that predicts both transmembrane helices and SPs. By handling both types of prediction at the same time, it discriminates better between SPs and TM helices. The method is based on an HMM and works without any post-processing of the HMM decoding.
We have shown that Phobius is able to reduce cross-prediction errors when analyzing the genomes of H. sapiens and E. coli, thereby giving more accurate figures of TM protein and SP content. We
Data sets
We have collected and curated four different datasets:
- A set with TM proteins with SPs (Both-TM-and-SP set)
- A set with TM proteins without SPs (TM-only set)
- A set with SPs and no TM helices (SP-only set)
- A set without SPs or TM helices (Neither-TM-nor-SP set)
The sets containing TM helices originate from different sources:
- 146 sequences from the TMHMM “160 dataset” (ref. 15)
- 140 sequences from TMPDB ver 6.2 (ref. 35)
- 2 sequences from the Möller dataset (ref. 36)
- 4 sequences of TM proteins with known 3D structure found in
Acknowledgements
This work was supported by grants from Pfizer Corporation and from the Swedish Knowledge Foundation. A.K. was supported by EU grant no. QLRI-CT-2001-00015.
References (40)
- et al. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. (2001)
- et al. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J. Mol. Biol. (1998)
- et al. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. (1982)
- Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol. (1992)
- et al. A putative signal peptidase recognition site and sequence in eukaryotic and prokaryotic signal peptides. J. Mol. Biol. (1983)
- et al. Hidden Markov models in computational biology. Applications to protein modeling. J. Mol. Biol. (1994)
- et al. Comprehensive analysis of transmembrane topologies in prokaryotic genomes. Gene (2003)
- et al. Reliability of transmembrane predictions in whole-genome data. FEBS Letters (2002)
- et al. Reliability measures for membrane protein topology prediction algorithms. J. Mol. Biol. (2003)
- et al. The presence of signal peptide significantly affects transmembrane topology prediction. Bioinformatics (2002)
- Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng.
- Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Conf. Intell. Syst. Mol. Biol.
- Structural prediction of membrane-bound proteins. Eur. J. Biochem.
- The distribution of positively charged residues in bacterial inner membrane proteins correlates with the trans-membrane topology. EMBO J.
- Prediction of transmembrane alpha-helices in prokaryotic membrane proteins: the dense alignment surface method. Protein Eng.
- Transmembrane helices predicted at 95% accuracy. Protein Sci.
- A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry
- A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE
- A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol.