Journal of Molecular Biology
Volume 302, Issue 4, 29 September 2000, Pages 917-926

Regular article
A fast method to predict protein interaction sites from sequences¹

https://doi.org/10.1006/jmbi.2000.4092

Abstract

A simple method for predicting residues involved in protein interaction sites is proposed. In the absence of any structural data, the procedure identifies linear stretches of sequence as “receptor-binding domains” (RBDs) by analysing the hydrophobicity distribution. The sequences of two databases of non-homologous interaction sites eliciting various biological activities were tested; 59–80 % were detected as RBDs. A statistical analysis of amino acid frequencies was carried out both in known interaction sites and in RBDs predicted from the 80,000 sequences of the Swissprot database. In both cases, arginine is the most frequently occurring residue. The RBD procedure can also detect residues involved in specific interaction sites such as DNA-binding (95 % detected) and Ca-binding (83 % detected) domains. We report two recent analyses that proceeded from the prediction of RBDs in sequences to the experimental demonstration of functional activity. The examples concern a retroviral Gag protein and a penicillin-binding protein. We propose that this method is a quick way to predict protein interaction sites from sequences and is helpful for guiding experiments such as site-specific mutagenesis, two-hybrid systems or the synthesis of inhibitors.

Introduction

Protein interaction sites are critical domains for the selective recognition of molecules and for the formation of complexes. They are responsible for diverse important biological functions. Therefore, detection of interaction domains in sequences could help in identifying protein function. It could also help, for example, to validate functional hypotheses via the design of restricted fragments for two-hybrid assays (Vidal et al., 1996) or of specific mutageneses (Phizicky & Fields, 1995). Computational methods are of great interest for predicting interacting protein pairs, and thus for constructing metabolic pathways or signalling cascades for recently sequenced genomes. The prediction of interaction sites should be a good starting point to help identify pharmacological targets and to support drug design studies. Such analyses require the elaboration of docking procedures (Janin, 1995; Shoichet & Kuntz, 1996; Sternberg et al., 1998), knowledge of protein and ligand structures (Bamborough & Cohen, 1996) and the consideration of conformational changes (Betts & Sternberg, 1999).

Several methods exist for predicting protein structure; they identify interaction domains by analysing the hydrophobicity, solvation, protrusion and accessibility of residues (Young et al., 1994; Jones & Thornton, 1997a,b). Those approaches are interesting but cannot meet the needs of the many biochemists who have sequences but no structural data. Indeed, despite the number of protein structures already solved, the bank of structures remains very small compared with the sequence databases.

To our knowledge, very few methods use sequences as their starting point. The algorithm of Kini & Evans (1995) rests on the observation that proline residues frequently occur near interaction sites, at a frequency 2.5 times higher than expected from a random distribution. They suggest that “proline-brackets” flank a large number of protein-protein interaction sites (Kini & Evans, 1996). Another method uses multiple sequence alignments and focuses on correlated mutations to detect protein interaction sites (Pazos et al., 1997). The hypothesis is that residues close to protein-protein interaction sites tend to mutate simultaneously during evolution. Therefore, from multiple sequence alignments, the authors detect the residues linking different protein domains and interacting in heterodimer complexes.

In a recent analysis, Marcotte et al. (1999) report that they can predict which proteins interact by analysing genome sequences. The hypothesis is that two proteins interact if, in another living organism, they are assembled as a single protein. The procedure is also very powerful for predicting the functions of large protein complexes if one can trace domain homologies. However, the procedure gives no information on the interacting amino acids per se.

Here, we test a fast and simple method to predict stretches of protein interaction sites from sequences in the absence of any structural data. Eisenberg et al. (1982) previously showed that plotting the mean alpha-helical hydrophobic moment 〈μH〉 versus the mean hydrophobicity 〈H〉 allows protein fragments to be classified according to their location in the structures: they are membrane segments, parts of globular domains or surface-seeking helices. The authors demonstrated that a high level of hydrophobicity, together with a low hydrophobic moment, indicates that the fragment is membranous, whereas residues from surface-seeking helices cover a wide diagonal area beginning at the upper left of the plot (Figure 1). The diagram was thus divided into four regions corresponding to globular, surface and membrane (monomeric and multimeric) domains, called G, S and M, respectively (Eisenberg et al., 1984; see Figure 1(a)). Here, a fifth domain, the “receptor-binding domain” (RBD), is investigated, in which we detect some residues of protein interaction sites. The RBD method is described and applied to different sequence databases. Results show that the plot drawn with Eisenberg’s method detects most of the experimentally known interaction sites. The effects of several parameters of the procedure were tested. The structures, the accessibility and the functional characterisation of predicted sites were also investigated on a few 3D structures. The results obtained with the DNA-binding and the calcium-binding sequences and with the 3D structures, such as the ultrabithorax-extradenticle-DNA complex and the calcium-binding protein, demonstrate that our procedure can detect various types of interaction sites as long as they involve hydrophilic residues. Finally, we demonstrate that the RBD analysis could be valuable in identifying mutations. Two examples, in the Mason-Pfizer monkey virus Gag protein and in a penicillin-binding protein, are shown.

Apolipoprotein E and Newcastle disease virus fusion protein analysis

In the analysis of the apolipoprotein E sequence, De Loof et al. (1986) extended the concept previously suggested by Eisenberg by considering an additional region of the hydrophobicity/hydrophobic moment plot that they called “receptor-binding-domain” (RBD). The RBD method is thus based on the calculation of the mean hydrophobic moment 〈μH〉 and the mean hydrophobicity 〈H〉 of an N-residue window (N being odd) centred at the amino acid of interest. The δ angle is 100° to correspond to the
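The window calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Eisenberg normalized consensus hydrophobicity scale is assumed (the excerpt does not say which scale was used), the default window of 11 residues is a hypothetical choice, and the function name `window_profile` is ours. The sketch only computes the two coordinates 〈H〉 and 〈μH〉 for each window; the boundaries of the G, S, M and RBD regions of the plot are not given in this excerpt and are therefore not reproduced.

```python
import math

# Eisenberg normalized consensus hydrophobicity scale.
# ASSUMPTION: the excerpt does not state which scale the authors used.
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def window_profile(sequence, window=11, delta_deg=100.0, scale=EISENBERG):
    """Return (position, <H>, <muH>) for each residue that has a full window.

    window    -- odd number of residues N centred on the residue of interest
                 (the default of 11 is a hypothetical choice)
    delta_deg -- angle between successive side chains; 100 degrees for a helix
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    delta = math.radians(delta_deg)
    values = [scale[aa] for aa in sequence.upper()]
    profile = []
    for centre in range(half, len(values) - half):
        segment = values[centre - half:centre + half + 1]
        # Mean hydrophobicity of the window.
        mean_h = sum(segment) / window
        # Hydrophobic moment: vector sum of the hydrophobicities, with
        # residue k placed at angle k * delta around the helix axis.
        sin_sum = sum(h * math.sin(k * delta) for k, h in enumerate(segment))
        cos_sum = sum(h * math.cos(k * delta) for k, h in enumerate(segment))
        mean_mu = math.hypot(sin_sum, cos_sum) / window
        profile.append((centre + 1, mean_h, mean_mu))  # 1-based position
    return profile


if __name__ == "__main__":
    # Arbitrary illustrative sequence, not taken from the paper.
    for pos, mean_h, mean_mu in window_profile("MKTAYIAKQRQISFVKSHFSRQ"):
        print(pos, round(mean_h, 3), round(mean_mu, 3))
```

Each returned triple gives one point of the hydrophobic moment plot; classifying a residue then amounts to testing which region of the plot contains its point.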

Acknowledgements

The authors wish to thank J.M. Ghuysen and M. Nguyen-Distèche for their contribution and discussion during the analysis of PBP3. We are also grateful to A. Burny, F. Bex and S. Arnould for their constructive discussion about the M-PMV Gag protein and to M.R. Conte for kindly providing the atomic coordinates of the structure. We thank R.M. Kini for access to his database of known interaction sites. X.G. is supported by the Interuniversity Poles of Attraction Programme-Belgian State,

¹ Edited by B. Holland
