Identifying sequential differences between protein structural classes using network and statistical approaches
Abstract
Protein sequences are believed to encode information about their structures. To clarify the relationship between protein sequence and structure, this study examines the dynamic interactions among protein sequence features and identifies sequence-level differences between protein structural classes. We analyze protein sequence data from all structural classes in CATH and SCOP, structurally disordered proteins from DisProt, and structural motifs from PROSITE. Betweenness and closeness centrality measures are used to capture the topology of the networks constructed from amino acid feature interactions, and statistical tests are used to compare the feature series distributions. Key findings suggest that, in all structural classes, the following feature pairs interact significantly: Ala and the α-helix and bend preference property, Ala and side-chain size, Ala and Gly, and Met and Leu. The features for Leu, Val, and Asn act as critical sources of feature interactions, whereas those for Cys, His, Trp, and Met exhibit weak intra-type interactions with other features. This implies that these feature interactions may have little impact on encoding the structural differences. For α structures, Glu, Pro, and the side-chain size and hydrophobicity properties are highly important in feature interactions, whereas for β structures, Gly, Thr, and physical properties such as α-helix and bend preference, extended structure preference, pK-C value, and surrounding hydrophobicity are especially important.
In both α and β structures, Ser is a common source of feature interactions. Mixed α and β structures share characteristics with both the α and β types and additionally show preferred interactions between Met, Lys, and the double-bend preference property, and between the sequence arrangements of Cys, His, Met, and Tyr and the amino acid composition features. Intrinsically disordered proteins (IDPs) show a high frequency of repeat patterns of certain amino acids, and the different structural motifs also show distinctive characteristics. Further sequence-level differences between the structures can be identified from K-mer statistics and feature series distributions. These findings shed light on how amino acid features interact and indicate that these interactions are important in encoding the different types of protein structure. The results may contribute to the future molecular design of protein-based vaccines and drugs and may inform the development of new protein structural classifiers.
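To make the two centrality measures and the K-mer statistics named above concrete, the following is a minimal sketch in standard-library Python. It is not the authors' implementation. It assumes an undirected, unweighted network, whereas the study's feature networks may be directed or weighted. The node labels `f1` to `f5`, the edges of `toy_graph`, and all function names are hypothetical and chosen only for illustration; they do not represent any result from the study.

```python
from collections import Counter, deque


def bfs_shortest_paths(graph, source):
    """BFS from source: distances, shortest-path counts, predecessors, visit order."""
    dist = {source: 0}
    sigma = {v: 0 for v in graph}   # number of shortest paths from source
    sigma[source] = 1
    preds = {v: [] for v in graph}  # predecessors on shortest paths
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return dist, sigma, preds, order


def closeness_centrality(graph):
    """Closeness = (number of reachable nodes) / (sum of shortest-path distances)."""
    result = {}
    for v in graph:
        dist, _, _, _ = bfs_shortest_paths(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total > 0 else 0.0
    return result


def betweenness_centrality(graph):
    """Unnormalized betweenness (Brandes' algorithm) for an undirected graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        _, sigma, preds, order = bfs_shortest_paths(graph, s)
        delta = {v: 0.0 for v in graph}
        # Accumulate dependencies from the farthest nodes back toward s.
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered node pair was counted twice (once per endpoint as source).
    return {v: c / 2.0 for v, c in score.items()}


def kmer_counts(sequence, k):
    """Count all overlapping substrings of length k in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy network (adjacency lists); not data from the study.
toy_graph = {
    "f1": ["f2"],
    "f2": ["f1", "f3"],
    "f3": ["f2", "f4", "f5"],
    "f4": ["f3"],
    "f5": ["f3"],
}

betweenness = betweenness_centrality(toy_graph)
closeness = closeness_centrality(toy_graph)
dimer_counts = kmer_counts("AAGAAG", 2)
```

In this toy network, `f3` lies on the most shortest paths and is also closest on average to the other nodes, so it scores highest on both measures. A node that is central in this sense is the kind of feature that the abstract describes as highly important in feature interactions.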
Copyright (c) 2024 Xiaogeng Wan, Xinying Tan
This work is licensed under a Creative Commons Attribution 4.0 International License.