A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides.
Journal
Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288
Publication information
Publication date: 10 December 2021
History:
received: 23 August 2021
accepted: 1 December 2021
entrez: 11 December 2021
pubmed: 12 December 2021
medline: 27 January 2022
Status:
epublish
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of applications in the food industry. As a result, computational models that can rapidly and accurately identify novel TPPs from a large number of uncharacterized protein sequences are desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with an accuracy of 0.865 and an MCC of 0.731. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
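The abstract describes scoring a sequence by combining g-gap dipeptide composition with dipeptide propensity scores. The following Python sketch illustrates that general idea only; it is not the authors' implementation. It assumes a simple estimate of the propensity scores (the difference in mean dipeptide composition between the two classes, rescaled to 0 to 1000) and omits any further score optimization. The function names, the toy sequences, and the threshold of 500 are all hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def gap_dipeptide_composition(seq, g=0):
    """Fraction of each residue pair separated by g intervening residues."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: c / total for dp, c in counts.items()}


def mean_composition(seqs, g=0):
    """Average g-gap dipeptide composition over a set of sequences."""
    comps = [gap_dipeptide_composition(s, g) for s in seqs]
    return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}


def estimate_propensity_scores(positives, negatives, g=0):
    """Assumed estimate: class difference in mean composition, rescaled to 0..1000."""
    pos = mean_composition(positives, g)
    neg = mean_composition(negatives, g)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all differences are equal
    return {dp: 1000.0 * (d - lo) / span for dp, d in diff.items()}


def scm_score(seq, scores, g=0):
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = gap_dipeptide_composition(seq, g)
    return sum(comp[dp] * scores[dp] for dp in DIPEPTIDES)


# Toy, made-up sequences used only to exercise the functions.
thermo = ["KEEKLKEIEELLKKAEE", "EEIKRKLEELVKEAIE"]
meso = ["QSNTQASGTSNQSTAQ", "TSQNGSAQTTSNQGSA"]

scores = estimate_propensity_scores(thermo, meso, g=0)
threshold = 500.0  # hypothetical cut-off; in practice it would be tuned on training data
query = "KELEEKIKRLEEEAKK"
is_predicted_tpp = scm_score(query, scores, g=0) > threshold
print(is_predicted_tpp)
```

Because each dipeptide carries a single score, the score table itself can be inspected to see which residue pairs push a sequence toward the thermophilic class, which is the kind of interpretability the abstract emphasizes.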
Identifiers
pubmed: 34893688
doi: 10.1038/s41598-021-03293-w
pii: 10.1038/s41598-021-03293-w
pmc: PMC8664844
Chemical substances
Amino Acids
0
Dipeptides
0
Heat-Shock Proteins
0
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
23782
Copyright information
© 2021. The Author(s).