Machine learning pipeline to analyze clinical and proteomics data: experiences on a prostate cancer case.

Keywords: Biological pipeline; Data enhancing; Machine learning; Prostate cancer

Journal

BMC medical informatics and decision making
ISSN: 1472-6947
Abbreviated title: BMC Med Inform Decis Mak
Country: England
NLM ID: 101088682

Publication information

Publication date:
08 Apr 2024
History:
received: 07 01 2024
accepted: 25 03 2024
medline: 8 4 2024
pubmed: 8 4 2024
entrez: 7 4 2024
Status: epublish

Abstract

Proteomics-based analysis is used to identify biomarkers in blood samples and tissues. Data produced by instruments such as mass spectrometers require dedicated platforms to identify and quantify proteins (or peptides). Clinical information can be related to mass spectrometry data to identify diseases at an early stage, and machine learning techniques can support physicians and biologists in studying and classifying pathologies.

We present the application of machine learning techniques to define a pipeline for studying and classifying proteomics data enriched with clinical information. The pipeline allows users to relate established blood biomarkers to clinical parameters and proteomics data. It entails three main phases: (i) feature selection, (ii) model training, and (iii) model ensembling. The pipeline receives input data from blood analytes, tissue samples, proteomic analysis, and urine biomarkers. It then trains different models for feature selection, classification, and voting.

We report the experience of applying the pipeline to prostate-related diseases. Models were trained on several biological datasets. We report experimental results on two datasets that integrate clinical and mass spectrometry-based data, one from serum analysis and one from urine analysis. Both were obtained in a two-year research project that aimed to extract hidden information from mass spectrometry, serum, and urine samples from hundreds of patients. The serum dataset has 143 samples, including 79 PCa and 84 BPH patients, and the urine dataset has 121 samples, including 67 PCa and 54 BPH patients. The pipeline identified peptides of interest in both datasets: 6 in the serum dataset and 2 in the urine dataset.

The best models for the serum dataset (AUC=0.87, Accuracy=0.83, F1=0.81, Sensitivity=0.84, Specificity=0.81) and the urine dataset (AUC=0.88, Accuracy=0.83, F1=0.83, Sensitivity=0.85, Specificity=0.80) both showed good predictive performance. We made the pipeline code available on GitHub, and we are confident that it can be adopted in similar clinical setups.
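The three phases named in the abstract (feature selection, model training, model ensembling) and the five reported metrics can be illustrated with a minimal scikit-learn sketch. This is not the authors' published code: the abstract does not name the selector, the classifiers, or the voting scheme, so the univariate F-test selector, the three classifiers, soft voting, the function names `build_pipeline` and `evaluate`, and the synthetic data below are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_pipeline(k_features=10, seed=0):
    """Hypothetical three-phase pipeline: select features, train models, vote."""
    # Phase (ii): candidate classifiers (assumed; not stated in the abstract).
    voters = [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=200, random_state=seed)),
        ("svm", SVC(probability=True, random_state=seed)),
    ]
    return Pipeline([
        ("scale", StandardScaler()),
        # Phase (i): keep the k features with the highest ANOVA F-score.
        ("select", SelectKBest(score_func=f_classif, k=k_features)),
        # Phase (iii): average the predicted class probabilities (soft voting).
        ("vote", VotingClassifier(estimators=voters, voting="soft")),
    ])


def evaluate(model, X_test, y_test):
    """Compute the five metrics reported in the abstract for a binary task."""
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_test, y_prob),
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Synthetic stand-in for a table of clinical and peptide features
# (label 1 could play the role of PCa, label 0 the role of BPH).
X, y = make_classification(n_samples=150, n_features=40, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = build_pipeline().fit(X_train, y_train)
metrics = evaluate(model, X_test, y_test)
# Column indices of the features retained by the selection phase.
selected = model.named_steps["select"].get_support(indices=True)
```

Placing the selector inside the `Pipeline` means it is fitted only on the training split, which avoids leaking test-set information into the choice of features. The indices in `selected` correspond to the retained columns, in the same way that the reported pipeline singles out a small number of peptides per dataset.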

Identifiers

pubmed: 38584282
doi: 10.1186/s12911-024-02491-6
pii: 10.1186/s12911-024-02491-6

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

93

Copyright information

© 2024. The Author(s).

Authors

Patrizia Vizza (P)

Department of Surgical and Medical Sciences, Magna Græcia University, 88100, Catanzaro, Italy.

Federica Aracri (F)

Department of Surgical and Medical Sciences, Magna Græcia University, 88100, Catanzaro, Italy. federica.aracri@unicz.it.

Pietro Hiram Guzzi (PH)

Department of Surgical and Medical Sciences, Magna Græcia University, 88100, Catanzaro, Italy.

Marco Gaspari (M)

Department of Experimental and Clinical Medicine, Magna Græcia University, 88100, Catanzaro, Italy.

Pierangelo Veltri (P)

Department of Computers, Modeling, Electronics and Systems Engineering, University of Calabria, 87036, Rende, Italy.

Giuseppe Tradigo (G)

Department of Theoretical and Applied Sciences, eCampus University, 22060, Novedrate, CO, Italy.

MeSH classifications