Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making.

Keywords: GPT; HSC transplantation; artificial intelligence; interrater agreement; transplant

Journal

British journal of haematology
ISSN: 1365-2141
Abbreviated title: Br J Haematol
Country: England
NLM ID: 0372544

Publication information

Publication date:
09 Dec 2023
History:
revised: 14 Oct 2023
received: 29 Aug 2023
accepted: 31 Oct 2023
medline: 10 Dec 2023
pubmed: 10 Dec 2023
entrez: 09 Dec 2023
Status: aheadofprint

Abstract

In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. We evaluated Generative Pre-trained Transformer 4 (GPT-4) as well as two other artificial intelligence models, PaLM 2 and Llama-2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs with haematology residents. Residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and limitations in deciphering complex haematological clinical scenarios.

Identifiers

pubmed: 38070128
doi: 10.1111/bjh.19200

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

© 2023 The Authors. British Journal of Haematology published by British Society for Haematology and John Wiley & Sons Ltd.

Authors

Ivan Civettini (I)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Arianna Zappaterra (A)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, ASST Grande Ospedale Metropolitano Niguarda, Milan, Italy.

Bianca Maria Granelli (BM)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Giovanni Rindone (G)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Andrea Aroldi (A)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Stefano Bonfanti (S)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Federica Colombo (F)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Marilena Fedele (M)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Giovanni Grillo (G)

Department of Haematology and Bone Marrow Transplantation Unit, ASST Grande Ospedale Metropolitano Niguarda, Milan, Italy.

Matteo Parma (M)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Paola Perfetti (P)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Elisabetta Terruzzi (E)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Carlo Gambacorti-Passerini (C)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.
Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Daniele Ramazzotti (D)

Department of Medicine and Surgery, University of Milano-Bicocca, Monza, Italy.

Fabrizio Cavalca (F)

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

MeSH classifications