Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making.

Keywords: GPT; HSC transplantation; artificial intelligence; interrater agreement; transplant

Journal

British journal of haematology

ISSN: 1365-2141

Abbreviated title: Br J Haematol

Country: England

NLM ID: 0372544

Publication information

Publication date:
09 Dec 2023

History:

received: 29 Aug 2023

revised: 14 Oct 2023

accepted: 31 Oct 2023

medline: 10 Dec 2023

pubmed: 10 Dec 2023

entrez: 9 Dec 2023

Status: ahead of print

Abstract

In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. We evaluated Generative Pre-trained Transformer 4 (GPT-4) as well as two other artificial intelligence models, PaLM 2 and Llama-2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs with haematology residents. Residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and their limitations in deciphering complex haematological clinical scenarios.

Identifiers

DOI: 10.1111/bjh.19200

PMID: 38070128

Publication types

Journal Article

Languages

English

Authors

Ivan Civettini (I)

Arianna Zappaterra (A)

Bianca Maria Granelli (BM)

Giovanni Rindone (G)

Andrea Aroldi (A)

Stefano Bonfanti (S)

Federica Colombo (F)

Marilena Fedele (M)

Giovanni Grillo (G)

Matteo Parma (M)

Paola Perfetti (P)

Elisabetta Terruzzi (E)

Carlo Gambacorti-Passerini (C)

Daniele Ramazzotti (D)

Fabrizio Cavalca (F)
