Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making.
GPT
HSC transplantation
artificial intelligence
interrater agreement
transplant
Journal
British journal of haematology
ISSN: 1365-2141
Abbreviated title: Br J Haematol
Country: England
NLM ID: 0372544
Publication information
Publication date:
09 Dec 2023
History:
revised: 14 Oct 2023
received: 29 Aug 2023
accepted: 31 Oct 2023
medline: 10 Dec 2023
pubmed: 10 Dec 2023
entrez: 09 Dec 2023
Status:
aheadofprint
Abstract
In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. The evaluation covered not only Generative Pre-trained Transformer 4 (GPT-4) but also other artificial intelligence models, PaLM 2 and Llama-2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs with haematology residents. We found that residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and their limitations in deciphering complex haematological clinical scenarios.
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© 2023 The Authors. British Journal of Haematology published by British Society for Haematology and John Wiley & Sons Ltd.
References
Thirunavukarasu AJ, Hassan R, Mahmood S, Sanghera R, Barzangi K, El Mukashfi M, et al. Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ. 2023;9:e46599.
Hoch CC, Wollenberg B, Lüers J-C, Knoedler S, Knoedler L, Frank K, et al. ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Otorhinolaryngol. 2023;280(9):4271-4278.
Jin D, Pan E, Oufattole N, Weng W-H, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. arXiv. 2020.
Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. arXiv. 2019.
Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al. Measuring massive multitask language understanding. arXiv. 2020.
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172-180.
Carreras E, Dufour C, Mohty M, Kröger N, editors. The EBMT handbook: hematopoietic stem cell transplantation and cellular therapies. 7th ed. Cham (CH): Springer; 2019.
Terwey TH, Hemmati PG, Martus P, Dietz E, Vuong LG, Massenkeil G, et al. A modified EBMT risk score and the hematopoietic cell transplantation-specific comorbidity index for pre-transplant risk assessment in adult acute lymphoblastic leukemia. Haematologica. 2010;95(5):810-818.
Sorror ML. Comorbidities and hematopoietic cell transplantation outcomes. Hematology. 2010;2010(1):237-247.
Parimon T, Au DH, Martin PJ, Chien JW. A risk score for mortality after allogeneic hematopoietic cell transplantation. Ann Intern Med. 2006;144(6):407-414.
Available from: https://openai.com/. Accessed 13 Oct 2023.
Available from: https://cloud.google.com/vertex-ai/docs/generative-ai/start/quickstarts/api-quickstart. Accessed 13 Oct 2023.
Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. arXiv, 2307.09288. https://doi.org/10.48550/arXiv.2307.09288
Mahan D, Carlow R, Castricato L, Cooper N, Laforte C. Stable Beluga models. Available from: https://huggingface.co/stabilityai/StableBeluga2. Accessed 13 Oct 2023.
Available from: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML. Accessed 13 Oct 2023.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37-46.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378-382. https://doi.org/10.1037/h0031619
Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther. 2005;85(3):257-268.
Haemmerli J, Sveikata L, Nouri A, May A, Egervari K, Freyschlag C, et al. ChatGPT in glioma adjuvant therapy decision making: ready to assume the role of a doctor in the tumour board? BMJ Health Care Inform. 2023;30(1):e100775.
Available from: https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9870847. Accessed 13 Oct 2023.