Evaluating large language models on medical evidence summarization.


Journal

NPJ digital medicine
ISSN: 2398-6352
Abbreviated title: NPJ Digit Med
Country: England
NLM ID: 101731738

Publication information

Publication date:
24 Aug 2023
History:
received: 25 05 2023
accepted: 03 08 2023
medline: 25 8 2023
pubmed: 25 8 2023
entrez: 24 8 2023
Status: epublish

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.

Identifiers

pubmed: 37620423
doi: 10.1038/s41746-023-00896-7
pii: 10.1038/s41746-023-00896-7
pmc: PMC10449915

Publication types

Journal Article

Languages

eng

Pagination

158

Grants

Agency: NLM NIH HHS
ID: R01 LM014306
Country: United States
Agency: NLM NIH HHS
ID: R00 LM013001
Country: United States
Agency: NLM NIH HHS
ID: R01 LM009886
Country: United States
Agency: NCATS NIH HHS
ID: KL2 TR001874
Country: United States
Agency: NCI NIH HHS
ID: P30 CA013696
Country: United States

Comments and corrections

Type: UpdateOf

Copyright information

© 2023. Springer Nature Limited.

Authors

Liyan Tang (L)

School of Information, The University of Texas at Austin, Austin, TX, USA.

Zhaoyi Sun (Z)

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.

Betina Idnay (B)

Department of Biomedical Informatics, Columbia University, New York, NY, USA.

Jordan G Nestor (JG)

Department of Medicine, Columbia University, New York, NY, USA.

Ali Soroush (A)

Department of Medicine, Columbia University, New York, NY, USA.

Pierre A Elias (PA)

Department of Biomedical Informatics, Columbia University, New York, NY, USA.

Ziyang Xu (Z)

Department of Medicine, Massachusetts General Hospital, Boston, MA, USA.

Ying Ding (Y)

School of Information, The University of Texas at Austin, Austin, TX, USA.

Greg Durrett (G)

Department of Computer Science, The University of Texas at Austin, Austin, TX, USA.

Justin F Rousseau (JF)

Departments of Population Health and Neurology, Dell Medical School, The University of Texas at Austin, Austin, TX, USA. justin.rousseau@utsouthwestern.edu.
Department of Neurology, University of Texas Southwestern Medical Center, Dallas, TX, USA. justin.rousseau@utsouthwestern.edu.

Chunhua Weng (C)

Department of Biomedical Informatics, Columbia University, New York, NY, USA. cw2384@cumc.columbia.edu.

Yifan Peng (Y)

Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA. yip4002@med.cornell.edu.

MeSH classifications