JAMA Network, October 15, 2024
Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review
Key Points
Question
How are health care applications of large language models (LLMs) currently evaluated?
Findings
In this systematic review of 519 studies published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. Administrative tasks (such as writing prescriptions) and natural language processing and understanding tasks (such as summarization) were understudied. Accuracy was the predominant dimension of evaluation, while fairness, bias, and toxicity were assessed far less often.
Meaning
Results of this systematic review suggest that current evaluations of LLMs in health care are fragmented and insufficient, and that evaluations need to use real patient care data, quantify biases, cover a wider range of tasks and specialties, and...