Can Large Language Models Deliver Accurate and Readable Postoperative Instructions After Total Knee Arthroplasty?
Key Highlights
- GPT-4o and Claude achieved perfect accuracy scores.
- All models demonstrated strong consistency across responses.
- Gemini produced less readable text than other models.
- GPT-4o outperformed Gemini and Claude in ease-of-reading scores.
In a presentation at the ANESTHESIOLOGY annual meeting in San Antonio, TX, researchers evaluated the performance of four large language models (LLMs)—GPT-4o, Claude 3.7 Sonnet, DeepSeek R1, and Gemini 2.0 Flash—to determine which artificial intelligence (AI) tool generated the most accurate, relevant, and consistent postoperative care instructions for patients who underwent total knee arthroplasty (TKA) under general anesthesia.
In their study, Dhruv Nagesh, BS, and colleagues found that GPT-4o and Claude achieved perfect medical accuracy and relevance, while all models showed strong consistency. Significant differences were identified in readability, with GPT-4o scoring the highest in ease of understanding and Gemini producing the least readable content.
As AI becomes increasingly integrated into health care communication, ensuring the reliability of generated instructions is essential. The goal of the study was to determine whether these models could produce content aligned with established guidelines and tailored to TKA-specific postoperative needs. To perform this evaluation, researchers prompted each model to “generate detailed postoperative care instructions for a patient who has undergone general anesthesia for total knee arthroplasty, focusing on pain management, common side effects, activity restrictions, and mobilization.”
The outputs were evaluated on five criteria—medical accuracy, clarity, relevance, consistency, and readability—using a 3-point scale (0 = does not meet recommendations, 1 = partially meets, 2 = fully meets). Medical accuracy was assessed against Enhanced Recovery After Surgery (ERAS) Society recommendations, American Society of Anesthesiologists (ASA) practice guidelines, and UpToDate recommendations. Statistical analyses included the Kruskal-Wallis test for the ordinal rubric scores and t tests for the readability metrics.
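Because the rubric scores are ordinal (0/1/2) rather than continuous, a rank-based test such as Kruskal-Wallis is the natural choice for comparing the four models. As an illustration only (not the authors' code), a minimal stdlib-Python sketch of the Kruskal-Wallis H statistic:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic. Tie correction is omitted for brevity;
    it matters for a 3-point rubric with many ties (scipy.stats.kruskal
    applies it automatically)."""
    values = [v for g in groups for v in g]
    n = len(values)
    # Assign average ranks: tied values share the mean of their rank positions.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # H = 12 / (N(N+1)) * sum(R_g^2 / n_g) - 3(N+1)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 * total / (n * (n + 1)) - 3 * (n + 1)
```

Comparing, say, per-response accuracy scores across the four models would call `kruskal_h(gpt4o_scores, claude_scores, deepseek_scores, gemini_scores)`, with H referred to a chi-square distribution with k − 1 degrees of freedom.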
Claude, GPT-4o, and DeepSeek demonstrated superior accuracy compared with Gemini (P < .0001), with Claude and GPT-4o reaching perfect scores (2.0 ± 0.00). No significant differences in clarity were observed among the models (F = 3.17, P = .08).
Relevance scores were highest for Gemini, Claude, and GPT-4o (2.00 ± 0.00), while DeepSeek scored slightly lower (1.89 ± 0.19), a difference that did not reach the Bonferroni-corrected significance threshold (P > .0083). All models achieved perfect consistency (2.00 ± 0.00).
In terms of readability, significant differences were observed (F = 6.77, P = .009). Gemini produced text at a higher grade level (10.67 ± 0.40) than Claude (9.20 ± 0.69; P < .0001), GPT-4o (9.40 ± 0.20; P = .001), and DeepSeek (9.00 ± 0.55; P = .002). GPT-4o (44.80 ± 0.46) also achieved better ease-of-reading scores than Gemini (37.80 ± 4.05; P = .002) and Claude (38.10 ± 4.44; P = .001), while DeepSeek (43.93 ± 4.12) outperformed Gemini (P = .004).
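The abstract does not name the readability formulas used, but the reported ranges (grade levels near 9–11, ease-of-reading scores in the high 30s to mid 40s) are consistent with the Flesch-Kincaid grade level and Flesch Reading Ease metrics. A minimal sketch under that assumption, with a rough vowel-group heuristic standing in for true syllable counting:

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic: count contiguous vowel groups; drop one for a silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw   # higher = easier to read
    grade = 0.39 * wps + 11.8 * spw - 15.59     # approximate US grade level
    return round(ease, 2), round(grade, 2)
```

On both scales the ordering reported above is intuitive: longer sentences and more polysyllabic words lower the ease score and raise the grade level, so Gemini's grade-10.7 output reads hardest while GPT-4o's mid-40s ease score reads easiest.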
“LLMs can generate medically accurate and relevant postoperative instructions with high consistency, underscoring their potential as supplementary tools in anesthesiology patient education,” the researchers concluded. “Prioritizing clarity and optimizing readability, potentially through targeted prompts specifying grade levels, could enhance clinical utility. As LLMs evolve, understanding their integration of anesthesia-specific care protocols will be imperative to improve patient outcomes and education.”
Reference:
Nagesh D, Keating D, Divakaruni R, Beutel B. Evaluating large language models (GPT-4, Claude, DeepSeek, and Bard) in anesthesia-specific post-operative care: instructions for total knee arthroplasty (TKA). Presented at: ANESTHESIOLOGY annual meeting; 2025; San Antonio, TX. Available at: https://www.asahq.org/annualmeeting/attend.
