Can Large Language Models Deliver Accurate and Readable Postoperative Instructions After Total Knee Arthroplasty?
Key Highlights
- GPT-4o and Claude achieved perfect accuracy scores.
- All models demonstrated strong consistency across responses.
- Gemini produced less readable text than other models.
- GPT-4o outperformed Gemini and Claude in ease-of-reading scores.
In a presentation at the ANESTHESIOLOGY annual meeting in San Antonio, TX, researchers evaluated the performance of four large language models (LLMs)—GPT-4o, Claude 3.7 Sonnet, DeepSeek R1, and Gemini 2.0 Flash—to determine which artificial intelligence (AI) tool generated the most accurate, relevant, and consistent postoperative care instructions for patients who underwent total knee arthroplasty (TKA) under general anesthesia.
In their study, Dhruv Nagesh, BS, and colleagues found that GPT-4o and Claude achieved perfect medical accuracy and relevance, while all models showed strong consistency. Significant differences were identified in readability, with GPT-4o scoring the highest in ease of understanding and Gemini producing the least readable content.
As AI becomes increasingly integrated into health care communication, ensuring the reliability of generated instructions is essential. The goal of the study was to determine whether these models could produce content aligned with established guidelines and tailored to TKA-specific postoperative needs. To perform this evaluation, researchers prompted each model to “generate detailed postoperative care instructions for a patient who has undergone general anesthesia for total knee arthroplasty, focusing on pain management, common side effects, activity restrictions, and mobilization.”
The outputs were evaluated on five criteria—medical accuracy, clarity, relevance, consistency, and readability—using a 3-point scale (0 = does not meet recommendations, 1 = partially meets, 2 = fully meets). Medical accuracy was assessed against Enhanced Recovery After Surgery (ERAS) Society recommendations, American Society of Anesthesiologists (ASA) practice guidelines, and UpToDate recommendations. Statistical analyses included the Kruskal-Wallis test for the ordinal rubric scores and t tests for the readability metrics.
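Because the rubric scores are ordinal (0/1/2) rather than continuous, a rank-based test such as Kruskal-Wallis is the natural choice for comparing the four models. As an illustration only (not the authors' code), a minimal stdlib-Python sketch of the Kruskal-Wallis H statistic:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic. Tie correction is omitted for brevity;
    it matters for a 3-point rubric with many ties (scipy.stats.kruskal
    applies it automatically)."""
    values = [v for g in groups for v in g]
    n = len(values)
    # Assign average ranks: tied values share the mean of their rank positions.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # H = 12 / (N(N+1)) * sum(R_g^2 / n_g) - 3(N+1)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 * total / (n * (n + 1)) - 3 * (n + 1)
```

Comparing, say, per-response accuracy scores across the four models would call `kruskal_h(gpt4o_scores, claude_scores, deepseek_scores, gemini_scores)`, with H referred to a chi-square distribution with k − 1 degrees of freedom.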
Claude, GPT-4o, and DeepSeek demonstrated superior accuracy compared with Gemini (P < .0001), with Claude and GPT-4o reaching perfect scores (2.0 ± 0.00). No significant differences in clarity were observed among the models (F = 3.17, P = .08).
Relevance scores were highest for Gemini, Claude, and GPT-4o (2.00 ± 0.00), while DeepSeek scored slightly lower (1.89 ± 0.19), a difference that did not reach the Bonferroni-corrected significance threshold (P > .0083). All models achieved perfect consistency (2.00 ± 0.00).
In terms of readability, significant differences were observed (F = 6.77, P = .009). Gemini produced text at a higher grade level (10.67 ± 0.40) than Claude (9.20 ± 0.69; P < .0001), GPT-4o (9.40 ± 0.20; P = .001), and DeepSeek (9.00 ± 0.55; P = .002). GPT-4o (44.80 ± 0.46) also achieved better ease-of-reading scores than Gemini (37.80 ± 4.05; P = .002) and Claude (38.10 ± 4.44; P = .001), while DeepSeek (43.93 ± 4.12) outperformed Gemini (P = .004).
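The abstract does not name the readability formulas used, but the reported ranges (grade levels near 9–11, ease-of-reading scores in the high 30s to mid 40s) are consistent with the Flesch-Kincaid grade level and Flesch Reading Ease metrics. A minimal sketch under that assumption, with a rough vowel-group heuristic standing in for true syllable counting:

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic: count contiguous vowel groups; drop one for a silent final 'e'.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw   # higher = easier to read
    grade = 0.39 * wps + 11.8 * spw - 15.59     # approximate US grade level
    return round(ease, 2), round(grade, 2)
```

On both scales the ordering reported above is intuitive: longer sentences and more polysyllabic words lower the ease score and raise the grade level, so Gemini's grade-10.7 output reads hardest while GPT-4o's mid-40s ease score reads easiest.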
“LLMs can generate medically accurate and relevant postoperative instructions with high consistency, underscoring their potential as supplementary tools in anesthesiology patient education,” the researchers concluded. “Prioritizing clarity and optimizing readability, potentially through targeted prompts specifying grade levels, could enhance clinical utility. As LLMs evolve, understanding their integration of anesthesia-specific care protocols will be imperative to improve patient outcomes and education.”
Reference:
Nagesh D, Keating D, Divakaruni R, Beutel B. Evaluating large language models (GPT-4, Claude, DeepSeek, and Bard) in anesthesia-specific post-operative care: instructions for total knee arthroplasty (TKA). Presented at: ANESTHESIOLOGY annual meeting; 2025; San Antonio, TX. Available at: https://www.asahq.org/annualmeeting/attend.
