ChatGPT 4.0 Falls Short in Diagnosing Orbital Floor Fractures
Key Highlights
- ChatGPT 4.0 showed statistically significant differences compared with an oral and maxillofacial surgeon when diagnosing orbital floor fractures.
- AI identified orbital floor fractures in only 19 of 30 cases (63.3%), a performance that was neither reliable nor accurate.
- Findings suggest orbital floor fracture diagnosis should remain surgeon-driven, with AI results requiring clinical verification.
In a retrospective cohort study presented at the 107th American Association of Oral and Maxillofacial Surgeons Annual Meeting, Scientific Sessions and Exhibition, researchers evaluated whether ChatGPT 4.0 could serve as an accurate and reliable diagnostic tool for orbital floor fractures. Despite its potential, results showed that the artificial intelligence (AI) system’s performance differed significantly from that of a specialist surgeon, arguing against its immediate clinical utility.
The emergence of large language models, including ChatGPT, has created excitement about their application in radiology and surgical fields due to their pattern-recognition capabilities. However, there remains a lack of robust evidence supporting their use in clinical diagnosis. Oral and maxillofacial surgeons are particularly interested in AI’s potential to improve diagnostic efficiency in acute trauma settings, prompting this investigation.
The study was performed on 30 cases of orbital floor fractures selected from a trauma database. Computed tomography scans (sagittal and coronal views) were reviewed, and ChatGPT’s diagnostic responses were compared with those of an oral and maxillofacial surgeon, who served as the gold standard. Descriptive analysis and a Student’s t-test were applied to determine statistical significance between groups.
The study population had a mean age of 68.4 years, with 66.7% male and 33.3% female patients. ChatGPT identified orbital floor fractures in 19 of 30 cases (63.3%) and reported no fracture in 11 of 30 cases (36.7%). Statistical analysis demonstrated a significant difference between AI and surgeon responses (P = .03). These findings indicate that ChatGPT failed to identify more than one-third of confirmed fractures, a miss rate inconsistent with reliable diagnostic performance.
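The reported figures can be reproduced in a short sketch. This is a hypothetical reconstruction, not the authors' analysis code: the per-case data are invented to match the reported counts (the surgeon gold standard confirmed a fracture in all 30 cases), and the paired Student's t-test shown here is one plausible reading of the comparison the study describes.

```python
# Illustrative reconstruction of the study's comparison (assumed data,
# built only to match the reported counts of 19 hits in 30 cases).
import math
from statistics import mean, stdev

surgeon = [1] * 30              # gold standard: fracture present in every case
chatgpt = [1] * 19 + [0] * 11   # AI called a fracture in 19 of 30 cases

# Agreement with the gold standard (diagnostic accuracy in this cohort).
accuracy = sum(a == s for a, s in zip(chatgpt, surgeon)) / len(surgeon)

# Paired Student's t-test on per-case differences (AI minus surgeon).
diffs = [a - s for a, s in zip(chatgpt, surgeon)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))

print(f"accuracy = {accuracy:.1%}, t = {t_stat:.2f}")
```

The negative t statistic reflects that ChatGPT under-called fractures relative to the surgeon; the exact P value depends on the test the authors actually ran, which the abstract does not fully specify.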
“Orbital floor fractures still require diagnosis by an oral and maxillofacial surgeon, ensuring that surgeons can verify AI-generated responses before using the tool in clinical settings,” the study authors concluded. “Further studies are needed to challenge the clinical applications of this software, particularly given its increasing use among oral and maxillofacial surgeons.”
Reference:
Ferrer JC, Caicedo AJH, Peña-Ruiz AO, Bermudez F. ChatGPT(4.0) has the capacity to diagnose orbital floor fractures: Fiction or reality? Presented at: American Association of Oral and Maxillofacial Surgeons Annual Meeting; September 15-20, 2025; Washington, DC. https://aaoms-annual-meeting-2025.eventscribe.net/.
