Acta Med 2026; 24 (2)
Comparative performance between artificial intelligence models and medical residents on an ABIM-style clinical exam
Nieves PCA, Rodríguez WFL, Molina OMC, Núñez HJC, Rivera TA, Rojas MA, Corona DA, Díaz GEJ
Language: Spanish
References: 16
Page: 118-124
PDF size: 334.25 Kb.
ABSTRACT
This study evaluated the academic performance of four artificial intelligence language models (ChatGPT-4, Gemini 2.5, Claude 3.7, and DeepSeek R1) and Internal Medicine residents in solving an ABIM-style clinical examination. Mean accuracy rates were compared across groups, and within-group consistency was also assessed. Gemini 2.5 achieved the highest score (98.3%, SD = 1.76), followed by Claude 3.7 (93.3%, SD = 2.11), ChatGPT-4 (92.7%, SD = 2.00), and DeepSeek R1 (90.7%, SD = 3.06). In contrast, residents achieved a significantly lower mean score (60.4%, SD = 12.04). All AI models significantly outperformed residents; Gemini 2.5 also showed statistically significant differences compared with the other AI models. The lower standard deviations observed among AI models indicate greater response consistency relative to the wide variability in the human group.
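The between-group differences reported above can be checked from the summary statistics alone, for example with Welch's unequal-variance t-test. The sketch below is illustrative only: the abstract does not report group sizes, so n = 30 per group is an assumed placeholder, and the means and SDs are those quoted for Gemini 2.5 and the residents.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and degrees of freedom from summary statistics."""
    se1, se2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Reported accuracy (%) and SD; group sizes are NOT given in the abstract,
# so n = 30 per group is a placeholder assumption.
gemini = (98.3, 1.76, 30)
residents = (60.4, 12.04, 30)

t, df = welch_t(*gemini, *residents)
print(f"t = {t:.1f}, df = {df:.1f}")  # t far exceeds conventional critical values
```

Even under this assumed n, the statistic is so large that the qualitative conclusion (AI models outperform residents) is insensitive to the exact group sizes; the large resident SD also drives the Welch degrees of freedom toward the resident group's n.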
REFERENCES
Kaul V, Enslin S, Gross SA. History of artificial intelligence in medicine. Gastrointest Endosc. 2020; 92: 807-812.
Hirani R, Noruzi K, Khuram H, Hussaini AS, Aifuwa EI, Ely KE et al. Artificial intelligence and healthcare: a journey through history, present innovations, and future possibilities. Life (Basel). 2024; 14 (5): 557.
Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV et al. A review of the role of artificial intelligence in healthcare. J Pers Med. 2023; 13 (6): 951.
Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017; 2: 230-243.
Khan B, Fatima H, Qureshi A, Kumar S, Hanan A, Hussain J et al. Drawbacks of artificial intelligence and their potential solutions in the healthcare sector. Biomed Mater Devices. 2023: 1-8.
Chakraborty C, Bhattacharya M, Pal S, Lee SS. From machine learning to deep learning: Advances of the recent data-driven paradigm shift in medicine and healthcare. Curr Res Biotechnol. 2024; 7: 100164.
Katz U, Cohen E, Shachar E, Somer J, Fink A, Morse E et al. GPT versus resident physicians — a benchmark based on official board scores. NEJM AI. 2024; 1 (5): AIdbp2300192.
Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiala J, Wojciechowska K et al. ChatGPT-3.5 passes Poland's medical final examination: is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024; 12: 20503121241257777.
Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023; 13 (6): 1460-1487.
Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med. 2024; 99 (2): 192-197.
Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. 2024; 10: e50965.
Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023; 13 (1): 16492.
OpenAI. Introducing OpenAI o1 [Internet]. 2024. Available at: https://openai.com/index/introducing-openai-o1-preview/
OpenAI. Learning to reason with LLMs [Internet]. 2024. Available at: https://openai.com/index/learning-to-reason-with-llms/
Wu S, Koo M, Blum L, Black A, Kao L, Fei Z et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024; 1 (2): AIdbp2300092.
Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E et al. Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese medical licensing examination: A comparative study [Internet]. medRxiv; 2024.