Acta Med 2026; 24 (2)
Comparative performance between artificial intelligence models and medical residents on an ABIM-style clinical exam
Nieves PCA, Rodríguez WFL, Molina OMC, Núñez HJC, Rivera TA, Rojas MA, Corona DA, Díaz GEJ
Language: Spanish
References: 16
Page: 118-124
PDF size: 330.29 Kb.
ABSTRACT
This study evaluated the academic performance of four
artificial intelligence language models (ChatGPT-4, Gemini
2.5, Claude 3.7, and DeepSeek R1) and Internal Medicine
residents in solving an ABIM-style clinical examination. Mean
accuracy rates were compared across groups, and within-group
consistency was also assessed. Gemini 2.5 achieved
the highest score (98.3%, SD = 1.76), followed by Claude
3.7 (93.3%, SD = 2.11), ChatGPT-4 (92.7%, SD = 2.00), and
DeepSeek R1 (90.7%, SD = 3.06). In contrast, residents
achieved a significantly lower mean score (60.4%, SD = 12.04).
All AI models significantly outperformed residents; Gemini 2.5
also showed statistically significant differences compared with
the other AI models. The lower standard deviations observed
among the AI models indicate greater response consistency relative
to the wide variability observed in the human group.
REFERENCES
Kaul V, Enslin S, Gross SA. History of artificial intelligence in medicine. Gastrointest Endosc. 2020; 92: 807-812.
Hirani R, Noruzi K, Khuram H, Hussaini AS, Aifuwa EI, Ely KE et al. Artificial intelligence and healthcare: a journey through history, present innovations, and future possibilities. Life (Basel). 2024; 14 (5): 557.
Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV et al. A review of the role of artificial intelligence in healthcare. J Pers Med. 2023; 13 (6): 951.
Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol. 2017; 2: 230-243.
Khan B, Fatima H, Qureshi A, Kumar S, Hanan A, Hussain J et al. Drawbacks of artificial intelligence and their potential solutions in the healthcare sector. Biomed Mater Devices. 2023: 1-8.
Chakraborty C, Bhattacharya M, Pal S, Lee SS. From machine learning to deep learning: advances of the recent data-driven paradigm shift in medicine and healthcare. Curr Res Biotechnol. 2024; 7: 100164.
Katz U, Cohen E, Shachar E, Somer J, Fink A, Morse E et al. GPT versus resident physicians - a benchmark based on official board scores. NEJM AI. 2024; 1 (5): AIdbp2300192.
Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiala J, Wojciechowska K et al. ChatGPT-3.5 passes Poland's medical final examination - is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024; 12: 20503121241257777.
Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023; 13 (6): 1460-1487.
Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med. 2024; 99 (2): 192-197.
Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. 2024; 10: e50965.
Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023; 13 (1): 16492.
OpenAI. Introducing OpenAI o1 [Internet]. 2024. Available from: https://openai.com/index/introducing-openai-o1-preview/
OpenAI. Learning to reason with LLMs [Internet]. 2024. Available from: https://openai.com/index/learning-to-reason-with-llms/
Wu S, Koo M, Blum L, Black A, Kao L, Fei Z et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024; 1 (2): AIdbp2300092.
Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E et al. Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on the Japanese medical licensing examination: a comparative study [Internet]. medRxiv; 2024.