
Acta Médica Grupo Angeles

Official Journal of the Hospital Angeles Health System

2026, Number 2


Acta Med 2026; 24 (2)

Comparative performance between artificial intelligence models and medical residents on an ABIM-style clinical exam

Nieves PCA, Rodríguez WFL, Molina OMC, Núñez HJC, Rivera TA, Rojas MA, Corona DA, Díaz GEJ

DOI: 10.35366/122614
URL: https://dx.doi.org/10.35366/122614

Language: Spanish
References: 16
Page: 118-124
PDF size: 334.25 Kb.


Key words:

artificial intelligence, ChatGPT, large language models, academic performance, ABIM.

ABSTRACT

This study evaluated the academic performance of four artificial intelligence language models (ChatGPT-4, Gemini 2.5, Claude 3.7, and DeepSeek R1) and Internal Medicine residents on an ABIM-style clinical examination. Mean accuracy rates were compared across groups, and within-group consistency was assessed. Gemini 2.5 achieved the highest score (98.3%, SD = 1.76), followed by Claude 3.7 (93.3%, SD = 2.11), ChatGPT-4 (92.7%, SD = 2.00), and DeepSeek R1 (90.7%, SD = 3.06). In contrast, the residents achieved a significantly lower mean score (60.4%, SD = 12.04). All AI models significantly outperformed the residents, and Gemini 2.5 also differed significantly from the other AI models. The lower standard deviations among the AI models indicate greater response consistency relative to the wide variability in the human group.
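The comparison described in the abstract (per-group mean accuracy, within-group spread, and a significance test between groups) can be sketched with a short script. The score lists below are hypothetical placeholders, not the study's raw data, and Welch's t-test is used as a generic two-sample comparison because the abstract does not name the exact statistical test employed.

```python
from statistics import mean, stdev
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical accuracy percentages per exam run (NOT the study's data).
gemini = [97.0, 99.0, 98.0, 99.5, 98.0]
residents = [55.0, 72.0, 48.0, 63.0, 64.0]

print(f"Gemini 2.5: mean = {mean(gemini):.1f}%, SD = {stdev(gemini):.2f}")
print(f"Residents:  mean = {mean(residents):.1f}%, SD = {stdev(residents):.2f}")
print(f"Welch t = {welch_t(gemini, residents):.2f}")
```

A large positive t value here reflects the same pattern the abstract reports: a wide gap in means combined with much smaller variance in the AI group than in the human group.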


REFERENCES

  1. Kaul V, Enslin S, Gross SA. History of artificial intelligence in medicine. Gastrointest Endosc. 2020; 92: 807-812.

  2. Hirani R, Noruzi K, Khuram H, Hussaini AS, Aifuwa EI, Ely KE et al. Artificial intelligence and healthcare: a journey through history, present innovations, and future possibilities. Life (Basel). 2024; 14 (5): 557.

  3. Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV et al. A review of the role of artificial intelligence in healthcare. J Pers Med. 2023; 13 (6): 951.

  4. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S et al. Artificial intelligence in healthcare: Past, present and future. Stroke Vasc Neurol. 2017; 2: 230-243.

  5. Khan B, Fatima H, Qureshi A, Kumar S, Hanan A, Hussain J et al. Drawbacks of artificial intelligence and their potential solutions in the healthcare sector. Biomed Mater Devices. 2023: 1-8.

  6. Chakraborty C, Bhattacharya M, Pal S, Lee SS. From machine learning to deep learning: Advances of the recent data-driven paradigm shift in medicine and healthcare. Curr Res Biotechnol. 2024; 7: 100164.

  7. Katz U, Cohen E, Shachar E, Somer J, Fink A, Morse E et al. GPT versus resident physicians: a benchmark based on official board scores. NEJM AI. 2024; 1 (5): AIdbp2300192.

  8. Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiala J, Wojciechowska K et al. ChatGPT-3.5 passes Poland's medical final examination: is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024; 12: 20503121241257777.

  9. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023; 13 (6): 1460-1487.

  10. Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med. 2024; 99 (2): 192-197.

  11. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. 2024; 10: e50965.

  12. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023; 13 (1): 16492.

  13. OpenAI. Introducing OpenAI o1 [Internet]. 2024. Available at: https://openai.com/index/introducing-openai-o1-preview/

  14. OpenAI. Learning to reason with LLMs [Internet]. 2024. Available at: https://openai.com/index/learning-to-reason-with-llms/

  15. Wu S, Koo M, Blum L, Black A, Kao L, Fei Z et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024; 1 (2): AIdbp2300092.

  16. Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E et al. Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese medical licensing examination: A comparative study [Internet]. medRxiv; 2024.




Figure 1
Table 1
Table 2
Table 3

