
Acta Médica Grupo Angeles

Official Journal of the Hospital Angeles Health System

2026, Number 2


Acta Med 2026; 24 (2)

Comparative performance between artificial intelligence models and medical residents on an ABIM-style clinical exam

Nieves PCA, Rodríguez WFL, Molina OMC, Núñez HJC, Rivera TA, Rojas MA, Corona DA, Díaz GEJ

DOI: 10.35366/122614
URL: https://dx.doi.org/10.35366/122614

Language: Spanish
References: 16
Page: 118-124
PDF size: 334.25 Kb.


Key words:

artificial intelligence, ChatGPT, large language models, academic performance, ABIM.

ABSTRACT

This study evaluated the academic performance of four artificial intelligence language models (ChatGPT-4, Gemini 2.5, Claude 3.7, and DeepSeek R1) and Internal Medicine residents on an ABIM-style clinical examination. Mean accuracy rates were compared across groups, and within-group consistency was assessed. Gemini 2.5 achieved the highest score (98.3%, SD = 1.76), followed by Claude 3.7 (93.3%, SD = 2.11), ChatGPT-4 (92.7%, SD = 2.00), and DeepSeek R1 (90.7%, SD = 3.06). In contrast, the residents achieved a significantly lower mean score (60.4%, SD = 12.04). All AI models significantly outperformed the residents, and Gemini 2.5 also differed significantly from the other AI models. The lower standard deviations among the AI models indicate greater response consistency relative to the wide variability in the human group.
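The comparison described in the abstract (per-group mean accuracy, within-group spread, and a significance test between groups) can be sketched with a short script. The score lists below are hypothetical placeholders, not the study's raw data, and Welch's t-test is used as a generic two-sample comparison because the abstract does not name the exact statistical test employed.

```python
from statistics import mean, stdev
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical accuracy percentages per exam run (NOT the study's data).
gemini = [97.0, 99.0, 98.0, 99.5, 98.0]
residents = [55.0, 72.0, 48.0, 63.0, 64.0]

print(f"Gemini 2.5: mean = {mean(gemini):.1f}%, SD = {stdev(gemini):.2f}")
print(f"Residents:  mean = {mean(residents):.1f}%, SD = {stdev(residents):.2f}")
print(f"Welch t = {welch_t(gemini, residents):.2f}")
```

A large positive t value here reflects the same pattern the abstract reports: a wide gap in means combined with much smaller variance in the AI group than in the human group.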


REFERENCES

  1. Kaul V, Enslin S, Gross SA. History of artificial intelligence in medicine. Gastrointest Endosc. 2020; 92: 807-812.

  2. Hirani R, Noruzi K, Khuram H, Hussaini AS, Aifuwa EI, Ely KE et al. Artificial intelligence and healthcare: a journey through history, present innovations, and future possibilities. Life (Basel). 2024; 14 (5): 557.

  3. Al Kuwaiti A, Nazer K, Al-Reedy A, Al-Shehri S, Al-Muhanna A, Subbarayalu AV et al. A review of the role of artificial intelligence in healthcare. J Pers Med. 2023; 13 (6): 951.

  4. Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S et al. Artificial intelligence in healthcare: Past, present and future. Stroke Vasc Neurol. 2017; 2: 230-243.

  5. Khan B, Fatima H, Qureshi A, Kumar S, Hanan A, Hussain J et al. Drawbacks of artificial intelligence and their potential solutions in the healthcare sector. Biomed Mater Devices. 2023: 1-8.

  6. Chakraborty C, Bhattacharya M, Pal S, Lee SS. From machine learning to deep learning: Advances of the recent data-driven paradigm shift in medicine and healthcare. Curr Res Biotechnol. 2024; 7: 100164.

  7. Katz U, Cohen E, Shachar E, Somer J, Fink A, Morse E et al. GPT versus resident physicians: a benchmark based on official board scores. NEJM AI. 2024; 1 (5): AIdbp2300192.

  8. Suwala S, Szulc P, Guzowski C, Kaminska B, Dorobiala J, Wojciechowska K et al. ChatGPT-3.5 passes Poland's medical final examination: is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med. 2024; 12: 20503121241257777.

  9. Guillen-Grima F, Guillen-Aguinaga S, Guillen-Aguinaga L, Alas-Brun R, Onambele L, Ortega W et al. Evaluating the efficacy of ChatGPT in navigating the Spanish medical residency entrance examination (MIR): promising horizons for AI in clinical medicine. Clin Pract. 2023; 13 (6): 1460-1487.

  10. Yaneva V, Baldwin P, Jurich DP, Swygert K, Clauser BE. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med. 2024; 99 (2): 192-197.

  11. Meyer A, Riese J, Streichert T. Comparison of the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination: observational study. JMIR Med Educ. 2024; 10: e50965.

  12. Brin D, Sorin V, Vaid A, Soroush A, Glicksberg BS, Charney AW et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci Rep. 2023; 13 (1): 16492.

  13. OpenAI. Introducing OpenAI o1 [Internet]. 2024. Available at: https://openai.com/index/introducing-openai-o1-preview/

  14. OpenAI. Learning to reason with LLMs [Internet]. 2024. Available at: https://openai.com/index/learning-to-reason-with-llms/

  15. Wu S, Koo M, Blum L, Black A, Kao L, Fei Z et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024; 1 (2): AIdbp2300092.

  16. Liu M, Okuhara T, Dai Z, Huang W, Okada H, Furukawa E et al. Performance of advanced large language models (GPT-4o, GPT-4, Gemini 1.5 Pro, Claude 3 Opus) on Japanese medical licensing examination: A comparative study [Internet]. medRxiv; 2024.




Figure 1
Table 1
Table 2
Table 3

