Table 3: Games-Howell comparisons between AI models and resident subgroups (R1-R4).

| (I) Group | (J) Group | Mean difference (I − J) | SE | p | 95% CI |
|---|---|---|---|---|---|
| ChatGPT 4 | Gemini 2.5 | **−5.66667** | 0.86781 | < 0.001 | −8.6384 to −2.6949 |
| | Claude 3.7 | −0.66633 | 0.66667 | 0.963 | −3.2270 to 1.8943 |
| | DeepSeek R1 | 2.00000 | 1.27657 | 0.762 | −2.4626 to 6.4626 |
| | R1 | **35.04897** | 2.62252 | < 0.001 | 25.7565 to 44.3415 |
| | R2 | **34.11467** | 3.95452 | < 0.001 | 19.1239 to 49.1055 |
| | R3 | **32.21444** | 4.34475 | < 0.001 | 15.2329 to 49.1960 |
| | R4 | 23.03333 | 5.85874 | 0.091 | −3.9323 to 49.9989 |
| Gemini 2.5 | ChatGPT 4 | **5.66667** | 0.86781 | < 0.001 | 2.6949 to 8.6384 |
| | Claude 3.7 | **5.00033** | 0.55556 | < 0.001 | 2.8665 to 7.1342 |
| | DeepSeek R1 | **7.66667** | 1.22222 | < 0.001 | 3.3238 to 12.0095 |
| | R1 | **40.71564** | 2.59650 | < 0.001 | 31.4615 to 49.9698 |
| | R2 | **39.78133** | 3.93731 | < 0.001 | 24.7981 to 54.7645 |
| | R3 | **37.88111** | 4.32909 | < 0.001 | 20.8989 to 54.8633 |
| | R4 | **28.70000** | 5.84714 | 0.039 | 1.6972 to 55.7028 |
| Claude 3.7 | ChatGPT 4 | 0.66633 | 0.66667 | 0.963 | −1.8943 to 3.2270 |
| | Gemini 2.5 | **−5.00033** | 0.55556 | < 0.001 | −7.1342 to −2.8665 |
| | DeepSeek R1 | 2.66633 | 1.08866 | 0.321 | −1.5152 to 6.8478 |
| | R1 | **35.71531** | 2.53637 | < 0.001 | 26.5351 to 44.8955 |
| | R2 | **34.78100** | 3.89792 | < 0.001 | 19.8093 to 49.7527 |
| | R3 | **32.88078** | 4.29330 | < 0.001 | 15.8918 to 49.8698 |
| | R4 | 23.69967 | 5.82068 | 0.083 | −3.3921 to 50.7914 |
| DeepSeek R1 | ChatGPT 4 | −2.00000 | 1.27657 | 0.762 | −6.4626 to 2.4626 |
| | Gemini 2.5 | **−7.66667** | 1.22222 | < 0.001 | −12.0095 to −3.3238 |
| | Claude 3.7 | −2.66633 | 1.08866 | 0.321 | −6.8478 to 1.5152 |
| | R1 | **33.04897** | 2.76014 | < 0.001 | 23.5009 to 42.5971 |
| | R2 | **32.11467** | 4.04709 | < 0.001 | 17.0584 to 47.1709 |
| | R3 | **30.21444** | 4.42918 | 0.001 | 13.2151 to 47.2138 |
| | R4 | 21.03333 | 5.92162 | 0.124 | −5.7488 to 47.8154 |
| R1 | ChatGPT 4 | **−35.04897** | 2.62252 | < 0.001 | −44.3415 to −25.7565 |
| | Gemini 2.5 | **−40.71564** | 2.59650 | < 0.001 | −49.9698 to −31.4615 |
| | Claude 3.7 | **−35.71531** | 2.53637 | < 0.001 | −44.8955 to −26.5351 |
| | DeepSeek R1 | **−33.04897** | 2.76014 | < 0.001 | −42.5971 to −23.5009 |
| | R2 | −0.93431 | 4.65048 | 1.000 | −17.0252 to 15.1565 |
| | R3 | −2.83453 | 4.98654 | 0.999 | −20.5372 to 14.8681 |
| | R4 | −12.01564 | 6.34930 | 0.592 | −38.1539 to 14.1226 |
| R2 | ChatGPT 4 | **−34.11467** | 3.95452 | < 0.001 | −49.1055 to −19.1239 |
| | Gemini 2.5 | **−39.78133** | 3.93731 | < 0.001 | −54.7645 to −24.7981 |
| | Claude 3.7 | **−34.78100** | 3.89792 | < 0.001 | −49.7527 to −19.8093 |
| | DeepSeek R1 | **−32.11467** | 4.04709 | < 0.001 | −47.1709 to −17.0584 |
| | R1 | 0.93431 | 4.65048 | 1.000 | −15.1565 to 17.0252 |
| | R3 | −1.90022 | 5.79881 | 1.000 | −21.8803 to 18.0799 |
| | R4 | −11.08133 | 7.00530 | 0.751 | −37.6942 to 15.5315 |
| R3 | ChatGPT 4 | **−32.21444** | 4.34475 | < 0.001 | −49.1960 to −15.2329 |
| | Gemini 2.5 | **−37.88111** | 4.32909 | < 0.001 | −54.8633 to −20.8989 |
| | Claude 3.7 | **−32.88078** | 4.29330 | < 0.001 | −49.8698 to −15.8918 |
| | DeepSeek R1 | **−30.21444** | 4.42918 | 0.001 | −47.2138 to −13.2151 |
| | R1 | 2.83453 | 4.98654 | 0.999 | −14.8681 to 20.5372 |
| | R2 | 1.90022 | 5.79881 | 1.000 | −18.0799 to 21.8803 |
| | R4 | −9.18111 | 7.23276 | 0.891 | −36.2745 to 17.9122 |
| R4 | ChatGPT 4 | −23.03333 | 5.85874 | 0.091 | −49.9989 to 3.9323 |
| | Gemini 2.5 | **−28.70000** | 5.84714 | 0.039 | −55.7028 to −1.6972 |
| | Claude 3.7 | −23.69967 | 5.82068 | 0.083 | −50.7914 to 3.3921 |
| | DeepSeek R1 | −21.03333 | 5.92162 | 0.124 | −47.8154 to 5.7488 |
| | R1 | 12.01564 | 6.34930 | 0.592 | −14.1226 to 38.1539 |
| | R2 | 11.08133 | 7.00530 | 0.751 | −15.5315 to 37.6942 |
| | R3 | 9.18111 | 7.23276 | 0.891 | −17.9122 to 36.2745 |

Mean differences, standard errors (SE), significance (p), and 95% confidence intervals (95% CI) for pairwise comparisons between AI models and resident subgroups (R1-R4) on the ABIM-style (American Board of Internal Medicine) exam.

Statistically significant differences (p < 0.05) are shown in bold.
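As a reading aid, the sketch below checks three properties that the table should satisfy: each interval is centered on its mean difference, each (J, I) row is the sign-flipped mirror of its (I, J) row, and an interval excludes zero exactly when the comparison is significant at the 0.05 level. It is not part of the original analysis. The numbers are copied from a few rows of Table 3, and the helper names (`is_centered`, `is_mirror`, `excludes_zero`) are hypothetical, introduced only for illustration.

```python
# Selected rows from Table 3: (I, J) -> (mean difference, CI lower, CI upper, p).
# p is stored as 0.001 where the table reports "< 0.001" (an upper bound).
rows = {
    ("ChatGPT 4", "Gemini 2.5"): (-5.66667, -8.6384, -2.6949, 0.001),
    ("Gemini 2.5", "ChatGPT 4"): (5.66667, 2.6949, 8.6384, 0.001),
    ("Gemini 2.5", "R4"): (28.70000, 1.6972, 55.7028, 0.039),
    ("R4", "Gemini 2.5"): (-28.70000, -55.7028, -1.6972, 0.039),
    ("ChatGPT 4", "R4"): (23.03333, -3.9323, 49.9989, 0.091),
    ("R4", "ChatGPT 4"): (-23.03333, -49.9989, 3.9323, 0.091),
}


def is_centered(diff, lo, hi, tol=1e-3):
    """A symmetric interval has its midpoint at the point estimate."""
    return abs((lo + hi) / 2 - diff) < tol


def is_mirror(row, mirrored, tol=1e-9):
    """Swapping I and J negates the difference and swaps the CI bounds."""
    diff, lo, hi, _ = row
    m_diff, m_lo, m_hi, _ = mirrored
    return (
        abs(diff + m_diff) < tol
        and abs(lo + m_hi) < tol
        and abs(hi + m_lo) < tol
    )


def excludes_zero(lo, hi):
    """True when the whole interval lies on one side of zero."""
    return lo > 0 or hi < 0


for (i, j), (diff, lo, hi, p) in rows.items():
    print(
        f"{i} vs {j}: centered={is_centered(diff, lo, hi)}, "
        f"mirror={is_mirror(rows[(i, j)], rows[(j, i)])}, "
        f"CI excludes 0={excludes_zero(lo, hi)}, p<0.05={p < 0.05}"
    )
```

The same checks can be extended to every row; a row whose interval is not the mirror of its counterpart usually points to a transcription slip rather than a statistical effect.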