ChatGPT (Chat Generative Pre-Trained Transformer)1 is an artificial intelligence (AI) language model trained on a massive corpus of internet text using machine learning and natural language processing (NLP) algorithms. It can generate human-like responses to a wide variety of questions and prompts in multiple languages and subject areas.2 The main aim of this study was to evaluate how ChatGPT performs when answering multiple-choice questions in a highly specialized area of state-of-the-art medical expertise.
We conducted a descriptive analysis of the performance of ChatGPT (OpenAI, San Francisco; version 9) in the 2022 competitive examination for the post of specialist in thoracic surgery announced by the Andalusian Health Service.3 This particular exam was chosen because it uses a multiple-choice format with 4 possible answers, only one of which is correct. Candidates answer 2 sets of questions: a theoretical set of 100 direct questions and a practical set of 50 questions addressing clinical scenarios that require critical reasoning.
ChatGPT answered the questions on its online platform between February 10 and February 15, 2023, in response to the following prompt: “ANSWER THE FOLLOWING MULTIPLE-CHOICE QUESTION:”. A separate session was used for each theoretical question, whereas all the practical questions were answered in a single session, taking advantage of the model's retention of conversational context (memory retention bias) to increase its performance. The definitive official answer key published by the public administration3 was used as the reference standard. The examination consisted of 146 questions (98 theoretical, 48 practical) after the Andalusian Health Service excluded 7 questions and included 3 reserve questions.
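As an illustration only, the sketch below shows how the two session strategies described above could be scripted against the OpenAI chat API; this is not the procedure actually used (questions were entered manually in the ChatGPT web interface), and the model name and question texts are placeholders.

```python
# Minimal sketch, not the procedure actually used in the study: it illustrates
# the two session strategies (one fresh session per theoretical question vs.
# a single shared session for the practical questions).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
PROMPT = "ANSWER THE FOLLOWING MULTIPLE-CHOICE QUESTION:"
MODEL = "gpt-3.5-turbo"  # placeholder model name

def ask_theoretical(question: str) -> str:
    """Each theoretical question is asked in a fresh session (no shared context)."""
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{PROMPT}\n{question}"}],
    )
    return reply.choices[0].message.content

def ask_practical(questions: list[str]) -> list[str]:
    """Practical questions share one session, so earlier exchanges stay in context."""
    history, answers = [], []
    for question in questions:
        history.append({"role": "user", "content": f"{PROMPT}\n{question}"})
        reply = client.chat.completions.create(model=MODEL, messages=history)
        answer = reply.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        answers.append(answer)
    return answers
```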
ChatGPT answered 58.90% (86/146) of the questions correctly; inferential analysis showed that this rate differed significantly from the 25% expected by chance, at the 99% significance level (p < 0.001). The rate of correct answers was 63.2% (62/98) in the theoretical section compared with 50% (24/48) in the practical section. The official scoring criteria were applied to each question, including a penalty of −0.25 points for each incorrect answer and the weighting assigned to each section of the examination phase. The threshold specified in the official call for passing the exam was 60% of the average of the 10 best scores,3 and the pass mark was set at 40 points. The artificial intelligence model would therefore have passed this part of the access examination for thoracic surgery physician/specialist with a score of 45.79 points.
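For reference, the comparison against chance can be reproduced with a one-sided binomial test. The short sketch below is not the authors' analysis code: it uses only the figures reported above and computes the unweighted negative-marking score, without attempting to reproduce the official section weighting behind the final 45.79 points.

```python
# Illustrative only: binomial test of 86/146 correct answers against the 25%
# chance rate, plus the unweighted negative-marking score (+1 / -0.25).
# The official section weighting behind the 45.79-point result is not applied.
from scipy.stats import binomtest

correct, total = 86, 146
incorrect = total - correct           # assumes all 146 questions were answered

test = binomtest(correct, total, p=0.25, alternative="greater")
print(f"Accuracy: {correct / total:.1%}  p-value vs. chance: {test.pvalue:.1e}")

raw_score = correct - 0.25 * incorrect
print(f"Unweighted negative-marking score: {raw_score:.2f}")
```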
Our results are in line with the existing literature on the potential of ChatGPT for question-answering tasks in different areas of knowledge, including the medical field. For example, correct response rates of over 60% have been reported for the United States Medical Licensing Examination (USMLE) Step 1, over 57% for the USMLE Step 2,4 and over 50% for the 2022 access examination for specialist residency posts in Spain.5
In our study, the AI tool performed worse on practical questions than on theoretical questions, suggesting that it has difficulty with clinical practice scenarios that require critical reasoning. Among the limitations of the study, we did not analyze whether the model's accuracy depended on how the questions were worded, nor did we evaluate the justifications the model gave for its correct or incorrect answers (Fig. 1).
In conclusion, the ChatGPT model was capable of passing a competitive examination for the post of specialist in thoracic surgery, although its performance was weaker in areas that require critical reasoning. The emergence of AI tools that can address a wide variety of questions and tasks, including in the health field, their potential for further development, and their incorporation into our training and daily clinical practice pose a challenge for the scientific community.
Conflict of Interests

The authors state that they have no conflict of interests.