Abstract
The introduction of generative artificial intelligence (AI) has revolutionized healthcare and education. These AI systems, trained on vast datasets using advanced machine learning (ML) techniques and large language models (LLMs), can generate text, images, and videos, offering new avenues for enhancing surgical education. Their ability to produce interactive learning resources, procedural guidance, and feedback following virtual simulations makes them valuable in educating surgical trainees. However, technical challenges such as data quality issues, inaccuracies, and uncertainties around model interpretability remain barriers to widespread adoption. This review explores the integration of generative AI into surgical training, assessing its potential to enhance learning and teaching methodologies. While generative AI has demonstrated promise for improving surgical education, its integration must be approached cautiously, ensuring that AI input is balanced with traditional supervision and mentorship from experienced surgeons. Given that generative AI models are not yet suitable as standalone tools, a blended learning approach that integrates AI capabilities with conventional educational strategies should be adopted. The review also addresses limitations and challenges, emphasizing the need for more robust research on different AI models and their applications across various surgical subspecialties. The lack of standardized frameworks and tools for assessing the quality of AI outputs in surgical education necessitates rigorous oversight to ensure accuracy and reliability in training settings. By evaluating the current state of generative AI in surgical education, this narrative review highlights the potential for future innovation and research, encouraging ongoing exploration of AI in enhancing surgical education and training.
Keywords
Artificial Intelligence, AI, education, training
INTRODUCTION
Artificial intelligence (AI) has been transformative in the healthcare and education sectors by enhancing workflow efficiency through the automation of tasks[1]. One notable form of AI that has garnered considerable attention is generative AI, encompassing models that autonomously create novel content, including text, images, audio, and video[2]. Generative AI tools achieve this by leveraging machine learning (ML) techniques, particularly large language models (LLMs) and generative adversarial networks (GANs)[3-5]. LLMs are trained on vast quantities of textual data from an array of sources, enabling them to learn associations between lexical items and syntactic patterns to produce contextually specific responses[5-7]. Among the most widely recognized LLMs is ChatGPT by OpenAI (San Francisco, USA), which attracted over 100 million users within two months of its release[8]. GANs, another subset of generative AI, specialize in producing realistic visual data. These models utilize two neural networks: one generates images while the other evaluates their realism[9].
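To make the adversarial setup concrete, the following is a minimal, illustrative sketch in PyTorch of a generator paired with a discriminator; the architecture, data, and hyperparameters are hypothetical and do not correspond to any model cited in this review.

```python
# Minimal illustrative GAN: a generator maps random noise to images, while a
# discriminator scores how realistic those images appear. Sizes are arbitrary.
import torch
import torch.nn as nn

LATENT_DIM, IMG_PIXELS = 64, 28 * 28  # hypothetical small greyscale image

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_PIXELS), nn.Tanh(),   # outputs a synthetic image
)
discriminator = nn.Sequential(
    nn.Linear(IMG_PIXELS, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),         # outputs probability "real"
)

loss_fn = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def training_step(real_images: torch.Tensor) -> None:
    """One adversarial update: the discriminator learns to separate real from
    generated images, then the generator learns to fool the discriminator."""
    batch = real_images.size(0)
    fake_images = generator(torch.randn(batch, LATENT_DIM))

    # Discriminator update: label real images 1 and generated images 0
    opt_d.zero_grad()
    d_loss = loss_fn(discriminator(real_images), torch.ones(batch, 1)) + \
             loss_fn(discriminator(fake_images.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real
    opt_g.zero_grad()
    g_loss = loss_fn(discriminator(fake_images), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()

training_step(torch.rand(16, IMG_PIXELS) * 2 - 1)  # dummy batch for demonstration
```

In practice, image-generating systems use far larger convolutional networks and curated datasets, but the underlying principle of two networks trained against each other is the same.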
The ability to generate new content has piqued the interest of those within the surgical field for its potential applications in enhancing surgical education[10,11]. Traditionally, surgical training has involved a blend of theoretical instruction, observation of procedures, and supervised practice[12,13]. However, these conventional approaches encounter challenges such as limited access to diverse training scenarios. Additionally, to maximize the benefits of these modes of teaching, trainees require regular, structured, high-quality feedback to refine their techniques and enhance their performance, which is often difficult to provide due to time constraints and reduced opportunities for senior supervision[14,15]. LLMs demonstrate significant promise in overcoming these barriers by delivering real-time personalized feedback, owing to their ability to comprehend and generate text that mirrors natural human writing[16]. Furthermore, by integrating AI with surgical simulators, feedback in the form of text and data can be produced, allowing trainees to gain valuable insights into their performance outside the operating room[17]. Generative AI tools can also serve as educational aids for trainees. Through their chatbot-like interface, LLMs can be employed to answer surgical queries, create study materials including practice questions and case studies, and add interactivity by enabling dialogue and discussion. Meanwhile, GANs can potentially generate anatomical, pathological, and procedural images[18].
The intersection between AI technology and surgical education has become a topical area of research, leading to the publication of multiple studies in recent years[19-22]. However, few reviews focus specifically on the applications of generative AI models in surgical education. This narrative review aims to bridge this gap by providing an overview of existing applications, limitations, and future directions to foster continued development in this area.
METHODS
Two authors independently conducted an extensive literature search of the PubMed (January 1996), Scopus (March 2004), Web of Science (1997), and Cochrane Library (April 1996) databases, from each database’s inception until July 2024. The search strategy employed was: (“generative artificial intelligence” OR “generative AI” OR “AI-generat*” OR “AI generat*” OR “ChatGPT” OR “Dall-E” OR “Sora” OR “text-to-image” OR “text to image” OR “text-to-video” OR “text to video” OR “artificial intelligence” OR “AI” OR “AI technolog*” OR “AI model*” OR “AI system*” OR “AI technique*” OR “machine intelligence” OR “computer vision” OR “computer vision system*” OR “computer reasoning” OR “neural network*” OR “neural network model*” OR “computer neural network*” OR “large language model*” OR “LLM” OR “natural language processing” OR “generative adversarial network*” OR “machine learning” OR “machine learning algorithm*” OR “deep learning” OR “deep learning model*”) AND (“surgical training” OR “surgical education” OR “surgical competence*” OR “surgical trainee*” OR “surgical expertise” OR “surgical resident*” OR “surgical registrar*” OR “surgical fellow*” OR “surgical learning” OR “surgical curriculum*” OR “surgical preparation” OR “surgical exam*” OR “surgical skill*” OR “surgical technique*”). Titles and abstracts were initially screened, followed by a full-text review to assess eligibility. Figure 1 shows the PRISMA flow diagram of selected studies.
Inclusion criteria:
1. Primary research published in peer-reviewed journals, incorporating both experimental studies such as randomized controlled trials (RCTs) and non-randomized trials, as well as observational studies including cohort and case-control studies.
2. Studies focusing on generative AI systems capable of creating novel content and outputs.
3. Studies with clear applications to surgical training, including improving educational methods, surgical techniques, or the development of surgical skills.
Exclusion criteria:
1. Studies not published in the English language.
2. Review articles, pre-prints, case reports, conference proceedings, conference abstracts, and letters or editorial opinions.
3. Studies on non-generative AI systems, e.g., predictive models, diagnostic tools, and traditional ML algorithms.
4. Studies that do not discuss generative AI in the context of applications to surgical training.
Due to the significant heterogeneity between the studies included in our review, a formal meta-analysis could not be performed. The variability in study designs, AI models employed, educational outcomes measured, and surgical subspecialties investigated contributed to this heterogeneity. However, we extracted and presented the data in a flowchart and tabular format to provide a comprehensive overview of the existing evidence. This approach allows for a more precise comparison of the outcomes analyzed in each study, highlighting the current literature’s strengths and limitations [Table 1]. The tabulated data of included studies [Table 2] also serve as a valuable resource for identifying trends and gaps in the research, which could guide future investigations in this rapidly evolving field.
GENERATING INTERACTIVE EDUCATIONAL MATERIALS AND LEARNING RESOURCES
With ongoing advancements in surgery and the increasing volume of knowledge to grasp, LLMs may be adopted to enhance learning efficiency. By combining rapid response times with advanced natural language capabilities, these tools can serve as dynamic resources capable of answering surgical questions and creating customized learning materials[23]. Brennan et al. investigated using ChatGPT to optimize otolaryngology education by guiding trainees through procedures[24]. Although the LLM provided procedural steps for a tonsillectomy, reviewers noted that the response was more suitable for junior trainees, as ChatGPT struggled with the more nuanced details of the procedure. Similarly, Mohapatra et al. observed that AI-generated surgical protocols missed crucial information, often leading to confusion among residents[25]. This issue was further highlighted by Lebhar et al., in whose study plastic and reconstructive surgery residents identified multiple inaccuracies in ChatGPT-generated procedural steps for a Fisher cleft repair and preferred protocols written by experienced craniomaxillofacial surgeons[26]. These findings underscore the importance of integrating AI tools with expert oversight to ensure the accuracy and reliability of surgical education materials.
However, LLMs were more successful in generating interactive case studies to supplement surgical teaching and consolidate key concepts. ChatGPT was used to create a case study consisting of hypothetical patient data, clinical examination results, differential diagnoses, and a treatment plan, achieving a score of 100% from reviewers for its usefulness and accuracy[25]. A less specific prompt, by contrast, received a score of 43.33%, suggesting that ChatGPT can generate relevant case scenarios for study only when the prompt is well engineered[27]. Sevgi et al. determined that simulated case reports generated by ChatGPT were realistic in terms of their examination findings, investigations, and management[28]. Collectively, these studies indicate that LLMs are better suited to trainees and medical students requiring a simplified but high-yield overview of a topic, given that responses from ChatGPT are often concise and logical. The level of detail and precision, however, may be inadequate for more advanced trainees who already have an extensive knowledge base.
In addition to text, generative AI can produce images from text prompts via GANs. GANs hold the potential to produce images of anatomical structures and pathological features for learning, overcoming the privacy and confidentiality issues involved in using real patient images[10]. In an experimental study, Seth et al. investigated using AI models to artificially create images of skin ulcers, comparing the performance of DALL-E2 (OpenAI, San Francisco, USA), Midjourney (Midjourney, San Francisco, USA), and Blue Willow (LimeWire, San Francisco, USA) in performing this task[10]. Of the three tools, DALL-E2 was the most successful, avoiding issues such as overly stylized or completely irrelevant images. Although capable of mimicking realistic human skin, the images produced still lacked crucial details such as the depth and color of ulcers. Hence, in its current state, this technology cannot accurately create and depict medical images.
Integrating generative AI into surgical education holds great promise, yet significant challenges must be addressed before it can be widely adopted. From a technical perspective, generative AI models are prone to a commonly recognized phenomenon known as “hallucination”, in which the models generate nonsensical or fictitious information, including false references to non-existent literature[29]. Multiple studies uncovered incorrect answers supported by seemingly logical justifications[25,30,31]. This phenomenon poses major risks to surgical education, especially as junior trainees may have difficulty identifying mistakes in AI-generated responses and may even prefer them for their clarity[26]. There are further concerns surrounding the inherent quality of the data these models are trained on, which could lead to biases, inaccuracies, and errors in the output[29]. When combined with the failure of AI models to verify information or cite sources consistently, there are significant risks of propagating misinformation. Finally, AI models face the “black box problem”, where insufficient understanding of a model’s inner workings leads to a lack of transparency around the mechanisms by which responses are generated and, thus, mistrust toward the system[32]. Given these ongoing challenges, it is paramount that medical professionals thoroughly assess and fact-check AI-generated resources before implementing them in educational settings.
USE OF LLMS IN SURGICAL EXAMINATIONS
LLM technology has potential applications in surgical curricula and can be utilized by educators to develop exam questions, evaluate the clarity of questions, and mark examinations[24,28]. These roles can reduce educators’ workload, allowing more time to provide feedback and practical supervision to trainees. Sevgi et al. elucidated the ability of ChatGPT to design sample questions suitable for the level of a neurosurgery board exam, along with answers and relevant explanations[28]. While the LLM developed two appropriate sample questions, a third question was deemed unsuitable as it included two correct answers, underscoring the need to review AI-generated content before implementation in surgical examinations.
Several studies explored the accuracy of LLMs in answering questions. In one study, ChatGPT-4 achieved 83.0% accuracy on questions from a bariatric surgery textbook, with the highest success on definition and evaluation questions[33]. Another study, assessing performance on the Korean Surgical Society and Korean Academy of Medical Science (KAMS) board certification exam, found that GPT-4 achieved consistently high accuracy across several subspecialties, with an overall rate of 76.4%[34]. Along with answers, ChatGPT provided justifications for each question, although some justifications were factually incorrect despite appearing logically sound. While LLMs are not yet suitable for marking surgical examinations in their current state, they can still be utilized to assess and improve the clarity of questions to optimize the writing of surgical exams.
Beyond their applications in developing and evaluating exam questions, LLMs hold potential as tools for exam preparation. An RCT by Wu et al. leveraged ChatGPT as an interactive exam preparation tool for hepatobiliary surgery[35]. Traditionally, interns received handouts, textbook readings, lectures, and clinical skills teaching. In the experimental group, these materials were supplemented with ChatGPT. Instead of passive reading, ChatGPT offered an interactive platform for developing questions, participating in simulated dialogues, summarizing literature reviews, and clarifying surgical steps for various hepatobiliary procedures. A subsequent theoretical exam and clinical skills assessment showed that interns in the experimental group performed significantly better than those who received traditional teaching alone, highlighting the advantages of interactive learning for surgical knowledge.
In contrast, a crossover study by Araji and Brooks provided a different perspective on the efficacy of ChatGPT for surgical exam preparation[31]. Participants completed two standardized assessments on general surgery topics, first using either a Google search or ChatGPT, and then switching to the other resource for the second assessment. Interestingly, no difference in scores was observed between the two resources. A post-assessment survey revealed that only 26% of the 19 medical students were likely to use ChatGPT in their surgical rotations, citing issues such as fabricated references, a lack of images and diagrams, and inaccurate information. Conversely, a Google search allowed for the comparison of multiple resources and the screening of reliable sources. While preparation with ChatGPT yielded results comparable to Google, factors such as accuracy and the absence of graphics like concept maps should be considered.
FEEDBACK GENERATION IN SIMULATED AND CLINICAL SCENARIOS
Acquiring feedback from experienced surgeons is not always possible, and as such, alternative methods for receiving feedback have been explored, with AI offering promising results. One such system is the Virtual Operative Assistant (VOA), an “AI tutoring system”[36]. The VOA adopts a supervised ML algorithm that classifies learner performance based on pre-defined metrics representative of surgical performance. The VOA then integrates with NeuroVR (NeuroVR, Netherlands), a tumor resection virtual simulator that provides a realistic visual and tactile experience while simultaneously recording user metrics such as tool positioning, forces applied, and acceleration when manipulating simulated instruments[36]. Its generative AI component lies in its ability to create detailed audiovisual feedback after evaluating user metrics, outlining the user’s performance as a percentage score, generating a graph comparing user performance to that of an expert, and outputting a written statement on actionable steps to improve. Feedback is further enhanced by the delivery of a 60-second video showcasing an expert demonstration[36].
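As a simplified illustration of how such a system can turn raw simulator metrics into feedback, the sketch below compares hypothetical trainee metrics against expert benchmark ranges and produces a percentage score with written suggestions. The metric names, thresholds, and advice statements are invented for demonstration and do not represent the VOA's actual algorithm.

```python
# Simplified illustration (not the actual VOA implementation) of turning
# simulator metrics into a percentage score and actionable written feedback.
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    expert_low: float   # lower bound of the expert performance range
    expert_high: float  # upper bound of the expert performance range
    advice: str         # statement shown when the trainee falls outside the range

# Hypothetical benchmarks loosely inspired by the metric types described above
BENCHMARKS = [
    Benchmark("force_applied_newtons", 0.2, 0.6, "Reduce the force applied to tissue."),
    Benchmark("instrument_tip_divergence_mm", 0.0, 4.0, "Keep both instruments closer together."),
    Benchmark("healthy_tissue_removed_pct", 0.0, 5.0, "Limit resection to the tumour margin."),
]

def generate_feedback(trainee_metrics: dict[str, float]) -> tuple[float, list[str]]:
    """Return an overall percentage score and a list of improvement statements."""
    passed, advice = 0, []
    for b in BENCHMARKS:
        value = trainee_metrics[b.name]
        if b.expert_low <= value <= b.expert_high:
            passed += 1
        else:
            advice.append(b.advice)
    score = 100.0 * passed / len(BENCHMARKS)
    return score, advice

score, advice = generate_feedback({
    "force_applied_newtons": 0.9,
    "instrument_tip_divergence_mm": 3.1,
    "healthy_tissue_removed_pct": 2.0,
})
print(f"Performance score: {score:.0f}%")
for line in advice:
    print("-", line)
```

In the actual system, the classification of performance is learned from expert and trainee data rather than hand-coded thresholds, and the feedback is delivered as audiovisual material rather than plain text.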
In a randomized clinical trial by Fazlollahi et al., 70 medical students performed multiple simulated subpial resections on the NeuroVR[17]. Those who received feedback from the VOA between sessions demonstrated significantly higher performance scores, as assessed by a deep learning algorithm, compared to those who received traditional instructor feedback or no feedback at all[17]. A subsequent retrospective cohort study by Fazlollahi et al. following up on the RCT further emphasized the utility of AI-generated feedback[37]. Participants in the VOA group displayed significant improvements from baseline across 32 metrics by the conclusion of their fifth simulated tumor resection compared to controls who received no feedback between attempts. The most pertinent metrics included a reduced rate of healthy tissue removal and improved instrument control, as evidenced by reduced instrument divergence that matched expert benchmarks[37]. However, these improvements were accompanied by unintended effects, including a significant decrease in dominant hand velocity and acceleration as well as a slower rate of tumor removal, highlighting the need for a balanced approach that integrates AI-generated feedback with human guidance to minimize unwanted consequences.
A similar generative AI feedback system was explored by Ma et al. in the context of assessing needle handling and needle-driving skills while performing a simulated vesicourethral anastomosis on a da Vinci surgical robot[38]. Rather than obtaining live user metrics as with the VOA, a video of the simulated session was recorded and processed by an AI algorithm. Feedback was then delivered via an interface displaying selected video clips from the user side by side with an expert reference video, with a textual teaching point statement appearing below, e.g., “use a smooth, continuous motion”. Significant improvements in needle handling skills were observed in users compared to controls, although improvements in needle-driving skills failed to reach statistical significance, possibly because needle driving inherently requires more practice[38]. Though still a prototype, this AI feedback system could shorten learning curves and provide further opportunities to practice surgical skills outside the operating room. Figure 2 summarizes the generative AI applications in surgical education discussed in this review.
The findings from Yang and Shulruf corroborate those of Fazlollahi et al. and Ma et al.[17,37-39]. Medical interns were tasked with practicing suturing and ligature skills and were assigned to the WKS-2RII system, which utilizes an AI algorithm to analyze data collected from a webcam and from sensors embedded within a simulated silicone skin suturing pad. Parameters such as the forces applied to the tissue, tension, distance between sutures, and wound dehiscence were assessed, allowing real-time feedback to be generated in the form of visual data, images, and reference parameters. Students who received feedback from the WKS-2RII demonstrated higher performance in their surgical Objective Structured Clinical Examination (OSCE) than those taught by conventional tutoring, along with higher self-reported confidence in suturing and ligature skills[39].
Although less sophisticated and limited in the metrics that can be assessed, feedback can also be generated by providing LLMs with the postoperative details and outcomes of a procedure. A study by Jarry Trujillo et al. evaluated ChatGPT’s ability to identify errors and provide feedback using this approach[40]. Surgical residents assessed the usefulness and quality of ChatGPT’s responses in identifying and explaining errors in laparoscopic cholecystectomy scenarios. ChatGPT correctly identified the errors, with residents finding the AI responses useful 96.43% of the time and comparable in quality to those of experienced surgeons. However, it is essential to note that laparoscopic cholecystectomy is a highly standardized procedure with abundant literature available for the LLM to draw on. Furthermore, prompts were carefully crafted through multiple rounds of experimentation in a process known as “prompt engineering”[27]. Unlike systems such as the VOA, procedural details had to be manually translated into narrative text, a time-consuming process that may introduce bias[40]. Since the effectiveness of LLMs relies heavily on prompts, surgical trainees could benefit from guidelines on using tools such as ChatGPT effectively.
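For illustration, a well-engineered prompt of the kind described above might pair an explicit instruction about the model's role and expected output with a structured case summary. The sketch below uses the OpenAI Python client; the model name, prompt wording, and case details are hypothetical and not drawn from the cited study.

```python
# Illustrative sketch of a structured prompt asking an LLM to review a
# procedure; wording, model name, and case details are hypothetical.
from openai import OpenAI  # requires the `openai` package and an API key

client = OpenAI()

case_summary = (
    "Procedure: laparoscopic cholecystectomy. "
    "Critical view of safety not documented before clipping. "
    "Postoperative day 2: bile leak identified on imaging."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting a surgical educator. Identify technical errors "
                "in the case below, explain why each is an error, and suggest one "
                "corrective teaching point per error."
            ),
        },
        {"role": "user", "content": case_summary},
    ],
    temperature=0.2,  # lower temperature for more conservative, reproducible output
)

print(response.choices[0].message.content)
```

Separating the task description, expected output, and case details in this way reflects the prompt-engineering process described in these studies, although any output would still require expert review before being used for teaching.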
PERFORMANCE COMPARISON OF LLMS
In recent years, several LLMs have gained prominence, from OpenAI’s ChatGPT to Microsoft’s Bing. ChatGPT offers two available models: ChatGPT-3.5 and ChatGPT-4. While ChatGPT-3.5, released in 2022, is publicly accessible and available at no cost, ChatGPT-4, released the following year, is available at a monthly subscription cost and boasts improvements in functionality, memory, and performance[41]. On the KAMS board certification exam, ChatGPT-3.5 managed an accuracy of 46.8%, while ChatGPT-4 scored 76.4%[34]. The significant improvement in accuracy from ChatGPT-3.5 to ChatGPT-4 showcases the rapid advancement of generative AI and its potential for even greater performance in the future.
Interestingly, significant performance discrepancies were noted when comparing ChatGPT-4 with Google’s Bard (Google, California) and Microsoft’s Bing (Microsoft, Washington). In a study by Lee et al., ChatGPT-4 demonstrated improvements over its predecessor and other models, achieving an accuracy of 83% on textbook questions related to bariatric surgery, compared to Bard’s 76% and Bing’s 65%[33,34]. These results position ChatGPT-4 as the most accurate of the LLMs evaluated for surgical education to date. Guthrie et al. evaluated the efficacy of their specialty-specific LLM, the Operating and Anaesthetic Reference Assistant (OARA, Texas, USA), which was trained on a comprehensive dataset of peer-reviewed articles on surgery and anesthetics[42]. In their study, experts rated responses from OARA as 65.3% accurate, 20.0% partially accurate, and 14.7% inaccurate across 150 prompts in the surgery and anesthesia domains. These findings demonstrate the capabilities of specialty-specific LLMs, which can be regularly updated with current medical research and guidelines to enhance their accuracy and relevance.
FUTURE DIRECTION
Future research could benefit from incorporating larger sample sizes to enhance the validity and generalizability of the findings. Exploring other AI models is also crucial for developing a more comprehensive understanding of their capabilities, particularly since different models may be better suited for specific surgical education applications based on their training data. While most studies have focused on ChatGPT, it is important to consider other advanced models such as Llama 3.1 (Meta, California, USA) and Claude 3.5 Sonnet (Anthropic, San Francisco, USA)[43]. Current research has also primarily focused on a limited number of surgical subspecialties, emphasizing the generation of text for learning resources and feedback. Broadening the integration of LLMs across a wider range of specialties could identify where these models are most effective and reveal other applications in surgical education.
Additionally, there is no standardized evaluation framework or tool specific for AI-generated outputs in surgical education. While organizations such as the National Institute of Standards and Technology are developing metrics and methodologies for assessing AI technologies, such as accuracy and robustness, individual studies often rely on custom criteria and tools to evaluate content quality[44]. Given the increasing role of AI in surgical education, establishing a standardized protocol for assessing the validity and accuracy of AI-generated outputs should be a priority for future research.
While the potential benefits of integrating generative AI into surgical education are promising, it is crucial to consider the associated risks, particularly the potential negative consequences of over-reliance on AI systems[44]. One significant concern is the possibility of diminished clinical judgment and decision-making skills among trainees. As generative AI becomes more advanced and accessible, there is a risk that surgical trainees may begin to rely excessively on AI-generated guidance and feedback, potentially leading to a decline in the development of critical thinking and problem-solving skills that are essential in the operating room[43]. The nuances of surgical decision-making often require an understanding of context, patient-specific factors, and the ability to adapt to unexpected challenges, skills that may not be fully developed if trainees become overly dependent on AI tools.
Furthermore, over-reliance on AI could result in the erosion of traditional mentorship and the apprenticeship model of surgical training, which has long been the cornerstone of surgical education. The interpersonal exchange between trainee and mentor, where experiential knowledge and tacit understanding are passed down, cannot be replaced by AI. There is also a concern that the use of AI might lead to the standardization of training experiences, in which trainees are exposed to a narrower set of scenarios generated by AI rather than the broad spectrum of real-world cases that can only be experienced through hands-on practice and observation. Additionally, the “black box” nature of many AI models raises transparency and trust issues[44]. If trainees cannot fully understand the underlying mechanisms of AI decision-making, they may struggle to critically evaluate AI-generated recommendations, potentially leading to the acceptance of incorrect or suboptimal guidance. This lack of transparency could also foster a false sense of security, where the authority of AI is trusted implicitly without the necessary scrutiny, thereby increasing the risk of medical errors.
Finally, ethical and legal implications must be considered, particularly in the context of accountability. In cases where AI-generated recommendations lead to adverse outcomes, the delineation of responsibility between the AI system, the trainee, and the supervising surgeon becomes blurred. This ambiguity could complicate legal proceedings and raise concerns about the appropriate level of human oversight required when integrating AI into surgical practice. Given these risks, it is imperative that the integration of AI in surgical training is approached with caution. There must be a deliberate effort to balance AI-assisted learning and traditional training methods, ensuring that AI serves as an adjunct to, rather than a replacement for, the essential components of surgical education. Ongoing research and the development of comprehensive guidelines will be crucial in mitigating these risks and ensuring that AI enhances, rather than detracts from, the quality of surgical training.
CONCLUSION
Generative AI tools offer the potential to generate tailored interactive learning resources and exam preparation material, along with feedback following virtual simulations. However, given the technical challenges of current AI models, further development of the technology may be required before more widespread adoption. In its current state, the integration of generative AI should be approached with caution and balanced with traditional supervision from experienced surgeons through a blended learning approach. Furthermore, applications should be focused on clearly defined and well-documented topics, guided by high-quality prompts to ensure accuracy and relevance. Nonetheless, ongoing research is necessary to determine the feasibility of generative AI use across surgical subspecialties and to explore other potential applications.
DECLARATIONS
Authors’ contributions
Methodology, literature search, data extraction, manuscript writing and editing: Yang E, Rao L, Dissanayake S
Manuscript writing and editing, and supervision: Seth I, Cuomo R, Rozen WM
All authors have made substantial contributions to the study and agree with the final version of the manuscript.
Availability of data and materials
All materials utilized in this review are accessible through PubMed, Scopus, Web of Science (Clarivate), and the Cochrane Library.
Financial support and sponsorship
None.
Conflicts of interest
Rozen WM and Cuomo R are on the Editorial Board of the Plastic Aesthetic Research Journal, while the other authors declare that there are no conflicts of interest. Rozen WM and Cuomo R are guest editors for Artificial Intelligence in Plastic Surgery. Ishith Seth is the Guest Editor Assistant for Artificial Intelligence in Plastic Surgery.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2024.