Abstract
Background Manual chart review using validated assessment tools is a standardised methodology for detecting diagnostic errors. However, it requires considerable human resources and time. ChatGPT, a recently developed artificial intelligence chatbot based on a large language model, can effectively classify text given suitable prompts and may therefore be able to assist manual chart review in detecting diagnostic errors.
Objective This study aimed to clarify whether ChatGPT could correctly detect diagnostic errors and possible factors contributing to them based on case presentations.
Methods We analysed 545 published case reports that included diagnostic errors. We entered the texts of the case presentations and the final diagnoses, together with original prompts, into ChatGPT (GPT-4) to generate responses, including a judgement on the presence of diagnostic errors and their contributing factors. Factors contributing to diagnostic errors were coded according to three taxonomies: Diagnosis Error Evaluation and Research (DEER), Reliable Diagnosis Challenges (RDC) and Generic Diagnostic Pitfalls (GDP). ChatGPT’s responses on the contributing factors were compared with those of physicians.
Results ChatGPT correctly detected diagnostic errors in 519/545 cases (95%) and coded significantly more factors contributing to diagnostic errors per case than physicians: DEER (median 5 vs 1, p<0.001), RDC (median 4 vs 2, p<0.001) and GDP (median 4 vs 1, p<0.001). The factors most frequently coded by ChatGPT as the most important contributing factor were ‘failure/delay in considering the diagnosis’ (315, 57.8%) in DEER, ‘atypical presentation’ (365, 67.0%) in RDC and ‘atypical presentation’ (264, 48.4%) in GDP.
Conclusion ChatGPT accurately detects diagnostic errors from case presentations. ChatGPT may be more sensitive than manual review in detecting factors contributing to diagnostic errors, especially ‘atypical presentation’.
- Chart review methodologies
- Diagnostic errors
- Artificial Intelligence
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Manual chart review using a standardised assessment tool is a reliable method for judging the presence or absence of diagnostic errors. However, manual chart reviews require significant human resources, which may limit studies of diagnostic excellence.
WHAT THIS STUDY ADDS
This study investigated the potential of ChatGPT to detect diagnostic errors and their contributing factors by reviewing case presentation texts from case reports of diagnostic errors. ChatGPT correctly detected diagnostic errors in most cases and coded a larger number of contributing factors to diagnostic errors than physicians.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE, OR POLICY
This study suggests that ChatGPT may reduce the effort and cost of manual chart reviews for judging diagnostic errors, supporting diagnostic excellence and diagnostic safety in healthcare.
Introduction
Recent advances in artificial intelligence (AI) technology have accelerated its implementation in diagnostic processes in clinical practice and in research on those processes. ChatGPT is an AI chatbot built on large language models (Generative Pre-trained Transformer 3.5 and 4) that produces accurate and detailed text-based responses to written prompts. Since its introduction for public use in November 2022, several studies have evaluated the performance of ChatGPT in clinical diagnosis.
Positive data for ChatGPT have been reported in the field of clinical diagnosis. Previous studies have shown that ChatGPT can answer medical knowledge questions at a passing level for national medical licensing examinations.1 2 Some authors have suggested that ChatGPT can be used to support clinical decisions.3 Several studies have shown that ChatGPT performs well at ranking the final diagnosis as the top differential diagnosis in both common and complex cases.4–8 In addition, a previous study suggested that the accuracy of ChatGPT’s differential diagnosis increased as more clinical context was provided and was not associated with the patient’s age, gender or case acuity.4 Although the diagnostic performance of ChatGPT in cases with higher risks of diagnostic errors, including uncommon diseases or atypical presentations,9–13 is still unknown,4 7 its high accuracy in clinical diagnosis suggests that it may be able to assess diagnostic processes.
ChatGPT may also be useful for research on diagnostic errors and diagnostic excellence. ChatGPT can classify clinical texts when definitions or rules for classification are specified. Studies that used ChatGPT for specific classifications based on clinical case descriptions found that, although the quality and internal reliability of its classifications were not yet optimal,14–16 it outperformed humans in classification speed.15 Reviewing patients’ clinical charts and records using standardised assessment tools is one of the most commonly used methods in research and quality improvement work on diagnostic errors; however, this practice requires considerable human resources, effort and time. This burden could be reduced if ChatGPT assisted humans in reviewing clinical charts and records. A pilot study is therefore needed to evaluate the potential of ChatGPT to assess the diagnostic process by reviewing texts describing cases. Because entering patient information into ChatGPT is not acceptable owing to concerns about personal information security, case reports are suitable data sources for such studies: they provide concise case presentations, include confirmed diagnoses with a high level of certainty and raise few personal information security concerns.
We previously conducted a systematic review of case reports containing diagnostic errors17 and collected data on the final diagnosis, commonality of disease, typicality of presentation and contributing factors of diagnostic errors in each case. Using this database, we conducted the present study to evaluate the performance of ChatGPT in assessing the diagnostic process by reviewing case descriptions.
Methods
Study design
This study used ChatGPT and case reports that contained diagnostic errors.
Target case reports and data used
The detailed selection of case reports is described in our previous study.17 In brief, we searched PubMed using search terms related to diagnostic errors, namely ‘diagnostic errors’, ‘delayed diagnosis’, ‘misdiagnosis’ and ‘wrong diagnosis’. We retrieved case reports with diagnostic errors that described only one patient and were published until 31 December 2021 from eight countries: Australia, Canada, Germany, Italy, Japan, the Netherlands, the UK and the USA. A total of 563 case reports of diagnostic errors were obtained after a two-stage screening: in the first step, two reviewers screened the case reports by reading the titles and abstracts, and in the second step, two of 11 reviewers screened the case reports by reading the full texts. We excluded 18 case reports written in languages other than English to avoid heterogeneity in ChatGPT’s outputs due to language differences, leaving 545 case reports for this study. For these case reports, we extracted the following data generated in the previous systematic review: the final diagnoses, commonality of the final diagnoses (common or uncommon), typicality of presentation (typical or atypical), and the most important codes and all codes of the Diagnosis Error Evaluation and Research (DEER),18 Reliable Diagnosis Challenges (RDC)19 and Generic Diagnostic Pitfalls (GDP)20 taxonomies. The detailed process for generating these data has been previously described.17
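For illustration only, the following sketch shows how a PubMed query of this kind could be run programmatically with Biopython’s Entrez module; the search terms are taken from the text above, while the publication-type filter, date handling and contact address are assumptions rather than the authors’ actual query.

```python
# Illustrative sketch of the PubMed retrieval step using Biopython's Entrez
# E-utilities. Search terms come from the text; the publication-type filter,
# date range format and email address are placeholder assumptions.
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address (placeholder)

terms = ['"diagnostic errors"', '"delayed diagnosis"', '"misdiagnosis"', '"wrong diagnosis"']
query = "(" + " OR ".join(terms) + ") AND Case Reports[Publication Type]"

# Restrict to reports published up to 31 December 2021 (publication date).
# ESearch returns at most 10,000 IDs per call; larger result sets would need
# the history server or paging.
handle = Entrez.esearch(db="pubmed", term=query, datetype="pdat",
                        mindate="1900/01/01", maxdate="2021/12/31", retmax=10000)
record = Entrez.read(handle)
handle.close()

pmids = record["IdList"]
print(f"Retrieved {len(pmids)} candidate case reports for screening")
```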
ChatGPT use
We used ChatGPT (GPT-4, 3 August 2023 version) between 15 and 23 August 2023. The prompts used in this study were developed by referencing those used in a previous study.6 We tested prototype prompts on five case reports of diagnostic errors not included in this study and finalised them after editing some parts to improve the outputs. We used a total of five prompts in the same chat for each case. The first part instructs ChatGPT to classify cases into four categories based on disease commonality and typicality of presentation: typical presentation of a common disease, atypical presentation of a common disease, typical presentation of an uncommon disease and atypical presentation of an uncommon disease; the criteria for determining common or uncommon diseases and typical or atypical presentations are also included in this part. The second part describes the case. After this prompt is entered, ChatGPT outputs the classification of disease commonality and typicality of presentation (1=typical presentation of common disease; 2=atypical presentation of common disease; 3=typical presentation of uncommon disease; and 4=atypical presentation of uncommon disease). The third part asks ChatGPT to judge whether diagnostic errors occurred in the presented case (1=diagnostic errors occurred; 0=no diagnostic errors occurred) based on the definition of diagnostic errors as ‘the failure to (a) establish an accurate and timely explanation of the patient’s health problem(s) or (b) communicate that explanation to the patient’. The fourth part asks ChatGPT to output all relevant codes of the DEER, RDC and GDP taxonomies, as well as the most important code of each taxonomy for the presented case, by displaying the respective taxonomy lists. The fifth part asks ChatGPT to summarise the codes output in response to the fourth part. Examples of prompts and responses by ChatGPT are provided in online supplemental file 1.
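The prompts were entered through the ChatGPT interface; purely as an illustrative sketch, the code below shows how a five-part, single-chat workflow of this kind could be automated with the OpenAI Python client. The model name, helper function and prompt placeholders are assumptions, not the authors’ actual implementation.

```python
# Illustrative sketch of issuing the five sequential prompts for one case in a
# single chat, so each response conditions on the earlier prompts and replies.
# Model name, prompt texts and helper names are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_case(case_text: str, final_diagnosis: str, prompt_parts: list[str]) -> list[str]:
    """Send the five prompt parts in order within one conversation and
    return ChatGPT's five responses."""
    messages = []
    responses = []
    for part in prompt_parts:
        # The case description and final diagnosis are substituted into the
        # second prompt part; {case}/{diagnosis} placeholders are assumptions.
        content = part.format(case=case_text, diagnosis=final_diagnosis)
        messages.append({"role": "user", "content": content})
        reply = client.chat.completions.create(model="gpt-4", messages=messages)
        answer = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        responses.append(answer)
    return responses
```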
Outcomes
We assessed the rate of diagnostic errors detected by ChatGPT as the primary outcome. The secondary outcomes were the distribution of the DEER, RDC and GDP codes assigned by ChatGPT; the rates of common diagnosis and typical presentation judged by ChatGPT; and the distribution of the four-category classification (typical presentation of common disease, atypical presentation of common disease, typical presentation of uncommon disease and atypical presentation of uncommon disease).
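As a minimal sketch of how these outcomes could be tabulated, each case’s ChatGPT outputs might be stored in a structured record such as the following; the field names are illustrative assumptions, not the authors’ data dictionary.

```python
# Hypothetical per-case record for tabulating ChatGPT's outputs; field names
# are illustrative assumptions, not the authors' data dictionary.
from dataclasses import dataclass, field

@dataclass
class CaseAssessment:
    pmid: str                      # PubMed ID of the case report
    error_detected: int            # 1 = diagnostic error occurred, 0 = not
    classification: int            # 1-4 commonality/typicality category
    deer_codes: list[str] = field(default_factory=list)   # all DEER codes
    rdc_codes: list[str] = field(default_factory=list)    # all RDC codes
    gdp_codes: list[str] = field(default_factory=list)    # all GDP codes
    most_important: dict[str, str] = field(default_factory=dict)  # taxonomy -> code
```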
Statistical analysis
Continuous and ordinal data are presented as medians with IQRs and were compared using the Mann-Whitney U test. Categorical data are presented as percentages and were compared using the χ2 test. We calculated the inter-rater agreement between ChatGPT and humans using Cohen’s kappa statistic for all outcomes. A p value <0.05 was considered significant. All statistical analyses were conducted using R V.4.1.0 (R Foundation for Statistical Computing, Vienna, Austria).
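The analyses were performed in R; solely to illustrate the tests named above, the following is a minimal Python equivalent using placeholder data rather than the study data.

```python
# Minimal Python illustration of the tests named above (the study itself used R).
# All arrays and tables below are placeholder examples, not study data.
from scipy.stats import mannwhitneyu, chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Per-case counts of DEER codes assigned by ChatGPT vs physicians (placeholders)
deer_chatgpt = [5, 7, 4, 6, 5]
deer_human = [1, 2, 1, 3, 1]
u_stat, p_mwu = mannwhitneyu(deer_chatgpt, deer_human, alternative="two-sided")

# 2x2 table of a categorical judgement by ChatGPT vs physicians (placeholder counts)
table = [[300, 65], [120, 60]]
chi2, p_chi2, dof, expected = chi2_contingency(table)

# Inter-rater agreement on a binary outcome (1 = error, 0 = no error), placeholders
chatgpt_labels = [1, 1, 1, 0, 1]
human_labels = [1, 1, 1, 1, 1]
kappa = cohen_kappa_score(chatgpt_labels, human_labels)

print(f"Mann-Whitney U p={p_mwu:.3f}, chi-square p={p_chi2:.3f}, kappa={kappa:.2f}")
```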
Patient and public involvement
It was not appropriate to involve patients or the public in the design, or conduct, or reporting, or dissemination plans of our research.
Results
ChatGPT’s assessment of the commonality of the final diagnosis and typicality of presentation
ChatGPT assessed the final diagnosis as common in 120/544 (22.1%) and the presentation as typical in 251/544 (46.1%) of the cases. Overall, ChatGPT classified six cases (1.1%) as typical presentations of common diseases, 114 (21.0%) as atypical presentations of common diseases, 245 (45.0%) as typical presentations of uncommon diseases and 179 (32.9%) as atypical presentations of uncommon diseases. The inter-rater agreement between humans and ChatGPT was 0.46 (0.36–0.56) for the commonality of the final diagnosis, 0.08 (0.00–0.16) for the typicality of presentation and 0.13 (0.07–0.20) for the classification by commonality and typicality.
ChatGPT’s assessment of diagnostic errors and factors contributing to diagnostic errors
ChatGPT detected that diagnostic errors occurred in 519/545 cases (95.0%). The number of factors contributing to diagnostic errors coded per case by ChatGPT was significantly higher than the number coded by humans for DEER (median 5; IQR 6 and 16 by ChatGPT vs median 1; IQR 3 and 11 by humans; p<0.001), RDC (median 4; IQR 6 and 18 vs median 2; IQR 4 and 9; p<0.001) and GDP (median 4; IQR 5 and 10 vs median 1; IQR 2 and 5; p<0.001). The most common DEER, RDC and GDP breakdowns coded by ChatGPT were ‘failure/delay in considering the diagnosis’ (510, 93.6%), ‘atypical presentation’ (513, 94.1%) and ‘atypical presentation’ (539, 98.9%), respectively, which were the same as the most common breakdowns coded by humans (tables 1–3).
The most important DEER, RDC and GDP breakdowns coded by ChatGPT were ‘failure/delay in considering the diagnosis’ (315, 57.8%), ‘atypical presentation’ (365, 67.0%) and ‘atypical presentation’ (264, 48.4%), respectively, whereas those coded by humans were ‘failure/delay in considering the diagnosis’ (234, 42.9%), ‘findings masking/mimicking another diagnosis’ (108, 19.8%) and ‘limitations of a test or exam finding not appreciated’ (160, 29.4%). The inter-rater agreements for the most important breakdown between ChatGPT and humans were 0.15 (0.09–0.21) for DEER, 0.04 (0.01–0.07) for RDC and 0.12 (0.07–0.17) for GDP.
Discussion
We found that, first, ChatGPT (GPT-4) correctly detected diagnostic errors in 95% of case reports by reading only the case description. Second, there was a large discrepancy between ChatGPT and physicians in assessing the commonality of the final diagnosis and the typicality of presentation. Third, ChatGPT identified more contributing factors to diagnostic errors using the DEER, RDC and GDP taxonomies than physicians and, compared with physicians, placed more weight on atypical presentation as the most important contributing factor to diagnostic errors.
This study indicates that ChatGPT can support research on diagnostic errors using published case reports. ChatGPT detected the presence of diagnostic errors in 95% of the case reports by reading only the case descriptions, suggesting high sensitivity for detecting diagnostic errors. Accordingly, ChatGPT could be a useful tool for collecting case reports that include diagnostic errors, facilitating their further study. ChatGPT could also be used to screen for diagnostic errors in clinical practice by entering a written summary of a patient’s care, which could support an effective feedback system for improving the diagnostic process in each institution through the timely detection of possible cases of diagnostic errors. The construction of effective feedback systems for the diagnostic process is recommended to improve clinical diagnosis.21–24 Detection of possible cases of diagnostic errors with less effort is a fundamental requirement for the development of such feedback systems.22 25 Even with current medical record systems, the use of trigger events or calculated scores to identify populations at high risk of diagnostic errors (eg, a clinical visit followed several days later by an unplanned hospitalisation or a subsequent visit to the emergency department, or patients with discrepancies in diagnosis between admission and discharge) has been proposed as an efficient screen for possible cases of diagnostic errors.25–28 However, these triggers may miss some patients with diagnostic errors (low sensitivity),25 and manual review of all cases is time-consuming and impractical. ChatGPT could therefore be used to screen for possible cases of diagnostic errors by entering only a summary of the case, aiding the detection of diagnostic errors in daily clinical practice. In this way, this study proposes a new method of implementing AI to improve diagnosis.
The use of ChatGPT for research or quality improvement in diagnostic safety has some issues. Accurate assessment of the diagnostic process and detection of the causes of diagnostic errors are vital in research and quality improvement work on diagnostic safety. In this study, there were large discrepancies between ChatGPT and physicians in assessing the commonality of disease, the typicality of presentation and the taxonomies of contributing factors to diagnostic errors (DEER, RDC and GDP). In particular, ChatGPT tended to judge more cases as atypical presentations and to code more contributing factors per case than physicians. These results indicate that ChatGPT may consider normal variation to be atypical and non-significant variations in the diagnostic process to be contributing factors to diagnostic errors. A previous study assessing the performance of fracture classification using ChatGPT also showed that, although ChatGPT classified significantly faster than humans, its classification performance was inferior.15 In another study, ChatGPT evaluated descriptions of patients’ neurological examinations using well-established neurological assessment scales; however, its accuracy was reduced when confronted with incomplete or vague descriptions.16 In addition, a previous study using the GPT-3.5 model indicated that the performance of ChatGPT in medical text classification tasks with few samples may still be far from optimal.14 Therefore, well-formatted case descriptions and refinement of prompts with sufficient samples are needed to enhance the ability of ChatGPT to assess the details of the diagnostic process and the causes of diagnostic errors. Nevertheless, the high sensitivity of ChatGPT in detecting atypical presentations and contributing factors of diagnostic errors makes it suitable for screening before manual assessment.
The high sensitivity of ChatGPT in detecting diagnostic errors in this study should be interpreted with caution in several respects. First, sensitivity alone does not guarantee the performance of a tool for determining the presence or absence of a target outcome. Because this study used case reports restricted to diagnostic errors, there was a selection bias in the included cases, and we could not assess the specificity of ChatGPT for detecting diagnostic errors. The high sensitivity of ChatGPT in this study might therefore reflect very low specificity (ie, ChatGPT may judge almost all cases as diagnostic errors). Future studies are needed to assess both the sensitivity and specificity of ChatGPT in detecting diagnostic errors. However, case reports do not seem to be suitable resources for such studies: case reports focused on diagnosis are generally published only when they include teaching points related to the diagnostic process, so finding case reports free from diagnostic errors may be difficult. Second, published case reports usually contain mostly relevant information without noise, where noise refers to any irrelevant or misleading data that diminishes the clarity of the clinical signal. Given the greater noise in real-world clinical charts, ChatGPT’s ability to detect and classify diagnostic errors may decline in real-world use. Considering these two issues, the next step should be to evaluate ChatGPT’s performance in detecting and classifying diagnostic errors using ‘live’ medical records with or without diagnostic errors, compared with the judgement of human expert reviewers.
Integrating ChatGPT into real-world settings to detect diagnostic errors poses a further challenge: a lack of context-specific knowledge. Contextual information such as available diagnostic resources (eg, staff, equipment, time, cost) and patient perspectives typically goes unrecorded in medical records. Consequently, ChatGPT may rely on an ideal diagnostic process and outcome as the reference standard for judging the presence or absence of diagnostic errors, potentially leading to overly sensitive error detection. Indeed, in this study, items related to fundamentally ‘human’ tasks in the diagnostic process, such as history taking and physical examination, were more frequently coded as contributing factors to diagnostic errors by ChatGPT than by ‘human’ physician researchers. We assume that this result arises because human researchers judge issues related to history taking and physical examination case by case, adjusting their thresholds according to background context (eg, setting, time restrictions) and clinical relevance to diagnostic errors. In contrast, ChatGPT may judge based only on the inflexible ‘texts’ in the prompts, and the issues it judges to be contributing factors may not be clinically relevant in some cases. To address this issue, and similar to the approach suggested for human expert reviews of potential diagnostic errors,22 29 developing a practical method of providing ChatGPT with contextual knowledge is crucial. Such a strategy would help mitigate the risk of excessively sensitive error judgements by ChatGPT; until such a method is developed, adjustment of ChatGPT’s judgements by human experts with contextual knowledge remains necessary. Another solution may be to add more detailed explanations to taxonomies such as DEER, RDC and GDP to tell ChatGPT which types of issues are clinically relevant to diagnostic errors in the real world.
This study had several limitations. First, we used the prompts only once for each case; entering the same prompts again could have produced different results. However, humans may also produce different outputs when performing the same tasks repeatedly. This limitation therefore does not reduce the value of this study; rather, it indicates that a double check by ChatGPT or humans is needed to validate ChatGPT’s responses on diagnostic errors and their contributing factors. Second, ChatGPT assessed the presence or absence of diagnostic errors and their contributing factors based only on the case descriptions and the definitions of diagnostic errors, commonality of disease, typicality of presentation and the three taxonomies. In contrast, the physician researchers judged the presence or absence of diagnostic errors and their contributing factors based on the entire case report, including the abstract, introduction, discussion and conclusions. Some of the difference between the ChatGPT and human assessments can be attributed to this. Considering this discrepancy, the ability of ChatGPT to correctly detect 95% of diagnostic errors and to identify additional contributing factors based only on case descriptions supports its utility as a screening tool. Third, it is unclear how many of the case reports included in this study were used for training ChatGPT; ChatGPT could therefore have generated the correct diagnosis for cases that were part of its pre-training data. Fourth, although GPT-4 is a multimodal AI model inherently capable of processing images and tables, this study excluded them. In cases where images and tables provided key diagnostic information, the quality of ChatGPT’s responses could therefore have been lower.
Conclusions
ChatGPT can be a useful tool for screening possible cases of diagnostic errors and shortlisting contributing factors in research on diagnostic errors using case reports. However, owing to limitations of ChatGPT, such as judging normal variation as abnormal, its responses must be validated by a physician.
Data availability statement
Data are available upon reasonable request. The data sets used in the current study will be made available from the corresponding author upon request.
Ethics statements
Patient consent for publication
Ethics approval
Not applicable.
References
Supplementary materials
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Footnotes
X @wataritari1
Contributors YH conceptualised this study. All authors contributed to data collection. YH drafted the manuscript, and all authors contributed to revision. YH acted as the guarantor.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.