Achieving high inter-rater reliability in establishing data labels: a retrospective chart review study
  1. Guosong Wu1,
  2. Cathy Eastwood1,
  3. Natalie Sapiro1,
  4. Cheligeer Cheligeer2,
  5. Danielle A Southern1,
  6. Hude Quan1,
  7. Yuan Xu1,3,4
  1. Centre for Health Informatics, Department of Community Health Sciences, University of Calgary, Calgary, Alberta, Canada
  2. Alberta Health Services, Calgary, Alberta, Canada
  3. Department of Surgery, University of Calgary, Calgary, Alberta, Canada
  4. Department of Oncology, Cumming School of Medicine, University of Calgary, Calgary, Alberta, Canada
  1. Correspondence to Dr Yuan Xu; yuxu{at}


Background In medical research, the effectiveness of machine learning algorithms depends heavily on the accuracy of labelled data. This study aimed to assess inter-rater reliability (IRR) in a retrospective electronic medical chart review conducted to create high-quality labelled data on comorbidities and adverse events (AEs).

Methods Six registered nurses with diverse clinical backgrounds reviewed patient charts and extracted data on 20 predefined comorbidities and 18 AEs. All reviewers underwent four iterative rounds of training aimed at enhancing accuracy and fostering consensus. Periodic monitoring was conducted at the beginning, middle and end of the testing phase to ensure data quality. Weighted kappa coefficients were calculated with their associated 95% confidence intervals (CIs).

Results Seventy patient charts were reviewed. The overall agreement, measured by Conger's kappa, was 0.80 (95% CI 0.78 to 0.82). IRR scores remained consistently high (ranging from 0.70 to 0.87) throughout each phase.

Conclusion Our study suggests that the detailed Manual for Chart Review and structured training regimen resulted in a consistently high level of agreement among our reviewers during the chart review process. This establishes a robust foundation for generating high-quality labelled data, thereby enhancing the potential for developing accurate machine learning algorithms.

  • Adverse events, epidemiology and detection
  • Patient safety
  • Chart review methodologies

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



Machine learning rapidly transforms medical and health services research by extracting valuable insights from extensive datasets through computational algorithms. The accuracy and effectiveness of supervised algorithms heavily depend on the quality and representativeness of labelled data.1 A widely acknowledged approach for generating labelled data is through chart review. Inter-rater reliability (IRR), which measures agreement among reviewers, becomes a key factor in ensuring the quality of data extraction.2
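IRR statistics such as kappa correct raw agreement for the agreement expected by chance. As a minimal, self-contained sketch of this idea (using invented labels and the two-rater Cohen's kappa rather than the multi-rater coefficients reported in this study), chance correction works as follows:

```python
def cohens_kappa(r1, r2):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(r1)
    categories = set(r1) | set(r2)
    # Observed proportion of items on which the raters agree
    po = sum(a == b for a, b in zip(r1, r2)) / n
    # Agreement expected by chance, from each rater's marginal label rates
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in categories)
    return (po - pe) / (1 - pe)

# Hypothetical example: two raters labelling the same 8 charts (1 = condition present)
rater_a = [1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(rater_a, rater_b))  # 0.75: raw agreement 0.875, chance 0.5
```

The same principle underlies the multi-rater coefficients used in this study; they differ mainly in how the chance-agreement term is modelled.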

This study aimed to examine IRR during a large retrospective chart review designed to create labelled data on comorbidities and adverse events (AEs). This high-quality labelled data will be used to improve the performance of AE case identification and classification by developing machine learning algorithms.


Six registered nurses with a minimum of three years of clinical practice, specialising in surgery, general internal medicine, intensive care, oncology or cardiology, were recruited and underwent rigorous training. Each independently reviewed inpatient electronic hospital charts, including cover pages, discharge summaries, trauma and resuscitation records, admission, consultation, diagnostic, surgery, pathology and anaesthesia reports, and multidisciplinary daily progress notes, and extracted data related to 20 predefined comorbidities and 18 AEs.3 The definitions of these comorbidities and AEs were explicitly outlined in the Manual for Chart Review (online supplemental appendix), which guided the review process. Chart review data were collected using REDCap, a secure web-based software platform.4


All reviewers underwent comprehensive training aimed at enhancing accuracy and fostering consensus. This training method had been employed and validated in a previous study.5 The foundation phase involved an initial review of the same 16 charts, focusing on encouraging discussion and resolving discrepancies through consensus. Results, including data extraction accuracy and agreement levels, were discussed at an in-person meeting after this review period. The training phase then comprised four iterative rounds, each involving the review of 10 charts, aimed at enhancing consistency and refining reviewer skills in data extraction. On achieving an overall excellent agreement level (k>0.8), the testing phase (official data collection, n=11 000) commenced. IRR was then monitored periodically, with 30 additional charts reviewed at the beginning (n=3000), middle (n=5000) and end (n=9000) of the primary study, to ensure consistent data extraction quality. Weighted kappa coefficients (Conger's, Brennan and Prediger's, Gwet's and Fleiss') and their associated 95% CIs were calculated for comorbidities and AEs using STATA V.16.2.6
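To illustrate how one of the reported multi-rater coefficients works, the following is a minimal Python sketch of Fleiss' kappa, which generalises chance-corrected agreement from rater pairs to a panel of reviewers. The counts below are invented for illustration, not study data, and this sketch is not the STATA routine the authors used:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N items, each rated by the same number of raters.

    counts[i][j] is how many raters assigned item i to category j.
    """
    N = len(counts)
    n = sum(counts[0])  # raters per item (assumed constant across items)
    # Per-item agreement: proportion of rater pairs that agree on the item
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Chance agreement from the pooled category proportions
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical counts: 3 charts, 6 reviewers, binary "AE present?" judgement
counts = [[6, 0],   # unanimous: no AE
          [0, 6],   # unanimous: AE present
          [3, 3]]   # reviewers split 3-3
print(fleiss_kappa(counts))  # ≈ 0.6
```

Conger's kappa follows the same observed-versus-chance template but estimates the chance term from each rater's own marginal rates rather than the pooled proportions.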


A total of 70 patient charts were selected and reviewed by each reviewer independently over the training and testing phases. The overall agreement measured by Conger’s kappa was 0.80 (95% CI 0.78 to 0.82). Other agreement metrics also demonstrated consistently high levels of agreement, including per cent agreement at 0.94 (95% CI 0.94 to 0.95), Brennan and Prediger’s kappa at 0.88 (95% CI 0.87 to 0.90), Gwet’s kappa at 0.92 (95% CI 0.91 to 0.93) and Fleiss’ kappa at 0.80 (95% CI 0.78 to 0.82).

IRR scores were consistently high (ranging from 0.70 to 0.87) at each phase (figure 1). Rater agreement (table 1) on comorbidities was robust (k=0.84, 95% CI 0.81 to 0.86), with higher agreement levels observed for diabetes (k=0.94, 95% CI 0.89 to 0.99) and psychosis (k=0.92, 95% CI 0.74 to 1.0). IRR was slightly lower for peptic ulcer disease (k=0.65, 95% CI 0.51 to 0.80) and renal disease (k=0.63, 95% CI 0.53 to 0.74). Rater agreement on AEs was 0.77 (95% CI 0.74 to 0.80). Notably, agreement on pressure injury (k=0.90, 95% CI 0.81 to 0.98), fall (k=0.87, 95% CI 0.79 to 0.96) and thromboembolic event (k=0.86, 95% CI 0.77 to 0.94) was robust. Reviewers encountered challenges achieving consensus on adverse drug events (k=0.44, 95% CI 0.33 to 0.55) and retained surgical item (k=0.40, 95% CI 0.39 to 0.40).
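One plausible contributor to the low kappa values for rare events such as retained surgical item, and to Gwet's kappa sitting well above the Conger and Fleiss estimates, is the well-known prevalence effect: when one category dominates, kappa-type coefficients that derive chance agreement from marginal rates can be low even when raw agreement is very high, whereas Gwet's AC1 is designed to be more stable. The Python sketch below, using invented two-rater counts rather than study data, illustrates the effect:

```python
def two_rater_stats(both_yes, a_only, b_only, both_no):
    """Percent agreement, Cohen's kappa and Gwet's AC1 from a 2x2 table."""
    n = both_yes + a_only + b_only + both_no
    po = (both_yes + both_no) / n
    p_a = (both_yes + a_only) / n        # rater A's "yes" rate
    p_b = (both_yes + b_only) / n        # rater B's "yes" rate
    # Cohen: chance agreement from the product of the marginal rates
    pe_cohen = p_a * p_b + (1 - p_a) * (1 - p_b)
    # Gwet AC1: chance agreement from the average prevalence
    pi = (p_a + p_b) / 2
    pe_gwet = 2 * pi * (1 - pi)
    return po, (po - pe_cohen) / (1 - pe_cohen), (po - pe_gwet) / (1 - pe_gwet)

# Hypothetical rare event: 100 charts, each rater flags only 2-3 of them
po, kappa, ac1 = two_rater_stats(both_yes=1, a_only=2, b_only=1, both_no=96)
print(round(po, 2), round(kappa, 2), round(ac1, 2))  # 0.97 0.39 0.97
```

Despite 97% raw agreement, Cohen-style kappa is only about 0.39, much like the retained surgical item result, while AC1 remains near the raw agreement.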

Table 1

Overall inter-rater reliability among six reviewers for comorbidities and adverse events

Figure 1

Inter-rater reliability changes across training and testing phases. Inter-rater reliability was measured by Conger’s kappa and its 95% CI.


In our study, the detailed Manual for Chart Review and structured training regimen resulted in a consistently high level of agreement among our reviewers during the chart review process. The increase in kappa agreement during the testing phase may be attributed to reviewers’ increased exposure to a larger number of charts, leading to greater familiarity with key definitions and fostering a common understanding of content knowledge. While excellent agreement on most items implies meticulous and well-documented patient charts, instances of disagreement among reviewers indicate potential incompleteness in chart information, highlighting inherent imperfections in chart documentation. Additionally, subjective judgements from reviewers during the review process may lead to variations in assessments and contribute to disparities. However, given the iterative training and detailed guidance from the Manual for Chart Review, the impact should be minimal. Overall, this high level of agreement establishes a robust foundation for generating high-quality labelled data, enhancing the potential for developing accurate machine learning algorithms.

Ethics statements

Patient consent for publication

Ethics approval

Research ethics was approved by the Conjoint Health Research Ethics Board at the University of Calgary (REB21-0416).


We express our gratitude to Hannah Qi, Noopur Swadas, Chris King, Olga Grosu and Jennifer Crotts for their dedicated efforts in conducting the chart review.


Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • X @GuosongWu

  • Contributors YX, HQ, CE, DAS and GW conceived and designed the study. NS led the chart review team, and GW conducted the linkage and data analysis. GW created the tables and figures and drafted the manuscript. All authors contributed to data interpretation and manuscript revision.

  • Funding Canadian Institutes of Health Research (CIHR, funding number: DC0190GP, funder website: The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

  • Competing interests GW is supported by the Canadian Institutes of Health Research (CIHR) Fellowship, the O’Brien Institute for Public Health Postdoctoral Scholarship and Cumming School of Medicine Postdoctoral Scholarship at the University of Calgary. CE and YX are supported by the CIHR grant (grant number: DC0190GP). The other authors have no conflicts of interest to report.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.