Methods
The present study
We used the method developed by Dean and Barber to assess the potential severity of MAEs.5 The method was described in detail in a published protocol.8
In brief, 37 healthcare professionals (doctors, nurses and pharmacists) from Brazil were invited to score 50 cases of MAEs, gathered from the original UK study,5 for their potential harm to patients on a scale of 0–10. Reliability was assessed by having 10 of the 50 cases scored twice, and potential sources of variability in scoring were examined according to the error case, the occasion on which the scores were given, the scorer, their profession and the interactions among these variables. We limited the rescoring to 10 cases to avoid excessive workload for the judges and to reduce the risk of non-response; this number was also sufficient in the earlier study.5
Generalisability theory was used to analyse the data.
The previous methodology
When creating their method, Dean and Barber selected from the literature 50 medication error cases, in approximately equal numbers representing minor, moderate and severe potential clinical outcomes; in 16 of these cases, the patient outcome was already known. These cases were then sent to 30 different healthcare professionals (10 physicians, 10 nurses and 10 pharmacists). These judges were asked to score the potential clinical significance on a Visual Analogue Scale ranging from 0 to 10 (with 0 corresponding to ‘no harm’ and 10 corresponding to ‘patient death’). Specifically, this error severity classification involves: (1) Minor—very unlikely that the patient will develop any adverse event; (2) Moderate—likely to cause an adverse event in the patient or interfere with the therapeutic goal, but very unlikely to cause death or harm lasting more than a week and (3) Serious—an error that could lead to permanent harm or death to the patient. A subset of 10 cases was evaluated on a second occasion by all judges. The data were analysed using generalisability theory.
Judge recruitment
Judges were recruited from eight large general hospitals. Hospitals were chosen to give geographical diversity, since Brazil is a country of continental dimensions with potential for regional differences in practice. Four of the five regions of Brazil were represented.
The heads of service at these hospitals were contacted to identify doctors (internists, generalists and clinical specialists), nurses and pharmacists with more than 3 years of experience willing to evaluate the potential severity of the 50 medication errors.
After the heads of service accepted, the 37 professionals were contacted and invited by email; consent forms and letters describing the scoring guidelines, the objectives of the study and practical examples of how to perform the scoring were sent to them, regardless of location. No incentive was offered to professionals to participate in this study.
Scoring process
The invited professionals received a file with the descriptions of the 50 cases of MAEs and were instructed to score the cases according to their potential clinical significance, using the scale proposed by Dean and Barber. The scores provided by these professionals were then analysed. Two weeks after submitting their severity ratings for the 50 cases, each respondent received a random sample of 10 of the 50 cases for rescoring, selected using the RV.BINOM function in SPSS V.29.0.
That way, it was possible to measure whether the occasion on which the cases were scored was an important source of variance in the responses obtained.
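The random draw of a rescoring subset was performed with SPSS’s RV.BINOM function; the same kind of selection can be sketched in Python (the seed and case identifiers below are illustrative, not taken from the study):

```python
import random

random.seed(42)  # illustrative seed for reproducibility; not from the study

case_ids = list(range(1, 51))                          # the 50 MAE cases
rescore_sample = sorted(random.sample(case_ids, 10))   # 10 cases drawn without replacement

print(rescore_sample)  # the subset sent back for rescoring two weeks later
```

Sampling without replacement guarantees that each judge rescored 10 distinct cases from the original set.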
Raters were instructed to record the time spent evaluating all 50 cases and to note any comments about the scoring process in a dedicated space on the form, in addition to completing a brief questionnaire on demographic details, including their occupation and number of years of professional experience.
Translation and contextualisation of the cases
The 50 cases were translated by the principal investigator into Portuguese, updated (if the drugs were no longer available or not in routine use), and adapted to the Brazilian context (making necessary adjustments regarding drugs, doses, concentrations, units of measurement, pharmaceutical forms and available presentations to ensure that all were medications routinely used in clinical practice in Brazilian hospitals) (online supplemental appendix A).
The translated and adapted versions were double-checked by two experienced hospital pharmacists to ensure that the meaning of each MAE case remained unchanged. We used the same cases as those used in the UK5 and German6 studies to allow comparison with the current study.
Generalisability theory
Cronbach et al9 developed generalisability theory, a method that allows the effects of multiple sources of variance and their interactions on scores to be measured systematically and simultaneously in a single study. It is based on the premise that, in any assessment procedure, variance in scores can be attributed to different identifiable sources.
Generalisability theory also emphasises the estimation of variance components. Once the variance attributed to each source is calculated, the most efficient method for reducing unwanted variations can be determined. The results can be used to identify methods for improving the reliability of a test.10
The application of generalisability theory occurs in stages. First, generalisability analysis begins with the specification of a universe of admissible observations through the identification of different sources of variation. In the second stage, the generalisability study (‘G-study’) estimates the variance components of this universe. This involves creating an appropriate research design, collecting data and determining the extent to which each variable influences scores. Different generalisability coefficients are calculated to represent the different situations. For example, a coefficient can be calculated to show the extent to which the score assigned to a case by a physician can be generalised to that assigned to the same case by a pharmacist. The final step is a decision study (‘D-study’) associated with a prespecified universe of generalisability.9 10 Broadly speaking, D-studies emphasise the estimation, use and interpretation of variance components for decision-making, with well-specified measurement procedures.11 Perhaps the most important D-study consideration is the specification of a universe of generalisation: the universe to which a decision-maker wants to generalise, based on the results of a D-study using a particular measurement procedure.10 From the estimated variance, the effect of a change in the number of observations on the generalisability coefficient can be explored.
Reliability analysis
Universe of observations
We used the same approach to analysis as that used previously.5 The sources of variance in the process of assessing the MAEs were those inherent in the cases themselves (‘case’), the occasion on which they were assessed (‘occasion’), the evaluator (‘judge’), the professional background of the judge (‘profession’) and the interactions among these. Since each judge is a member of a single profession, the judge factor was considered to be nested within the profession factor (‘judge: profession’).
Because the scores for the 50 cases of errors were obtained on two occasions for a sample of 10 cases, there are two models for conducting the G-study, depending on the data set used:
Model 1: occasion×case×judge (using the 10 cases scored twice).
Model 2: case×judge: profession (using all 50 cases).
Models 1 and 2 ignored the effect of profession and occasion, respectively. A model that considers all sources of variance for the 10 cases with repeated scores, occasion×case×judge: profession, was not used because the variance per case was anticipated to be too high to perform an analysis of variance (ANOVA).
Generalisability study
The data were evaluated using models 1 and 2 to determine the contribution of each factor to the variance in scores. First, a repeated-measures ANOVA was performed and seven sources of variance were estimated for model 1: case, occasion, occasion×case, judge, judge×case, occasion×judge and judge×case×occasion. For model 2, the sources of variance were profession, judge ‘nested’ in profession, case, case×profession and residual variance (case×judge: profession). Online supplemental appendix B provides the equations used to calculate the generalisability coefficients.
The resulting mean square values were then used to calculate the attributable variance for each source, using equations for the mean squares based on those described by Streiner and Norman12 and Cronbach et al.9 When the estimated variance components were computed as negatives, a value of 0 was assumed.13 The overall generalisability coefficient, coefficients equivalent to inter-rater reliability and test–retest reliability were computed.
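The steps above, from mean squares to variance components to a generalisability coefficient, can be sketched for a simple fully crossed case×judge design. The score matrix below is hypothetical (illustrative 0–10 severity scores, not the study’s data), and the coefficient shown is the inter-rater form for the mean over all judges:

```python
import numpy as np

# Hypothetical scores: rows = cases, columns = judges (illustrative, 0-10 scale)
scores = np.array([
    [2.0, 3.0, 2.5, 3.5],
    [8.0, 7.5, 9.0, 8.5],
    [5.0, 4.5, 6.0, 5.5],
    [1.0, 2.0, 1.5, 2.5],
    [9.0, 8.0, 9.5, 9.0],
])
n_c, n_j = scores.shape

grand = scores.mean()
case_means = scores.mean(axis=1)
judge_means = scores.mean(axis=0)

# Mean squares for a fully crossed case x judge design (one score per cell)
ms_case = n_j * np.sum((case_means - grand) ** 2) / (n_c - 1)
ms_judge = n_c * np.sum((judge_means - grand) ** 2) / (n_j - 1)
resid = scores - case_means[:, None] - judge_means[None, :] + grand
ms_resid = np.sum(resid ** 2) / ((n_c - 1) * (n_j - 1))

# Expected-mean-square equations; negative estimates are set to zero
var_resid = ms_resid
var_case = max((ms_case - ms_resid) / n_j, 0.0)
var_judge = max((ms_judge - ms_resid) / n_c, 0.0)

# Generalisability coefficient for the mean score over n_j judges
g_coef = var_case / (var_case + var_resid / n_j)
print(round(g_coef, 3))
```

In this toy matrix the cases differ far more than the judges do, so the coefficient is high; the study’s actual models additionally partition occasion and profession effects.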
Decision study
The D-study was based on the G-study results and provided the decision-making information needed for reliable use of the scoring scale. The D-study was also used with model 1 to calculate G coefficients identifying the number of judges required to achieve sufficiently reliable use of the scale.
In the D-study, the effects of different modifications to the evaluation procedure on the generalisability coefficient were investigated, and the accuracy of the obtained measurements evaluated. Therefore, different scenarios based on the results of the G-study were investigated. The same model as in the G-study was used to calculate the generalisability coefficients for different numbers of judges and for judges from different professions. This allowed the number of judges needed to obtain a reliable average score to be identified. The D-study also investigated whether judges needed to be from the same or different professions. Generalisability coefficients for different numbers of judges and test occasions were calculated using the formula described by Streiner and Norman.12 As in previous studies, a generalisability coefficient greater than 0.8 was taken to represent acceptable reliability.5
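The core D-study projection can be sketched as follows, assuming illustrative variance components (not the study’s estimates): the generalisability of a mean over n judges is the case variance divided by itself plus the judge-related error variance scaled by n, in the Streiner and Norman form.

```python
# Hypothetical G-study variance components (illustrative, not the study's estimates)
var_case = 4.0    # true between-case variance
var_error = 6.0   # judge-related error variance (judge x case and residual combined)

def g_coefficient(n_judges: int) -> float:
    """Generalisability of the mean score over n_judges."""
    return var_case / (var_case + var_error / n_judges)

# Smallest panel size whose mean score exceeds the 0.8 threshold used in the study
for n in range(1, 21):
    if g_coefficient(n) > 0.8:
        print(n)
        break
```

Increasing the number of judges shrinks the error term and raises the coefficient, which is exactly the trade-off the D-study quantifies when deciding how many raters a severity panel needs.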
Validity analysis
Sixteen cases (out of 50) with known harmful outcomes were included to assess the validity of the scoring process. These cases were the same as those described in Dean and Barber’s assessment method: five MAEs reported in the literature that resulted in minor outcomes (no noticeable adverse effects), five with moderate outcomes (some adverse effects but no lasting impairment) and six with severe outcomes (death or lasting impairment). The method’s validity was thus tested by comparing the mean scores assigned by the 30 judges to these 16 MAEs with the previously established severity of their outcomes.
Test–retest reliability
Test–retest agreement was assessed using the Bland-Altman method, as an enhancement to the earlier studies. The Bland-Altman plot helps visualise and interpret test–retest agreement: by definition, 95% of the differences between repeated measures are expected to lie within the limits of agreement. We used the Bland-Altman method as an additional way to test reliability.
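The Bland-Altman quantities behind the plot can be sketched directly; the paired scores below are hypothetical stand-ins for the 10 rescored cases, not study data:

```python
import numpy as np

# Hypothetical severity scores for the 10 rescored cases on two occasions (illustrative)
first  = np.array([2.0, 8.0, 5.0, 1.5, 9.0, 4.0, 6.5, 3.0, 7.0, 5.5])
second = np.array([2.5, 7.5, 5.5, 2.0, 8.5, 4.5, 6.0, 3.5, 7.5, 5.0])

diff = first - second
bias = diff.mean()                # mean difference between occasions
sd = diff.std(ddof=1)             # SD of the differences
loa_low = bias - 1.96 * sd        # lower 95% limit of agreement
loa_high = bias + 1.96 * sd       # upper 95% limit of agreement

# Proportion of differences falling inside the limits of agreement
within = np.mean((diff >= loa_low) & (diff <= loa_high))
print(round(bias, 3), round(loa_low, 3), round(loa_high, 3), within)
```

A bias near zero with narrow limits of agreement indicates that the occasion of scoring contributed little systematic or random error.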
Comparison between the analyses performed in Germany, the UK and Brazil
The mean scores of the Brazilian study were compared first with those of the UK study5 and second with those of the German study.6 Paired-samples t-tests were performed, with statistical significance set at p<0.05 (95% CI).
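The paired comparison can be sketched with the standard-library form of the paired t-statistic; the per-case mean scores below are hypothetical (illustrative, not the published study means), and the critical value is the two-tailed t for df = 9 at α = 0.05:

```python
import math
from statistics import mean, stdev

# Hypothetical mean severity scores per case in two settings (illustrative data)
brazil = [2.1, 7.9, 5.2, 1.8, 8.8, 4.3, 6.4, 3.1, 7.2, 5.6]
uk     = [2.4, 7.5, 5.5, 2.0, 8.5, 4.6, 6.0, 3.4, 7.6, 5.1]

# Paired t-statistic: mean of the per-case differences over its standard error
diffs = [b - u for b, u in zip(brazil, uk)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))

t_crit = 2.262                      # two-tailed critical t for df = 9, alpha = 0.05
significant = abs(t_stat) > t_crit
print(round(t_stat, 3), significant)
```

In practice a statistics package reports the exact p-value; the comparison against the tabulated critical value shown here is equivalent for a fixed α of 0.05.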
Analyses
All analyses were conducted using the R programming language, V.4.0.3, and SPSS V.29.0.