Comparison of staff and resident health status ratings in care homes
Tim Benson (1,2), Clive Bowman (3)
  1. R-Outcomes Ltd, Thatcham, Berkshire, UK
  2. Institute of Health Informatics, UCL, London, UK
  3. Health Sciences, City University School of Health Sciences, London, UK
Correspondence to Dr Tim Benson; tim.benson{at}


Background Many care home residents cannot self-report their own health status. Previous studies have shown differences between staff and resident ratings. In 2012, we collected 10 168 pairs of health status ratings using the howRu health status measure. This paper examines differences between staff and resident ratings.

Method HowRu is a short generic person-reported outcome measure with four items: pain or discomfort (discomfort), feeling low or worried (distress), limited in what you can do (disability) and require help from others (dependence). A summary score (howRu score) is also calculated. Mean scores are shown on a 0–100 scale. High scores are better than low scores. Differences between resident and staff reports (bias) were analysed at the item and summary level by comparing distributions, analysing correlations and a modification of the Bland-Altman method.

Results and conclusions Distributions are similar superficially but differ statistically. Spearman correlations are between 0.55 and 0.67. For items, more than 92.9% of paired responses are within one class; for the howRu summary score, 66% are within one class. Mean differences (resident score minus staff score) on a 0–100 scale are discomfort (−1.11), distress (0.67), disability (1.56), dependence (3.92) and howRu summary score (1.26). The variation is not the same for different severities. At higher levels of pain and discomfort, staff rated residents’ discomfort and distress as better than did residents. On the other hand, staff rated disability and dependence as worse than did residents. This probably reflects differences in perspectives. Red amber green (RAG) thresholds of 10 and 5 points are suggested for monitoring changes in care home mean scores.

  • healthcare quality improvement
  • nursing homes
  • quality measurement
  • surveys
  • patient reported outcome measures

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



The role of care homes is to provide care and community for a population of people defined by various combinations of mental and physical dependency. Clearly, care home effectiveness should be monitored from a resident perspective whereas typically it is presumed to be good if various processes and regulatory standards are met. Using a simple person-reported outcome measure (PROM) may provide valuable insight but, with typically over 70% of residents having significant cognitive impairment, frailty or being in terminal decline,1 acquiring survey data is challenging.

An alternative is to ask the staff familiar with the residents to rate them as a proxy. Previous studies have shown that paired assessments by staff and by residents and by staff and relatives give varied results.2–6 These studies were small, covered residents with varying levels of dementia and did not examine the differences in detail.

The aim of this paper is to assess how well staff and residents agree about perceptions of health status, based on a large sample of paired assessments by staff proxies and residents.7


Data were collected as part of the 2012 Bupa census, which reported on 24 506 residents in 395 care homes in the UK, Australia and New Zealand.

This paper covers 10 168 matched assessments of health status by staff and residents using the howRu health status measure.8 This is a companion to our previous paper, which examined the construct validity of using howRu, rated by staff proxies in care homes, to assess resident health status.9 Full details of the data collection method using optically mark readable forms are provided in that paper.

HowRu is a short generic measure of health-related quality of life or health status. It forms part of a large family of PROMs and person-reported experience measures, completed by patients (or care home residents) and by staff.10 HowRu has been validated for use at the individual patient level,11 and for construct validity in ambulatory care in comparison with EQ-5D,12 13 and SF-12.8

Resident assessments were collected at the same time as staff assessments and shared the same bar-code identifier. The resident form is shown in figure 1. It also includes a measure of resident experience (howRwe),14 and a version of the Net Promoter Score,15 but these are not discussed further here.

HowRu asks the question “How are you today?”, referring to the past 24 hours; this is the question answered by residents. The question answered by staff is “How is the resident today?” HowRu has four items:

  • Pain or discomfort—physical symptoms.

  • Feel low or worried—distress and emotional symptoms.

  • Limited in what I (he/she) can do—disability, activities of daily living and leisure activities.

  • Require help from others—dependency and self-care.

Each item has four possible responses: Extreme, Quite a lot, A little and None. At the individual level, these are scored from 0 (Extreme) to 3 (None). The summary howRu score is the sum of the item scores, giving a scale with 13 possible values with a range from 0 (4 × Extreme) to 12 (4 × None).

At the aggregate level, used here, all scores are transformed to a scale from 0 to 100. Individual item scores are multiplied by 100 and divided by 3; individual summary scores are multiplied by 100 and divided by 12. Using a common 0–100 scale aids understanding and comparison.
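The scoring rules above can be sketched in a few lines of Python. This is an illustrative sketch only: the dictionary, the short item names and the function names are assumptions made for the example, not part of the howRu instrument or of the authors' analysis.

```python
# Response labels scored as described in the text: 0 (Extreme) to 3 (None).
RESPONSE_SCORES = {"Extreme": 0, "Quite a lot": 1, "A little": 2, "None": 3}

# Short item names used only for this example.
ITEMS = ("discomfort", "distress", "disability", "dependence")


def howru_summary(responses):
    """Sum the four item scores to give the raw summary score (0-12)."""
    return sum(RESPONSE_SCORES[responses[item]] for item in ITEMS)


def item_to_100(item_score):
    """Rescale a raw item score (0-3) to the 0-100 scale."""
    return item_score * 100 / 3


def summary_to_100(summary_score):
    """Rescale a raw summary score (0-12) to the 0-100 scale."""
    return summary_score * 100 / 12


# Hypothetical single rating, invented for illustration.
example = {
    "discomfort": "A little",      # 2
    "distress": "None",            # 3
    "disability": "Quite a lot",   # 1
    "dependence": "A little",      # 2
}
raw = howru_summary(example)
print(raw, round(summary_to_100(raw), 1))  # 8 66.7
```

Both transformations map the worst possible rating to 0 and the best to 100, which is what allows item and summary means to be compared on one scale.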

This analysis uses all the returns with complete paired ratings for all four howRu domains.

Responses for each region were collated regionally and forwarded to a central scanning bureau for data entry. Data for both staff and resident forms were entered centrally by scanning the optically marked forms. The data were imported into a database and exported to Excel and the JASP statistical package (version 0.11) for analysis.16

We examined the overall distribution of results for staff and resident ratings. Differences in mean scores were tested using the Wilcoxon signed rank test. Correlations between staff and resident ratings were assessed using the Spearman rank correlation (rs) and Cohen’s kappa coefficient (κ). Kappa is a measure of interobserver agreement that takes into account that raters will sometimes agree by chance.17 No adjustments for multiple testing were made. The level of agreement was measured in terms of exact and ±one class.
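As a rough illustration of the agreement measures, the sketch below computes exact agreement, agreement within one class and an unweighted Cohen's kappa for a small invented set of paired item scores. The data and function names are hypothetical, and the choice of unweighted kappa is an assumption, since the text does not say whether a weighted form was used. The analysis itself was run in JASP, not with this code, and the Spearman correlation is omitted here.

```python
from collections import Counter


def agreement(staff, resident):
    """Return (exact agreement, agreement within +/- one class) as proportions."""
    pairs = list(zip(staff, resident))
    exact = sum(s == r for s, r in pairs) / len(pairs)
    within_one = sum(abs(s - r) <= 1 for s, r in pairs) / len(pairs)
    return exact, within_one


def cohen_kappa(staff, resident):
    """Unweighted Cohen's kappa: (observed - expected) / (1 - expected).

    Undefined (division by zero) if both raters always use one identical class.
    """
    n = len(staff)
    observed = sum(s == r for s, r in zip(staff, resident)) / n
    staff_counts = Counter(staff)
    resident_counts = Counter(resident)
    # Chance agreement: product of the two raters' marginal proportions per class.
    expected = sum(staff_counts[c] * resident_counts[c] for c in staff_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical paired item scores on the raw 0-3 scale.
staff_scores = [3, 3, 2, 2, 1, 0, 3, 2]
resident_scores = [3, 2, 2, 1, 1, 0, 3, 3]

exact, within_one = agreement(staff_scores, resident_scores)
kappa = cohen_kappa(staff_scores, resident_scores)
print(exact, within_one, round(kappa, 3))  # 0.625 1.0 0.478
```

In this toy example the raters agree exactly on 62.5% of pairs, but kappa is lower because some of that agreement would be expected by chance.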

Bias is the difference between two methods measuring the same things, such as a rating by staff (S) and self-rating by the resident (R). Here, bias is defined as resident score minus staff score (R−S).

Bland and Altman, in a highly cited paper,18 point out that reliance on mean scores and correlation does not mean that two methods agree. For example, mean scores may agree if bias is positive at high values and negative at low values; high correlations may be found when one measure is biassed consistently throughout its range or if bias is directly associated with value. They propose a method that plots bias (the difference between the two methods, (R − S)) against the average of the two methods (R+S)/2.
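A toy example may make the first of these points concrete. The three pairs of scores below are invented for illustration: the staff and resident means are identical, yet the bias is negative at low values and positive at high values.

```python
from statistics import mean

# Hypothetical scores on the 0-100 scale, invented for illustration only.
staff = [40, 50, 60]
resident = [30, 50, 70]

bias = [r - s for r, s in zip(resident, staff)]            # R - S
average = [(r + s) / 2 for r, s in zip(resident, staff)]   # (R + S) / 2

print(mean(staff), mean(resident))  # 50 50
print(list(zip(average, bias)))     # [(35.0, -10), (50.0, 0), (65.0, 10)]
```

Comparing only the two means would suggest perfect agreement, whereas plotting bias against the average reveals the association with value.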

Our data differ from that envisaged by Bland and Altman. (1) We have a large number of paired measurements, which means that it is not feasible to plot individual points. (2) Our data are categorical ordinal data, with a limited number of categories (not interval or ratio continuous data), so that individual categories contain hundreds or thousands of instances. However, we find that mean scores, and hence mean bias, can be treated as if they are interval with few problems.

We plot the overall mean bias between the two methods (mean (R − S)) on the y-axis, against the actual average scores ((R+S)/2) for each instance on the x-axis. The number of categories on the x-axis is (2m−1), where m is the number of possible categories for each measure. For example, each item has four possible values, so the number of possible average scores is 7. The howRu summary score has 13 possible values (0−12 inclusive), so there are 25 possible average scores. The bias for the floor and ceiling average scores is always zero.
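The count of (2m−1) possible average scores, and the zero bias at the floor and ceiling, can be checked by enumeration. The sketch below assumes raw integer scores from 0 to m−1; the function names are chosen for this example.

```python
from itertools import product


def possible_averages(m):
    """Distinct values of (R + S) / 2 when R and S each take scores 0..m-1."""
    return sorted({(r + s) / 2 for r, s in product(range(m), repeat=2)})


def pairs_with_average(m, average):
    """All (R, S) pairs whose average equals the given value."""
    return [(r, s) for r, s in product(range(m), repeat=2) if (r + s) / 2 == average]


item_averages = possible_averages(4)       # items: four response classes
summary_averages = possible_averages(13)   # summary score: 13 values, 0 to 12
print(len(item_averages), len(summary_averages))  # 7 25

# Only one pair can produce the floor or ceiling average, and in that pair
# R equals S, so the bias (R - S) there is necessarily zero.
print(pairs_with_average(4, 0), pairs_with_average(4, 3))  # [(0, 0)] [(3, 3)]
```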

To summarise:

  • Mean bias=mean (R − S).

  • Actual average score = (R+S)/2.

We also show the number of responses for each actual average score. This distribution is not normal, because the number of paired ratings showing exact agreement for any item is larger than the numbers showing non-agreement.
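The grouping step behind this modified Bland-Altman plot can be sketched as follows. The paired scores are invented and kept on the raw 0–3 item scale for brevity (multiplying by 100/3 would give the 0–100 scale); the function name is an assumption, and this is not the authors' own code.

```python
from collections import defaultdict


def bias_by_average(resident, staff):
    """Return {average score: (mean bias, number of pairs)} for paired ratings."""
    groups = defaultdict(list)
    for r, s in zip(resident, staff):
        groups[(r + s) / 2].append(r - s)   # x = (R + S) / 2, y = R - S
    return {
        avg: (sum(diffs) / len(diffs), len(diffs))
        for avg, diffs in sorted(groups.items())
    }


# Hypothetical paired item scores on the raw 0-3 scale.
resident_scores = [0, 1, 1, 2, 3, 3, 2, 0]
staff_scores = [0, 2, 1, 3, 3, 2, 2, 1]

table = bias_by_average(resident_scores, staff_scores)
for avg, (mean_bias, count) in table.items():
    print(avg, mean_bias, count)

# Overall mean bias across all pairs.
overall = sum(r - s for r, s in zip(resident_scores, staff_scores)) / len(staff_scores)
print(overall)  # -0.25
```

Whole-number averages can arise from exact agreement or from ratings two classes apart, whereas half-point averages can only arise from disagreement, which is one reason the counts per average score are uneven.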

Ethics statement

We carried out secondary analysis of data collected as part of a routine census of care home residents. The data were anonymous and the census was undertaken to evaluate the current services without randomisation, so ethics approval was not required or sought. No data were collected about identifiable people and there was no risk to individual residents.19

Patient and public involvement

Care home residents and staff collected the data as part of a census to collect management information. All data were anonymous.


The census covered 24 506 residents in 395 care homes across the UK, Australia and New Zealand. A total of 19 438 responses were received, of which 18 615 had health status data completed by staff and 10 712 by resident self-report. Of these, 10 168 responses (54.6% of staff ratings) included complete health status data rated by both staff and resident self-report. This paper uses this data set.

Table 1 shows the distribution for each paired howRu item of staff proxy (S) and resident self-report (R) ratings. The distributions of discomfort and distress are broadly similar but differ considerably from those of disability and dependence, which are also broadly similar.

Table 1

Overall distribution of staff (S) and resident (R) paired ratings of howRu items (n=10 168)

The distribution of howRu summary scores for staff and resident ratings is shown in table 2 and figure 2. Staff and residents generated the same summary score for 39.1% of residents and gave the same scores on all four items for 32.9%.

Table 2

Distribution of howRu summary scores for staff proxy and resident self-report (n=10 168)

Figure 2

Distribution of howRu summary scores rated by staff proxies and residents.

Table 3 shows the Spearman correlation, kappa and percentage of exact and plus or minus one class agreement between paired staff and resident self-ratings for each howRu item and the summary howRu score.

Table 3

Spearman’s correlation, Cohen’s kappa and levels of agreement between staff and resident ratings for howRu items and summary score (n=10 168)

Spearman correlations for both items and the summary score are between rs=0.54 and rs=0.67, which may be interpreted as moderate or strong. For items, kappa is between κ=0.43 and κ=0.53, which may be interpreted as being moderate. For the summary score, κ=0.31, which may be interpreted as being fair. For items, the percentage of exact agreement is between 59.8% and 68.9% and agreement within one class is between 92.9% and 95.9%. For the summary score, exact agreement is 39.1% and agreement within one class is 66.0%. This is acceptable given 12 df.

However, distributions, correlations and exact agreement do not tell the whole story.

Table 4 shows for each howRu item and the summary score the mean, SD, SE of the mean (SEM) and 95% confidence limits. We show staff proxy ratings (S), resident self-ratings (R), mean bias (R−S) and mean of staff and resident scores ((R+S)/2).

Table 4

Mean scores, SD, SE of the mean (SEM) and confidence limits (CL) for staff (S), resident (R), mean bias (R−S) and mean score (R+S)/2 for howRu items and summary score (n=10 168)

Figure 3 shows the mean bias (R−S) for each mean score for each of the four items (left-hand axis), together with the percentage of responses for each mean score value (right-hand axis).

Figure 3

Bias and distribution of ratings for the four howRu items (n=10 168).

Residents score Pain and discomfort worse than staff when it is bad, but not when they have little or no pain or discomfort. The average bias (R−S) is −1.11. Residents rate Feeling low or worried somewhat worse than staff do when it is bad but somewhat better than staff when happier. Overall, the average bias (R−S) is 0.67 (not significant, Wilcoxon signed-rank test p=0.054). Residents rate Limited in what you can do as somewhat higher (better) than do staff. The average bias (R−S) is 1.56. Residents rate Require help from others as substantially higher (better) than do staff. The average bias (R−S) is 3.92. The health status summary score is higher for residents than for staff. The average bias (R−S) is 1.26.

Figure 4 shows the mean bias (R–S) for the howRu summary score (left-hand axis), together with the percentage of responses for each mean score value (right-hand axis). Residents tend to rate themselves as having somewhat better health status than do staff, although the picture varies across the range. At the lower (worse) end, residents tend to score themselves lower than staff, while at the higher (better) end, residents score themselves as better than staff do.

Figure 4

Bias and distribution of ratings for the howRu summary score (n=10 168).


This is the largest study (n=10 168) of matched ratings by care home staff and residents (or patients) that we are aware of. The size of the data set means that our estimates for mean scores for this population are quite precise.

Correlations are moderate or high and levels of exact agreement are satisfactory. We found differences in the distribution of bias for each item and the overall summary score.

The distribution of health status ratings by staff differs from resident self-rating overall; these differences also differ for each dimension of health status. Bland and Altman’s contention, that differences in mean scores, correlations and exact agreement rates can miss important aspects of bias such as an association with value,18 is shown to be valid in the case of care home residents.

The probable explanations differ for each item.

Assessing how another person is feeling in terms of pain and distress is difficult. Residents may appear free of pain and distress, for example, when engaged in an activity yet suffer badly from night cramps or simply be low in mood or feel unhappy about loss of independence. Care home staff build their assessments from direct interaction and more general observation as well as from more formal assessment questioning. While we cannot determine which perspective should prevail, it may be that systematic robust staff observations using a PROM could avert unnecessary medication and consequent side effects.

Care home staff have broad day-to-day experience and judge disability and dependence in the context of people outside the care home. This may be more realistic, in a broader context, than the views of residents, who may reference their disability and dependence against that of other residents. This may lead them to believe that they can do more for themselves than they really can. Other residents may have little awareness of their limitations as a consequence of cognitive impairment.

The minimally important difference (MID) provides a measure of the smallest change that people regard as important.20 Ideally, an anchor-based approach is most appropriate, but in the absence of a suitable anchor, MID can be estimated using a distribution-based method. At the individual level, half a SD is a widely used criterion.21 For populations, the 95% CI = ±1.96(SD/√n). Sample size (n) is a critical factor. In practical terms, these tools are likely to be used to monitor the performance of care homes, or units within larger homes. For example, if a care home has 25 residents and SD=25 (see table 4), then the 95% CI is approximately ±10 and an appropriate MID threshold is ±5.
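The worked example can be reproduced directly from the formula in the text. The function name below is chosen for illustration, and the second call (n=100) is an added hypothetical case showing how the interval narrows as the number of residents grows.

```python
import math


def ci_half_width(sd, n, z=1.96):
    """Half-width of the 95% CI of a mean: z * SD / sqrt(n)."""
    return z * sd / math.sqrt(n)


# Example from the text: 25 residents, SD = 25 on the 0-100 scale.
small_home = ci_half_width(sd=25, n=25)
print(round(small_home, 1))   # 9.8, i.e. approximately +/-10

# Hypothetical larger unit with the same SD, for comparison.
large_home = ci_half_width(sd=25, n=100)
print(round(large_home, 1))   # 4.9
```

Because the half-width scales with 1/√n, quadrupling the number of residents halves the interval, which is why sample size is critical when setting thresholds for individual homes.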

Red amber green (RAG) rating is widely used in quality control and improvement work. It could be used with howRu as follows. If a care home is monitoring staff-reported howRu scores on a weekly or monthly basis, a change of less than five points in the mean score would be rated green. Between 5 and 10 points would be rated amber and should be reviewed. More than 10 points should be rated red and trigger immediate investigation to understand what is going on.
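A minimal sketch of this RAG rule is shown below. Two details are assumptions made for the example because the text leaves them open: the change is treated as an absolute difference regardless of direction, and changes of exactly 5 or exactly 10 points are rated amber. The function and parameter names are hypothetical.

```python
def rag_rating(previous_mean, current_mean, amber=5.0, red=10.0):
    """Classify the change in a care home's mean howRu score (0-100 scale).

    Assumptions: the absolute change is used, and the boundary values
    (exactly 5 or exactly 10 points) fall in the amber band.
    """
    change = abs(current_mean - previous_mean)
    if change > red:
        return "red"      # more than 10 points: investigate immediately
    if change >= amber:
        return "amber"    # between 5 and 10 points: review
    return "green"        # less than 5 points


print(rag_rating(70.0, 67.0))  # green
print(rag_rating(70.0, 63.0))  # amber
print(rag_rating(70.0, 58.0))  # red
```

A service that only wanted to flag deterioration could replace the absolute difference with a signed one; that variation is not discussed in the text.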

This study has used secondary analysis of data to examine the relationship between staff and resident self-report ratings of health status in care homes. The study was not originally conceived for this purpose. This analysis excludes residents who staff considered could not or should not self-complete the ratings, such as people with advanced dementia or close to end of life. A possible risk was that staff encouraged residents to give the same answers as themselves for all four items. The central team had little control over this. Staff and residents gave identical answers on all four items for 32.9% of residents, which does not seem too high.

The distributions of staff and resident ratings are similar superficially but differ in detail. Correlations between matched ratings for item and summary scores are moderate or strong. For items, more than 92.9% of paired responses are within plus or minus one class; for the howRu summary score, 66% are within plus or minus one class. Mean bias values (resident minus staff scores) on the 0−100 scale are discomfort (−1.11), distress (0.67), disability (1.56), dependence (3.92) and summary howRu score (1.26).


We have demonstrated, at scale, the differences between resident and staff assessments of resident health status using howRu.

Residents rated discomfort and distress lower (worse) than staff at severe levels, with bias associated with value. Residents rated their own disability and dependence as higher (better) than did staff, with bias not associated with value.

Staff may be better able to assess care home resident health status than can most residents, but may need training and should take care not to underestimate severe pain and distress.

Tracking individual resident scores may provide a means to support residents proactively. RAG thresholds may provide a simple method to monitor changes in care home performance from a resident perspective.


The authors are grateful for the support of Bupa Care Services, which funded the census, and to the care home staff and managers who took part. They are grateful to Henry Potts of the UCL Institute of Health Informatics for statistical advice and for introducing them to the Bland-Altman method.



  • Twitter @timbenson

  • Contributors TB designed the surveys with CB and wrote the first draft of the paper. TB performed the analyses. Both authors managed the data collection, contributed to the final text, read and approved the final manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests TB is a director and shareholder of R-Outcomes, which provides quality improvement and evaluation services in the health and social care sectors using howRu. CB was previously medical director of Bupa Care Services and is a non-executive director of AKARI Care Homes, FINCCH and Invatech Health, all of which have interests in care homes and social care.

  • Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.

  • Patient consent for publication Not required.

  • Ethics approval The authors carried out secondary analysis of data collected as part of a routine census of care home residents. The data were anonymous and the census was undertaken to evaluate the current services without randomisation, so ethics approval was not required or sought. No data were collected about identifiable people and there was no risk to individual residents.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available upon reasonable request.