Introduction

During the last decade, and in particular since patients began participating in Outcome Measures in Rheumatology (OMERACT) activities [1–3], there has been growing interest in the assessment of rheumatoid arthritis (RA) from the patient's perspective. Apart from the patient-reported outcomes traditionally evaluated in the standard assessment of RA, namely patient assessment of pain, functional disability and/or patient global assessment [4, 5], other health domains are also important for the patient, such as fatigue, wellbeing and sleep pattern [6–8]. Under the umbrella of the European League Against Rheumatism (EULAR), a patient-reported composite index, the Rheumatoid Arthritis Impact of Disease (RAID) score, has been proposed and validated [9–12].

This composite index includes seven domains (pain, function, fatigue, physical and psychological wellbeing, sleep disturbance and coping). Each domain is evaluated using a single question answered on a 0 to 10 numerical rating scale. Each domain also has a specific weight derived from a patient survey. The RAID score is a continuous variable ranging from 0 (best) to 10 (worst) (Table 1).

Table 1 Rheumatoid Arthritis Impact of Disease questionnaire*
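As an illustration of how a weighted composite of this kind is computed, the following minimal sketch combines seven 0 to 10 answers into a single 0 to 10 score. The weights are not given in this section; the values below are an assumption recalled from the published RAID description and should be checked against Table 1 before any use. The function name `raid_score` and the domain keys are hypothetical.

```python
# Illustrative only: these weights are an ASSUMPTION (recalled from the
# published RAID description); verify against Table 1 before relying on them.
RAID_WEIGHTS = {
    "pain": 0.21,
    "function": 0.16,
    "fatigue": 0.15,
    "sleep": 0.12,
    "physical_wellbeing": 0.12,
    "psychological_wellbeing": 0.12,
    "coping": 0.12,
}


def raid_score(answers):
    """Weighted sum of seven 0-10 answers, giving 0 (best) to 10 (worst)."""
    if set(answers) != set(RAID_WEIGHTS):
        raise ValueError("expected exactly the seven RAID domains")
    for domain, value in answers.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{domain} must be between 0 and 10")
    return sum(RAID_WEIGHTS[d] * answers[d] for d in RAID_WEIGHTS)
```

Because the assumed weights sum to 1, a patient answering 10 to every question obtains the worst possible score of 10, and a patient answering 0 throughout obtains 0.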

In general, the results of clinical studies and trials are reported at group level, for example by the mean change from baseline, and this makes it challenging to determine the relevance of the results for an individual patient. In order to make interpretation easier, data may be presented at individual level by considering the proportion of patients with an improvement above a threshold of an important change from baseline. Moreover, apart from the concept of improvement ('feeling better'), the concept of status ('feeling good') has become increasingly important [13]. In order to assess these individual outcomes, continuous outcome measures (absolute value or change in RAID score) for each patient must be converted into a dichotomous variable (that is, change from baseline above a clinically relevant cutoff defining an important improvement from the patient's perspective, or absolute value below a clinically relevant cutoff defining an acceptable or good condition from the patient's perspective). These cutoffs have been called Minimal Clinically Important Improvement (MCII) for improvement and Patient Acceptable Symptom State (PASS) for status [14–17].
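The conversion from a continuous score to a dichotomous outcome can be sketched as follows. This is a minimal illustration, not the study's analysis code: the function names are hypothetical, the cutoffs are passed in as parameters, and improvement is assumed to be expressed as baseline minus follow-up so that a positive value means the patient is better.

```python
def is_responder(baseline, follow_up, mcii):
    """True when the improvement (baseline minus follow-up) reaches the MCII cutoff."""
    return (baseline - follow_up) >= mcii


def in_acceptable_state(follow_up, pass_cutoff):
    """True when the follow-up score is at or below the PASS cutoff."""
    return follow_up <= pass_cutoff


def proportion(flags):
    """Share of patients for whom the dichotomous outcome is True."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else float("nan")


# Hypothetical (baseline, follow-up) RAID scores for three patients.
patients = [(6.0, 2.5), (5.5, 5.0), (7.0, 1.5)]
responders = [is_responder(b, f, mcii=3.0) for b, f in patients]
acceptable = [in_acceptable_state(f, pass_cutoff=2.0) for _, f in patients]
```

Reporting `proportion(responders)` and `proportion(acceptable)` then gives the individual-level presentation described above, in place of a group mean.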

Three types of technique have been proposed to determine thresholds. According to the first simple empirical method, an absolute change of at least 1 or 2 points on a 0 to 10 scale [18, 19] or a relative change of at least 20, 30 or 50% [20, 21] have been proposed as thresholds for several patient-reported outcomes in rheumatic disorders. The second technique considers a change to be relevant when it exceeds the measurement error [22]. The third technique uses a gold standard anchor (usually the patient's global assessment) to determine the threshold from the best ratio between sensitivity and specificity using the receiver operating characteristic (ROC) curve [23], correct classification probabilities [24] or the 75th percentile [14, 15].

To our knowledge, despite the recognition of the validity of the RAID questionnaire, no formal threshold has been proposed to present results at the individual level. Furthermore, although several techniques have been used to establish threshold values as detailed above, no comparison has been performed between these techniques.

We were therefore prompted to conduct a study in order to define and evaluate the validity of cutoffs for the RAID score using the different techniques described above.

Materials and methods

Study design

This study was a multi-center, open-label, single-arm trial with a screening visit, a baseline visit (assuming patient disease activity was stable across the two visits) and visits after 4 and 12 weeks of etanercept therapy (ClinicalTrials.gov identifier NCT00768053). For each patient, written informed consent was obtained according to the Declaration of Helsinki. The study was approved by the Institutional Review Board of Cochin Hospital, Paris, France.

Inclusion criteria

To be eligible for the study, patients had to have definite RA fulfilling the 1987 criteria of the American College of Rheumatology [21]. The disease had to be active according to the following definition: Disease Activity Score 28-erythrocyte sedimentation rate (DAS28-ESR) > 3.2 and at least one of the following: ≥ 4 swollen joints, C-reactive protein (CRP) ≥ 10 mg/l or ESR ≥ 28 mm in the first hour. In addition, the patient had to be eligible for tumor necrosis factor (TNF) blocker therapy as recommended by the French Society of Rheumatology [25].

Collected data

Patients' age, gender and disease characteristics (duration, anti-citrullinated protein antibody (ACPA) status) were collected at screening. The DAS28 [26], modified health assessment questionnaire (mHAQ) [27] and RAID questionnaire were collected at screening, baseline and after 4 and 12 weeks of etanercept therapy. In addition, after 4 and 12 weeks of etanercept therapy, the patients assessed their condition by answering the following dichotomous question: 'If you were to remain during the next few months as you were during the last 48 hours, would this be acceptable to you: yes - no?' [15].

At weeks 4 and 12, patients assessed their change from baseline by answering the following questions: 'Think about all the ways your rheumatoid arthritis has affected you during the last 48 hours. Compared to when you started the study, how have you been during the last 48 hours? a) Improved, b) No change, c) Worse. If you answered "improved" to the previous question, how important is this improvement for you? a) Very important, b) Moderately important, c) Slightly important, d) Not important at all.'

Statistical analyses

Statistical analysis was conducted in two steps. We first determined thresholds that may be used to define responders (that is, improved patients) and patients in an acceptable condition (that is, good/acceptable status). The improvement threshold was determined by different techniques:

  1. An empirical technique based on proposals in the rheumatology scientific literature (for example, an absolute change of at least 1 and 2 in the 0 to 10 RAID score; a relative change of at least 20, 30 and 50% versus baseline) [18–20].

  2. A technique based on the reliability of the RAID score, considering that a relevant change at the individual level should at least exceed the measurement error of the technique. For this purpose, the data collected at screening and baseline (an interval during which disease activity was considered stable) were used to assess the relative reliability by calculating the intraclass correlation coefficient (ICC) with its 95% confidence interval (CI), the absolute reliability based on Bland and Altman plots, presenting the 95% limits of agreement on a graph [22], and the proposed threshold as the smallest detectable change, defined as 1.96 × SD of the changes/√2 [28].

  3. The third technique was an anchored method based on the patient's perspective. The external anchor was the general question on patient perception of change in comparison to baseline. The threshold RAID score for improvement was determined in three different ways. First, the RAID-MCII threshold was determined as the 75th percentile of the distribution of changes in RAID score for patients perceiving a slight or moderate improvement [14]. For this purpose, we considered as a potential threshold the score for which 75% of the patients in the targeted category had a value below this score. The second analysis used the correct classification probabilities [24]. For this purpose, we calculated the sensitivity (percentage of patients with a measured change in RAID score below the threshold among patients considering their condition to be at least slightly or moderately improved) and specificity (percentage of patients with a change in RAID score above the threshold among patients not considering their condition to be at least slightly or moderately improved). These values were calculated for a range of possible cutoffs and plotted on a graph. The choice of proposed cutoff for this analysis was based on maximal sensitivity and specificity using the graphic representation of correct classification probabilities. The third analysis used nonparametric ROC curves [23]. The optimal cutoff was determined by minimizing the number of misclassified patients. These evaluations were performed for the absolute and relative changes after 4 and 12 weeks of etanercept therapy.
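The anchored cutoff search described in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, change is expressed as baseline minus follow-up (positive means improved), a patient is classified as improved when the change is at or above the cutoff, and the anchor is a boolean flag taken from the patient's own answer.

```python
def sens_spec(changes, improved, cutoff):
    """Sensitivity and specificity of 'change >= cutoff' against the patient anchor.

    Assumes at least one improved and one non-improved patient are present.
    """
    pairs = list(zip(changes, improved))
    tp = sum(1 for c, i in pairs if i and c >= cutoff)
    fn = sum(1 for c, i in pairs if i and c < cutoff)
    tn = sum(1 for c, i in pairs if not i and c < cutoff)
    fp = sum(1 for c, i in pairs if not i and c >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff(changes, improved, candidates):
    """Candidate cutoff that misclassifies the fewest patients (first one on ties)."""
    def errors(cutoff):
        return sum(1 for c, i in zip(changes, improved) if (c >= cutoff) != i)
    return min(candidates, key=errors)


# Hypothetical data: improvement in RAID score and the patient's anchor answer.
changes = [0.2, 0.5, 1.1, 1.8, 2.4, 3.0]
improved = [False, False, True, False, True, True]
grid = [round(0.1 * k, 1) for k in range(0, 31)]  # cutoffs 0.0 to 3.0 in steps of 0.1

cutoff = best_cutoff(changes, improved, grid)
sensitivity, specificity = sens_spec(changes, improved, cutoff)
```

Scanning a grid of candidate cutoffs and plotting sensitivity and specificity at each one gives the correct classification curve, while keeping only the cutoff with the fewest misclassified patients corresponds to the ROC-based choice.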

Several similar techniques were also used to determine thresholds describing patients with an acceptable status/condition according to the RAID score:

  1. Empirical method with thresholds ≤ 1, ≤ 2 and ≤ 3 according to previous proposals in the rheumatology scientific literature.

  2. The anchored method based on the patient's perspective using, as an external gold standard, the general question about patients' perceptions of their condition during the 48 hours before the visit. The RAID threshold for acceptable status was determined using the three different analyses described for the improvement (that is, the 75th percentile, the correct classification probabilities and the ROC curve technique), with the patients considering their condition as acceptable used as the gold standard. All analyses were performed after 4 and 12 weeks of etanercept therapy.

We then evaluated the validity of all the proposed thresholds. Two external anchors were chosen for this purpose. The first was the gold standard from the patient's perspective: for each proposed MCII threshold, we calculated the percentage of patients with a change above the threshold and who considered their condition to be at least slightly improved, among all patients who considered their condition to be at least slightly improved (that is, sensitivity) and also the percentage of patients with a change below the threshold and who considered their condition to be either worse, unchanged or only slightly improved, among all patients who considered their condition to be worse, unchanged or only slightly improved (that is, specificity). The positive likelihood ratio (LR) was then calculated. An LR greater than one indicates an increased probability that the targeted disorder is present. In our study, the use of the LR was transposed to express the performance of the RAID threshold in reflecting patients' perspectives. Higher values are indicative of better-performing thresholds. All analyses were conducted at 4 and 12 weeks after initiation of etanercept therapy.
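For reference, the positive LR follows from sensitivity and specificity through the standard definition, sensitivity divided by (1 − specificity). The sketch below uses a hypothetical function name and hypothetical input values; it is not taken from the study's analysis.

```python
def positive_lr(sensitivity, specificity):
    """Positive likelihood ratio: sensitivity / (1 - specificity)."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie between 0 and 1")
    if specificity == 1.0:
        # No false positives: the ratio is unbounded (undefined if sensitivity is 0).
        return float("inf") if sensitivity > 0 else float("nan")
    return sensitivity / (1.0 - specificity)


# Hypothetical threshold with 80% sensitivity and 90% specificity.
lr = positive_lr(0.80, 0.90)
```

A threshold that classifies no better than chance has sensitivity equal to 1 − specificity and therefore a positive LR of 1, which is why values above 1 are read as evidence that the threshold reflects the anchor.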

A similar analysis was conducted to evaluate the validity of the proposed thresholds of the RAID score to define an acceptable status. Here, the gold standard anchor was based on the patient's perspective by analyzing the patients considering (or not) their situation during the last 48 hours as acceptable. The second external anchor used to evaluate the validity of the proposed thresholds was based on the DAS28-ESR. This composite index is considered to be relatively physician-oriented as it comprises one laboratory measure (ESR) and information collected at physical examination (number of swollen and tender joints) as well as a patient-reported outcome (patient's global assessment). Similar analyses (that is, evaluation of positive LR for the different proposed thresholds evaluating the concept of improvement and status) were conducted as described above.

Results

Patients and study course

Of the 120 patients screened, 108 entered the study and received at least one etanercept injection. During the 12 weeks of the trial, one patient was lost to follow-up and ten withdrew because of side effects. The main characteristics of the 108 recruited patients were as follows: age (mean ± SD), 54 ± 13 years; 75% female; 61% ACPA-positive; disease duration, 8 ± 7 years; CRP, 18 ± 30 mg/l; DAS28-ESR, 5.4 ± 0.8.

Determination of thresholds

Relevant improvement threshold

The reliability of the RAID score between screening and baseline was very high (ICC = 0.85, 95% CI 0.79 to 0.90). The Bland and Altman graphic representation is illustrated in Figure 1. Using this technique, the smallest detectable difference (SDD) and the smallest detectable change (SDC) in the RAID score were 1.8 and 1.3 respectively.
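The relation between the two reported values can be checked with a short sketch. It assumes, as the reported figures suggest, that the SDD is 1.96 × SD of the screening-to-baseline differences and that the SDC is the SDD divided by √2; the function name and the example scores are hypothetical.

```python
import math
import statistics


def smallest_detectable(screening, baseline):
    """Return (SDD, SDC) from paired scores measured while the disease was stable.

    Assumption: SDD = 1.96 x SD of the paired differences, SDC = SDD / sqrt(2).
    """
    diffs = [b - s for s, b in zip(screening, baseline)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    sdd = 1.96 * sd
    sdc = sdd / math.sqrt(2)
    return sdd, sdc


# Hypothetical paired RAID scores for four patients.
sdd, sdc = smallest_detectable([5.0, 6.0, 4.0, 7.0], [5.5, 5.5, 4.5, 6.5])

# Consistency check on the reported values: 1.8 / sqrt(2) rounds to 1.3.
reported_sdc = round(1.8 / math.sqrt(2), 1)
```

Under this reading, the reported SDC of 1.3 is simply the reported SDD of 1.8 divided by √2 (about 1.27), rounded to one decimal place.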

Figure 1

Reliability of the rheumatoid arthritis impact of disease (RAID) score shown by Bland & Altman graphic representation. *Mean of RAID score values between screening and baseline; **difference in RAID score between screening and baseline. The data lines represent the 95% confidence interval resulting in a smallest detectable change of 1.3 (that is, smallest detectable difference = 1.8).

A graphic representation of correct classification probabilities was obtained, based on the patient's opinion, for the absolute changes in RAID score after 4 weeks of etanercept therapy (Figure 2). The sensitivity and specificity for clinically relevant change were obtained for each measured difference in RAID score (in steps of 0.1). This made it possible to obtain the best RAID threshold with maximal true positive and minimal false negative results, which was 1.0. A similar analysis was performed after 12 weeks of etanercept therapy and also for the relative change after 4 and 12 weeks of etanercept therapy, resulting in potential thresholds of 2.5, 25% and 42% respectively (Table 2).

Figure 2

Correct classification probability curve showing absolute change in rheumatoid arthritis impact of disease (RAID) score at week 4.

Table 2 Elaboration and evaluation of the external validity of the different potential thresholds defining a relevant improvement in the rheumatoid arthritis impact of disease (RAID) score

Figure 3 shows the ROC curves for absolute changes in RAID score after 4 weeks of therapy, resulting in an optimal threshold of 1.6. Similar analyses were performed for the absolute changes after 12 weeks of therapy and for the relative changes after 4 and 12 weeks of therapy, resulting in potential thresholds of 3, 17% and 35% respectively (Table 2).

Figure 3

Receiver operating characteristic curve showing absolute change in rheumatoid arthritis impact of disease (RAID) score at week 4.

Figure 4 plots the distribution of absolute change from baseline in RAID score after 4 weeks of etanercept therapy among the 30 patients considering their condition to be slightly or moderately improved. Using this technique, 75% of these patients had a change in RAID score below 0.2, and therefore 0.2 was proposed as a potential optimal threshold (Figure 5). Similar analyses were performed to evaluate the absolute changes after 12 weeks and also the relative changes after 4 and 12 weeks, resulting in potential thresholds of 1.3, 6% and 25% respectively (Table 2).

Figure 4

Distribution of the absolute changes in rheumatoid arthritis impact of disease (RAID) score from baseline to week 4 in patients considering their condition to be slightly or moderately improved. EULAR, European League Against Rheumatism.

Figure 5

Proposals and evaluation of different thresholds for defining a clinically meaningful improvement in an absolute change in the rheumatoid arthritis impact of disease (RAID) score. a. Proposals of threshold according to different techniques and different times of evaluation. b. Evaluation of external validity (versus DAS28-ESR) of different proposed thresholds. c. Evaluation of external validity (versus patient's global assessment) of different proposed thresholds. *Thresholds, proposal based on the following techniques and time of evaluation: a, 75th percentile technique at week 4; b, empirical technique and correct classification at week 4; c, smallest detectable change and 75th percentile at week 12; d, ROC technique at week 4; e, empirical technique and correct classification at week 4; f, correct classification at week 12; g, ROC technique at week 12. +Positive likelihood ratio (higher values are indicative of better performing thresholds. See Methods for further explanation). DAS28-ESR, Disease Activity Score 28-erythrocyte sedimentation rate; ROC, receiver operating characteristic.

Table 2 and Figures 5 and 6 summarize the proposed thresholds resulting from these different techniques. These ranged from 0.2 (75th percentile technique at week 4) to 3 (ROC technique at week 12) for defining an MCII in the absolute change in RAID score, and from 6% (75th percentile technique at week 4) to 50% (empirical technique) for an MCII in the relative change in RAID score.

Figure 6

Proposals and evaluation of different thresholds for defining a clinically meaningful improvement in a relative change in the rheumatoid arthritis impact of disease (RAID) score. a. Proposals of threshold according to different techniques and different times of evaluation. b. Evaluation of external validity (versus DAS28-ESR) of different proposed thresholds. c. Evaluation of external validity (versus patient's global assessment) of different proposed thresholds. * Thresholds, proposal based on the following technique and time of calculation: a, 75th percentile technique at week 4; b, ROC technique at week 4; c, empirical technique; d, correct classification at week 4 and 75th percentile at week 12; e, 75th percentile at week 12; f, ROC technique at week 12; g, correct classification at week 12; h, empirical technique. +Positive likelihood ratio (higher values are indicative of better performing thresholds. See Methods for further information). DAS28-ESR, Disease Activity Score 28-erythrocyte sedimentation rate; ROC, receiver operating characteristic.

Threshold for defining an acceptable status

Table 3 and Figure 7 summarize the thresholds proposed by the different techniques used, ranging from a minimal score of 1 (empirical technique) to 4.2 (75th percentile and ROC technique at week 4) for the definition of a patient-acceptable symptom state in the RAID score.

Table 3 Elaboration and evaluation of the external validity of the different potential thresholds defining an acceptable status in the rheumatoid arthritis impact of disease (RAID) score

Figure 7

Proposals and evaluation of different thresholds for defining an acceptable status according to the rheumatoid arthritis impact of disease (RAID) score. a. Proposals of threshold according to different techniques and different times of evaluation. b. Evaluation of external validity (versus DAS28-ESR) of different proposed thresholds. c. Evaluation of external validity (versus patient's global assessment) of different proposed thresholds. * Thresholds, proposal based on the following technique and time of calculation: a, 75th percentile technique at week 4; b, ROC technique at week 4; c, empirical technique; d, correct classification at week 4 and 75th percentile at week 12; e, 75th percentile at week 12; f, ROC technique at week 12; g, correct classification at week 12; h, empirical technique. +Positive likelihood ratio (higher values are indicative of better performing thresholds. See Methods for further information). DAS28-ESR, Disease Activity Score 28-erythrocyte sedimentation rate; ROC, receiver operating characteristic.

Evaluation of proposed thresholds

Evaluation of improvement thresholds

Table 2 summarizes the sensitivity, specificity and positive LR for each proposed threshold for the two external anchors (that is, patient's perspective and DAS28-ESR). These analyses showed that the positive LR was above 1 for all the proposed thresholds (Figure 5). However, the highest values were observed for a threshold of 3 for the absolute change (with a corresponding positive LR of 6.9 and 2.0 for the patient's perspective and DAS28-ESR external gold standards respectively, at week 12). Concerning the relative changes, the highest positive LRs (4.9 and 1.9 for the patient's perspective and DAS28-ESR external gold standards respectively, at week 12) were observed for an improvement of at least 50% (Figure 6).

Evaluation of acceptable status thresholds

Table 3 summarizes the sensitivity, specificity and positive LR for each proposed threshold and for the two external anchors (that is, patient's perspective and DAS28-ESR). As for the improvement thresholds, all the thresholds proposed for defining an acceptable symptom state had a corresponding positive LR > 1 (see Figure 7). A maximum score of 2 had the highest positive LR (10.1 and 3.4 for the patient's perspective and DAS28-ESR external gold standards respectively, at week 12).

Discussion

This study clearly shows that the threshold value for a continuous variable, defining a relevant improvement or an acceptable symptom state, closely depends on the measurement technique. This first observation prompted us to perform a systematic validation of the proposed thresholds. In this study, we evaluated the validity of each proposed threshold by calculating the probability of being considered in good condition, by using external gold standards for the group of patients that were below or above the proposed threshold. We used two external gold standards reflecting both the patient's perspective (the patient's global assessment) and the physician's perspective (the DAS28-ESR) and calculated the positive LR: the best threshold was considered to be that with the highest observed positive LR. Using this methodology, we were able to propose an absolute change of at least 3, a relative change of at least 50%, and a maximum score of 2, as optimal thresholds for the RAID score, to define an absolute and relative MCII, and an acceptable symptom-state respectively.

This study has some weaknesses, but also several strengths. The very wide range of threshold values proposed using different methodologies raises the question of the optimal way to address this issue. All the techniques used in this study have been previously adopted, though no consensus has been reached in the field of clinical epidemiology [10–21]. This can be easily explained by the different rationales of each technique: the empirical technique involves asking physicians to propose relevant thresholds based on the simplicity of their proposal or their experience [20]. The aim of another technique is to avoid proposing a value below the measurement error of the outcome measure, as any interpretation of results using a threshold below the noise due to this measurement error is hazardous [22, 23]. Finally, the techniques using an external anchor are also very relevant [14, 15]. Although the validity of this external anchor may be questioned (here we used the previously reported gold standard MCII and PASS questions, which might raise the issue of circular reasoning), these techniques make it possible to select the optimal threshold based on the arguments for and against, favoring either sensitivity (for example, the 75th percentile technique [14, 15]) or both sensitivity and specificity (for example, the ROC curve and correct classification probability techniques [23, 24]). In this study, therefore, we decided to use all the different techniques in a uniform group of patients (that is, active definite RA requiring a TNF blocker) receiving the same TNF blocker (etanercept). Despite this, we observed very wide variability in the thresholds proposed by these analyses. From our point of view, such variability justifies a systematic evaluation of the validity of any proposed threshold, and the main question is how to define the optimal methodology for evaluating such validity.

In this study, we approached this question by calculating the capacity of a proposed threshold to adequately classify a patient by considering previously validated external anchors from both a patient's perspective and a physician's perspective. The MCII and PASS questions were considered to be a gold standard anchor for the patient's perspective [14, 15]. Because we also used the MCII and PASS questions for the elaboration of such thresholds, one might be concerned by the potential circular reasoning of this approach. This is why we decided to use not only this external anchor but also another one (the DAS28), which is considered to reflect the physician's perspective when evaluating a patient [26]. We then calculated the positive LR. This approach, using two different external anchors, resulted in quite good concordance between the two analyses for each proposed threshold, strengthening our finding. This agrees with the results of a previous study suggesting that the PASS corresponds to moderate disease activity [29]. The data presented in the figures also suggest that the most stringent thresholds are the most valid, at least with regard to our definition of external validity.

A weakness of this study was the fact that we were unable to evaluate the discriminant capacity of the proposed thresholds in order to validate them. Another potential weakness is that the proposed thresholds were defined in a single study with a relatively small sample size. On the other hand, the strength of this study is that all these different analyses of the definition of thresholds for a continuous variable were performed on a uniform group of patients. Despite these points, using our methodology and calculating the positive LR using two external anchors, we found a difference between the different thresholds, so that we were able to propose an absolute change of 3 points and a relative change of 50% for defining a clinically relevant improvement, and a maximum score of 2 for defining an acceptable status. Further studies in different patient populations, evaluating different facets of validity (including for example, the evaluation of discriminatory capacity), are necessary to confirm these proposals.