Introduction

The economic evaluation of health programs is increasingly a prerequisite for obtaining funding from third-party payers seeking the best value from a limited health budget. Where treatment is expected to impact on health-related quality of life (HRQoL), selecting an appropriate outcome measure frequently entails a trade-off between the sensitivity of available instruments for the disease or condition under study and the comparability (and therefore policy-relevance) of study results. Leaving aside the question of whether disease-specific outcome measures really are more sensitive than more generic measures, a number of difficulties arise in selecting a comparable outcome measure for use in economic evaluation.

While the minimal clinically significant improvement on a descriptive measure such as the SF-36, NIHSS or Barthel could be used to partition the trial population into responders and non-responders before expressing findings in terms of cost per additional responder, such an approach would not achieve comparability of findings even in the event that every other evaluation was also to express results in terms of responders. Because descriptive measures lack the weak interval property, there is no guarantee that a 10-point improvement at the upper end of the scale is equivalent to a 10-point improvement at the lower end of the scale. The weak interval property simply requires that a given numerical change along a scale should have the same meaning regardless of the direction and location of that change [1]. Descriptive measures such as the SF-36, NIHSS and Barthel provide an interval scale only by coincidence because items receive either an ad hoc or equal weighting when calculating subscale or dimension scores (and subscales or dimensions typically receive either an ad hoc or equal weighting when calculating scale scores). Or, as Gold et al. [2] put it, descriptive measures "assume that the number of items on each dimension provides an adequate reflection of the importance of the various domains contained in the questionnaire. ...simply summing numerical weightings across questions on a scale does not guarantee that changes in scores will coincide with changes in health status that are seen as better or worse by patients or the general public" (p97–98).

To achieve comparability across interventions and across disease-areas, cost-effectiveness analysis is increasingly eschewed in favour of cost-utility analysis with the quality adjusted life year (QALY) providing a common metric for the valuation of mortality and relevant dimensions of HRQoL. Richardson [1] describes the conditions under which QALY-weights can be considered to have strong and weak interval properties. Selecting a comparable outcome measure for use in economic evaluation then reduces to a choice between alternative methods of obtaining QALY-weights that reflect preferences over health states observed in the study population [2, 3]. QALY-weights could, for example, be directly elicited from study participants using a preference-based scaling technique such as the time trade-off (TTO) to value their own health state, or by using a preference-based multi-attribute utility instrument such as the EQ5D to assign a 'stock' QALY-weight (obtained from another population during scaling) to questionnaire responses describing each participant's own health state [4].

There are, however, many circumstances when – because of timing, lack of foresight or cost considerations – only descriptive (rather than preference-based) measures of quality of life are available and some other means of obtaining QALY-weights becomes necessary. In such circumstances, the use of regression-based transformations or mappings can circumvent the failure to elicit QALY-weights from study participants by allowing predicted scores for preference-based measures such as the EQ5D or TTO to proxy for directly observed EQ5D or TTO scores. This regression-based approach to estimating a statistical transformation or exchange rate from a descriptive measure of HRQoL to a preference-based measure of HRQoL has been dubbed 'Transfer to Utility' (TTU) regression [5]. Given the development of a suitable regression-based transformation, TTU regression permits conversion of outcomes commonly used in clinical trials into the common metric of QALYs. While this constitutes a second best approach, it represents an extremely useful technique in the absence of the widespread use of preference-based measures in the conduct of clinical trials.

The principle underlying the TTU approach is that both descriptive and preference-based health outcome instruments estimate the effect of the intervention with respect to one or more relevant dimensions of HRQoL. To the extent that the coverage and sensitivity of the two instruments correspond, the difference between instruments arises due to outright errors that might be reflected in the reliability of each instrument (or lack thereof) and/or due to any between-instrument difference in the weights placed on each dimension. In an attempt to close the gap between a descriptive measure and a preference-based measure, regression-based algorithms discard the equal or ad hoc weighting of descriptive measures and instead weight each item, subscale or scale entering the regression according to the magnitude and direction of association with a preference-based regressand. While the coverage and sensitivity of any two given instruments are unlikely to correspond purely by chance, previous applications of the TTU approach have demonstrated that there is enough commonality between generic descriptive measures and generic preference-based measures to derive a transformation with adequate predictive validity for between-group comparisons [6–10].

For the majority of descriptive condition-specific outcome measures, there is no preference-based alternative with comparable sensitivity and coverage. It is therefore possible that the evidence for generic to generic transformations may not be applicable in the case of condition-specific to generic transformations. Transformation of descriptive condition-specific measures to a generic preference-based measure would typically require mapping from a detailed description of a relatively narrow area of HRQoL space to a general description of the entire HRQoL domain. We might therefore expect a condition-specific to generic transformation to be relatively poor when compared against a generic to generic transformation. However, the validity of this a priori expectation is yet to be tested for stroke-specific outcome measures and the extent of any additional error when transforming from descriptive stroke-specific measures to preference-based measures has yet to be quantified.

The purpose of the present study is to demonstrate the feasibility and value of TTU regression in stroke by deriving a transformation from two descriptive stroke-specific measures and a generic measure of health status to a preference-based measure of HRQoL in a sample of Australians with a diagnosis of acute stroke. This will allow quantification of the additional error associated with a condition-specific to generic transformation as compared to a generic to generic transformation in stroke. The resulting transformations will provide a valuable tool for investigators evaluating stroke interventions, potentially widening the set of descriptive stroke-specific measures of HRQoL that can be transformed to preference-based measures for the purposes of economic evaluation.

Materials and methods

Data

Data were obtained from the North East Melbourne Stroke Incidence Study (NEMESIS) [11]. The sample for the present study included 926 persons with a diagnosis of acute stroke under the World Health Organization (WHO) definition [12], drawn from a defined area of 22 postcodes in inner northeast Melbourne, Australia during the period May 1, 1996 to April 30, 1999. Further details regarding the study population and case ascertainment are provided elsewhere [11]. The average age of respondents in the study sample was 73.4 years (SD = 13.51), with 51.7% of respondents being female. The NEMESIS study protocol scheduled repeated observations on respondents, with observations available at up to six time points in our 926 respondents. Due to missing data, an AQoL index score paired with a valid scale, subscale or index score on at least one of the SF-36, NIHSS and Barthel could not be derived for all 926 respondents. The 859 participants who had, for at least one time point, a valid AQoL index score paired with a valid scale, subscale or index score on at least one of the SF-36, NIHSS and Barthel at the same time point provided 2570 observations for analysis. The sub-samples available for the derivation and validation of each algorithm varied in size depending on the extent of missing data for the SF-36, NIHSS and Barthel.

Measures

The preference-based 'target' measure chosen was the Assessment of Quality of Life (AQoL) instrument [13, 14] – the only generic preference-based measure of HRQoL that has been scaled and validated in Australia for use in the general population [13, 14] and for use in people with stroke [15]. The AQoL descriptive system includes five dimensions: illness, independent living, social relationships, physical senses and psychological well-being. Four of the five dimensions and 12 of the 15 items contribute to the preference-based index score, with the illness dimension and associated items excluded because they are indicative of an underlying health condition rather than the impact of that health condition on HRQoL. The AQoL index score varies from -0.04 to 1.00 where unity designates full health, zero designates death, negative scores designate states worse than death, and the lower bound of -0.04 designates the AQoL's 'all worst health state'.

Three descriptive 'base' measures that are commonly used in stroke trials were available for analysis in the present study: the SF-36v1, the National Institutes of Health Stroke Scale (NIHSS) and the Barthel Index. The SF-36v1 [16, 17] is a generic measure of functional health status. It comprises 36 questions in eight subscales or dimensions: Physical Functioning (PF), Role Physical (RP), Bodily Pain (BP), General Health (GH), Vitality (VT), Social Function (SF), Role Emotional (RE) and Mental Health (MH). Each of the eight dimensions is separately scored, using item weighting and additive scaling, to yield a 0–100 point scale. These eight dimensions can be combined into two summary measures – physical function (PCS index) and mental health (MCS index), each on a 0–100 point scale with population means ± standard deviations (SD) equal to 50 ± 10 [17].

The NIHSS [18] measures the severity of physical impairment associated with stroke via a neurological examination across 15 items: level of consciousness (three items), eye movements (one item), visual fields (one item), facial weakness (one item), motor arm strength (two items), motor leg strength (two items), limb ataxia (one item), sensory function (one item), language (one item), articulation (one item), and extinction/inattention (neglect) (one item). Each item is scored from zero (lowest severity) to a maximum of two, three or four (highest severity), and item scores are summed over all items to provide an index of stroke severity that varies from zero (lowest severity) to 42 (highest severity) [18]. The Barthel Index [19] measures disability or functional status based on patient or proxy completion of ten items related to activities of daily living (ADL): feeding, dressing, grooming, bathing, toilet use, transfer, stairs, mobility, bladder, and bowels. Each item is scored from zero (lowest functional status) to a maximum of two, three or four (highest functional status), and item scores are summed over all items to provide an index of disability on a zero (lowest functional status) to 20 (highest functional status) scale [19].

Data analysis

We randomly selected approximately 50% of observations available for each algorithm into an estimation set (SF-36 = 1288 observations, NIHSS = 1302 observations, Barthel = 1316 observations), and retained remaining observations in a validation set (SF-36 = 1256 observations, NIHSS = 1268 observations, Barthel = 1252 observations) to allow 'post-sample' but 'within-context' tests of predictive validity. We found no significant difference between estimation and validation sets for SF-36, NIHSS or Barthel datasets with respect to gender (Pearson's chi-square χ2 ≤ 0.50, p ≥ 0.48), age (F(SF-36) = 0.41, p ≥ 0.52; F(NIHSS) = 0.10, p ≥ 0.76; F(Barthel) = 1.57, p ≥ 0.21), health status as measured by the SF-36 MCS (F(SF-36) = 0.04, p ≥ 0.84), SF-36 PCS (F(SF-36) = 1.68, p ≥ 0.195), Barthel Index (F(Barthel) = 0.87, p ≥ 0.350), NIHSS (F(NIHSS) = 0.63, p ≥ 0.426), or health-related quality of life as measured by the AQoL (F(SF-36) = 0.30, p ≥ 0.59; F(NIHSS) = 0.86, p ≥ 0.35; F(Barthel) = 0.73, p ≥ 0.39) where F statistics were obtained from one-way analysis of variance.
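
To make this step concrete, the following sketch illustrates a random ~50/50 split with balance checks (a chi-square test for gender and one-way ANOVA for continuous measures). It is written in Python rather than the SPSS/STATA used in the study, and the column names and simulated data are hypothetical.

```python
# Sketch: random ~50/50 split into estimation and validation sets,
# followed by balance checks; column names and data are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

def split_and_check(df, continuous_cols, sex_col="sex"):
    """Randomly assign observations to estimation/validation sets (~50/50)
    and test for differences between the two sets."""
    in_estimation = rng.random(len(df)) < 0.5
    est, val = df[in_estimation], df[~in_estimation]

    # Pearson chi-square test for the gender split
    table = pd.crosstab(in_estimation, df[sex_col])
    chi2, p_sex, _, _ = stats.chi2_contingency(table)
    print(f"{sex_col}: chi-square = {chi2:.2f}, p = {p_sex:.2f}")

    # One-way ANOVA for each continuous measure
    for col in continuous_cols:
        f, p = stats.f_oneway(est[col].dropna(), val[col].dropna())
        print(f"{col}: F = {f:.2f}, p = {p:.2f}")
    return est, val

# Hypothetical example data
demo = pd.DataFrame({
    "aqol": rng.uniform(-0.04, 1.0, 500),
    "age": rng.normal(73.4, 13.5, 500),
    "sex": rng.integers(0, 2, 500),
})
estimation_set, validation_set = split_and_check(demo, ["aqol", "age"])
```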

We first estimated the relationship between AQoL index scores and the three descriptive measures across the full range of stroke severity using multiple linear regression modelling (the 'all stroke' models). In an attempt to obtain further improvements in predictive validity, we subsequently re-estimated the best of our 'all stroke' models after partitioning the estimation set into NIHSS = 0–5 and NIHSS ≥ 6 subgroups ('severity-specific' models). For item-based algorithms, AQoL utility scores were regressed onto item scores. The inclusion of second-order and interaction terms in the item-based regressions was not practical given degrees of freedom constraints and the large number of first-order terms. First-order terms were retained in the item-based models solely on the basis of their contribution to the regression, as evaluated by the probability of F (enter p ≤ 0.05, remove p ≥ 0.10). For the subscale-, scale- or index-based algorithms, we regressed AQoL utility scores on subscale or scale scores plus interactions and second-order terms in the case of the SF-36, and on index scores plus second-order terms in the case of the NIHSS and Barthel algorithms. For all algorithms, we retained interaction and second-order terms where they made a significant individual or joint contribution to the regression based on the probability of F (enter p ≤ 0.05, remove p ≥ 0.10).

Some previous studies estimating scale- or subscale-based algorithms have retained all first-order terms for reasons of theoretical consistency – irrespective of their individual contributions to the model [9]. We identified some collinearity between SF-36 scale scores in our estimation sample (Pearson's r = 0.085, p < 0.01) but deemed PCS and MCS scores to be sufficiently orthogonal to follow precedent and retain both first-order terms for the scale-based regression. Likewise, index scores for the Barthel and NIHSS algorithms were retained irrespective of their individual contributions to the model. In contrast, the eight SF-36 subscales were highly collinear in the estimation sample such that the omission of one or more subscales from the subscale-based algorithm is consistent with theory. We therefore retained first-order terms in subscale-based regressions solely based on their contribution to the regression as evaluated by the probability of F (enter p ≤ 0.05, remove p ≥ 0.10).
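
The general form of the scale-based regression can be sketched as follows: AQoL scores are regressed on PCS, MCS, their squares and their interaction, with the first-order terms retained unconditionally and the higher-order terms kept only where they contribute. This is a simplification under hypothetical column names and simulated data, it uses pooled ordinary least squares, and it ignores the clustering of observations handled by the panel estimators described below.

```python
# Sketch of the scale-based model form (hypothetical data and names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"pcs": rng.normal(40, 10, n), "mcs": rng.normal(48, 11, n)})
df["aqol"] = (-0.6 + 0.012 * df["pcs"] + 0.011 * df["mcs"]
              + rng.normal(0, 0.15, n)).clip(-0.04, 1.0)

# Scale scores should be close to orthogonal before both are retained
print("corr(PCS, MCS) =", round(df["pcs"].corr(df["mcs"]), 3))

# Full model with first-order, second-order and interaction terms
full = smf.ols("aqol ~ pcs + mcs + I(pcs**2) + I(mcs**2) + pcs:mcs", data=df).fit()

# Always retain the first-order terms; keep higher-order terms only where
# they contribute (a simplification of the stepwise F-based criterion)
keep = [name for name in full.pvalues.index
        if name in ("pcs", "mcs")
        or (name != "Intercept" and full.pvalues[name] <= 0.05)]
final = smf.ols("aqol ~ " + " + ".join(keep), data=df).fit()
print(final.summary().tables[1])
```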

In the survey sample, observations are clustered by respondent such that residuals might be independent between clusters but may not be independent within clusters. The robust Huber/White sandwich estimator is frequently used to adjust for clustering of the residuals in situations where the intra-cluster correlation coefficient is significantly greater than zero. While this approach delivers robust standard errors suitable for calculating confidence intervals, it does not render an inconsistent model (due, for example, to failure to control for respondent-specific effects) consistent [20]. The random effects model explicitly accounts for cluster-specific effects under the assumption that they are independent of other regressors (index, scale, subscale or item scores from the descriptive measure) within the range of the data. The fixed effects error components model controls for respondent-specific effects but relaxes the assumption that the cluster-specific effects are uncorrelated with other regressors. A variance partition coefficient, ρ = σ²u/(σ²u + σ²ε), where σ²u is the variance attributable to respondent-specific effects and σ²ε is the residual variance, can be obtained from the random and fixed effects models to quantify the proportion of residual variance attributable to respondent-specific effects [21]. We used the population-average model where results suggested that respondent-specific effects were quantitatively unimportant. When our results suggested the presence of quantitatively important respondent-specific effects, we chose between fixed and random effects models using Hausman's specification test [[20], p576].
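
A minimal sketch of these panel-data steps is given below, assuming a respondent identifier is available and using the linearmodels package: it estimates fixed and random effects models, computes ρ from a simple one-way variance decomposition, and forms the Hausman statistic manually from the difference in coefficients and covariance matrices. The variable names and simulated data are hypothetical; the study itself used STATA for these models.

```python
# Sketch: variance partition coefficient, fixed vs random effects, and a
# manually computed Hausman statistic (hypothetical data and names).
import numpy as np
import pandas as pd
from scipy import stats
from linearmodels.panel import PanelOLS, RandomEffects

rng = np.random.default_rng(1)
n_id, n_t = 200, 4
idx = pd.MultiIndex.from_product([range(n_id), range(n_t)], names=["id", "time"])
df = pd.DataFrame(index=idx)
u = np.repeat(rng.normal(0, 0.2, n_id), n_t)            # respondent-specific effects
df["pcs"] = rng.normal(40, 10, len(df)) + 10 * u         # correlated with the effects
df["aqol"] = 0.1 + 0.01 * df["pcs"] + u + rng.normal(0, 0.1, len(df))

fe = PanelOLS.from_formula("aqol ~ 1 + pcs + EntityEffects", data=df).fit()
re = RandomEffects.from_formula("aqol ~ 1 + pcs", data=df).fit()

# rho = sigma_u^2 / (sigma_u^2 + sigma_e^2), via a simple one-way ANOVA
# decomposition of the outcome net of the estimated slope
resid = df["aqol"] - fe.params["pcs"] * df["pcs"]
sigma2_e = resid.groupby(level="id").var(ddof=1).mean()            # within-respondent
sigma2_u = resid.groupby(level="id").mean().var(ddof=1) - sigma2_e / n_t
rho = sigma2_u / (sigma2_u + sigma2_e)
print(f"rho = {rho:.2f}")

# Hausman test comparing the slope coefficient across the two models
slopes = ["pcs"]
diff = (fe.params[slopes] - re.params[slopes]).values
v = (fe.cov.loc[slopes, slopes] - re.cov.loc[slopes, slopes]).values
h = float(diff @ np.linalg.inv(v) @ diff)
print(f"Hausman chi2 = {h:.2f}, p = {stats.chi2.sf(h, df=len(slopes)):.3f}")
# A small p-value rejects the random effects assumption that the
# respondent-specific effects are uncorrelated with the regressors.
```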

We identified the 'correct' specification within each class of algorithm using standard diagnostic tests. Following Harvey [22], the 'correctness' of each algorithm was evaluated against the criteria of parsimony, identifiability, goodness of fit, theoretical consistency and predictive power. In the present context, theoretical consistency is concerned with (a) obtaining non-negative coefficients on all items, subscales and scales (when coded so that higher item, subscale and scale scores reflect higher levels of HRQoL) and (b) restricting predicted AQoL scores to the -0.04 to 1.0 domain of the target construct. Evaluating the predictive validity of competing algorithms is much more complex than evaluating theoretical consistency but is (minimally) concerned with: (i) strength of association between predicted and observed AQoL scores in the validation sample at the individual level, (ii) deviation between predicted and observed AQoL scores at the individual level in the validation sample, and (iii) deviation between predicted and observed AQoL scores at the group level in the validation sample.

With regards to (i), the higher the strength of association, the better the algorithm is able to predict variation along the scale. Note, however, that "two measures can be perfectly correlated but have poor agreement" [[23], p977]. We might be relatively confident that a high score on the predicted AQoL scale would be mirrored by a high score on the observed AQoL scale but there is no guarantee that the two scales are compressed between the same limits. With regards to (ii), a summary measure of the deviation between predicted and observed scores at the individual level such as the mean absolute difference (MAD) indicates the average precision with which we can predict an individual's AQoL score. We calculated MADs by taking the absolute difference between predicted and observed scores for each individual, summing over all individuals, and dividing through by the total number of observations.
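
For clarity, the individual-level criteria can be computed as in the following sketch (simulated data for illustration only), which returns the MAD, the share of absolute deviations exceeding 0.10, and the correlation between predicted and observed AQoL scores.

```python
# Sketch of the individual-level validation metrics (hypothetical data).
import numpy as np

def individual_level_validity(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    abs_dev = np.abs(predicted - observed)
    return {
        "MAD": abs_dev.mean(),                         # average individual-level error
        "share_abs_dev_gt_0.10": (abs_dev > 0.10).mean(),
        "pearson_r": np.corrcoef(predicted, observed)[0, 1],
    }

# Hypothetical example
rng = np.random.default_rng(2)
observed = rng.uniform(-0.04, 1.0, 200)
predicted = np.clip(observed + rng.normal(0, 0.12, 200), -0.04, 1.0)
print(individual_level_validity(observed, predicted))
```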

While a high degree of precision in predicting AQoL scores at the individual level would imply a high level of precision with respect to other criteria, such precision might not be necessary for the sort of between-group comparisons that form the basis for estimates of both treatment effects and health-state utilities. Specifically, errors at the individual level might not translate into errors at the group level such that minimising the deviation between predicted and observed AQoL utility scores at the group level is all that is required. For the purposes of evaluating precision at the group level in the present study, we split the study sample into three sub-groups defined by stroke severity on the NIHSS (0; 1–5; and ≥ 6). While (iii) is the most relevant test of predictive validity in measuring group-level treatment effects and health-state utilities, we report findings on all three criteria to provide a more complete evaluation of the strengths and weaknesses of our transformations. We conducted the analyses reported here using SPSS 15.0 for Windows [24] and STATA/SE 8.2 for Windows [25].
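
The group-level check can be sketched as follows: partition observations by NIHSS severity and compare mean predicted with mean observed AQoL scores within each subgroup. The paired form of the t-test is an assumption here (the text does not state which form was used), and the data are simulated for illustration.

```python
# Sketch of the group-level comparison by NIHSS severity subgroup
# (hypothetical data; in practice, validation-set observations are used).
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "nihss": rng.integers(0, 20, 300),
    "aqol_obs": rng.uniform(-0.04, 1.0, 300),
})
df["aqol_pred"] = np.clip(df["aqol_obs"] + rng.normal(0, 0.12, 300), -0.04, 1.0)

severity = pd.cut(df["nihss"], bins=[-1, 0, 5, 42], labels=["0", "1-5", ">=6"])
for label, grp in df.groupby(severity):
    t, p = stats.ttest_rel(grp["aqol_pred"], grp["aqol_obs"])
    diff = grp["aqol_pred"].mean() - grp["aqol_obs"].mean()
    print(f"NIHSS {label}: mean diff = {diff:+.3f}, t = {t:.2f}, p = {p:.3f}")
```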

Results

Table 1 describes the demographic characteristics for observations (rather than respondents) and the distribution of AQoL, NIHSS, SF-36 and Barthel scores for the study sample used to derive and validate each algorithm. The mean AQoL score across all observations was 0.47 (SD = 0.34), demonstrating the vastly poorer health-related quality of life of people with stroke as compared with the population norm of 0.83 in the Australian non-institutionalised population [13]. Model fit, estimated coefficients and post-sample tests of predictive validity are summarised below for 'all stroke' and 'severity-specific' algorithms.

Table 1 Descriptive statistics on observations

Conversion of SF-36 scale scores to QALY-weights

Table 2 summarises parameter estimates and model fit for the fixed effects, scale-based SF-36 algorithm. The intra-cluster correlation coefficient for AQoL scores in the estimation sample (ICC = 0.733, 95% CI: 0.69, 0.77) suggested that some adjustment should be made for clustering by individual. Results from the fixed effects error components model confirm that a significant proportion of variation is attributable to respondent-specific effects (ρ = 0.706) and that respondent-specific fixed effects are significantly greater than zero (F = 2.85, df = (639,431), p < 0.001) [21]. The Hausman specification test for the appropriateness of the random effects estimator rejected the null hypothesis of no systematic differences between coefficients from fixed and random effects models (χ2 = 68.77, df = 3, p < 0.001), implying that the additional assumptions required by the random effects model were not met in the estimation sample.

Table 2 Regression algorithms for converting SF-36 scores into AQoL scores

Post-sample tests of predictive validity for the fixed effects, scale-based SF-36 to AQoL algorithm are reported in Table 3. Mean predicted AQoL utility scores were not significantly different from their corresponding mean observed scores in all stroke patients (t = 0.000, p = 1.000) or in the NIHSS = 1–5 subgroup (t = -0.572, p = 0.567), but the presence of significant differences in the NIHSS = 0 (t = 2.662, p = 0.008) and NIHSS ≥ 6 (t = -11.704, p < 0.001) subgroups suggests that averaging over all groups masks errors at the group level. The predictive validity of the scale-based algorithm was therefore deemed inadequate for the sort of between-group comparisons required for evaluating the effectiveness and cost-effectiveness of interventions.

Table 3 Post-sample predictive validity for 'all stroke' SF-36 to AQoL algorithms

There was also only a weak correspondence between predicted and observed scores at the individual level. For example, a high proportion (79.4%) of absolute deviations between predicted and observed scores were in excess of 0.10 on the AQoL scale. Likewise, correlations between predicted and observed AQoL utility scores in the validation sample for the all stroke (Pearson's r = 0.750), NIHSS = 0 (Pearson's r = 0.744), NIHSS = 1–5 (Pearson's r = 0.676) and NIHSS ≥ 6 (Pearson's r = 0.635) groups were on par with those reported for existing conversion algorithms [9] but were not sufficiently strong to imply that predicted AQoL scores provide an adequate proxy for directly observed AQoL scores at the individual level.

Conversion of SF-36 subscale scores to QALY-weights

Parameter estimates and model fit for the subscale-based SF-36 algorithm are reported in Table 2. Respondent-specific fixed effects were again significantly greater than zero (F = 2.01, df = (639,431), p < 0.001) and the Hausman specification test (χ2 = 39.87, df = 8, p < 0.001) again suggested that the fixed effects model most appropriately characterised respondent-specific effects. Post-sample tests of predictive validity for the subscale-based SF-36 to AQoL algorithm are reported in Table 3. Mean predicted AQoL utility scores were not significantly different from their corresponding mean observed scores in all stroke patients (t = 0.352, p = 0.725) or in the NIHSS = 0 (t = 0.418, p = 0.676) and NIHSS = 1–5 (t = -0.840, p = 0.401) subgroups. However, a significant difference between observed and predicted AQoL scores in the NIHSS ≥ 6 subgroup (t = -6.374, p < 0.001) implies that the predictive validity of the subscale-based algorithm was inadequate for between-group comparisons across the full range of stroke severity.

Partitioning the sample and running separate regressions for the NIHSS = 0–5 ('low severity') and NIHSS ≥ 6 ('moderate to high severity') subgroups produced an improvement in model fit and predictive validity. Table 4 summarises model fit and estimated coefficients for 'low severity' and 'moderate to high severity' subscale-based conversion algorithms. Table 5 summarises post-sample tests of predictive validity for these 'severity-specific' subscale-based conversion algorithms. For the 'low severity' algorithm, respondent-specific fixed effects were significantly greater than zero (F = 2.14, df = (566,364), p < 0.001) and the Hausman specification test (χ2 = 33.92, df = 10, p < 0.001) suggested that the fixed effects model most appropriately characterised respondent-specific effects. Results from random and fixed effects models (not reported here) for the 'moderate to high severity' algorithm suggest that the proportion of variance attributable to respondent-specific effects is approximately zero. Model fit and estimated coefficients for the 'moderate to high severity' algorithm are therefore drawn from the population-average model.

Table 4 Severity-specific algorithms for converting SF-36 data into AQoL scores
Table 5 Post-sample predictive validity for 'severity-specific' SF-36 to AQoL algorithms

Mean predicted AQoL utility scores were not significantly different from their corresponding mean observed scores in the NIHSS = 0 (t = 0.357, p = 0.721), NIHSS = 1–5 (t = -0.471, p = 0.638) and NIHSS ≥ 6 (t = -0.257, p = 0.798) subgroups when the 'low severity' algorithm is used to predict AQoL scores for patients in the NIHSS = 0 and NIHSS = 1–5 subgroups, and the 'moderate to high severity' algorithm is used to predict AQoL scores for patients in the NIHSS ≥ 6 subgroup. For all subgroups, the difference between mean predicted and mean observed scores was less than 0.01 on the AQoL scale – a magnitude of error that is unlikely to mask minimally important differences (MIDs) for between-group or pre-post treatment effects [26]. While the predictive validity of the subscale-based SF-36 to AQoL algorithms is now adequate for between-group comparisons, the mean absolute deviations reported in Table 5 imply that the subscale-based algorithms are not sufficiently precise for the purposes of predicting health state utilities or change scores at the individual level.

Conversion of SF-36 item scores to QALY-weights

Parameter estimates and model fit for the fixed effects, item-based SF-36 to AQoL algorithm are reported in Table 2. Respondent-specific fixed effects were again significantly greater than zero (F = 1.85, df = (640,429), p < 0.001) and the Hausman test (χ2 = 55.32, df = 10, p < 0.001) again suggested that the fixed effects model most appropriately characterised respondent-specific effects. Post-sample tests of predictive validity are reported in Table 3. Mean predicted AQoL utility scores were not significantly different at the 0.05 level from their corresponding mean observed scores in all stroke patients (t = 0.000, p = 1.000) or in the NIHSS = 0 (t = 1.036, p = 0.300) and NIHSS = 1–5 (t = -0.682, p = 0.495) subgroups. However, a significant difference between observed and predicted AQoL scores in the NIHSS ≥ 6 subgroup (t = -6.269, p < 0.001) suggests that the predictive validity of the item-based algorithm was inadequate for patients at the more severe end of the scale.

Partitioning the sample and running separate regressions for the NIHSS = 0–5 ('low severity') and NIHSS ≥ 6 ('moderate to high severity') subgroups produced an improvement in predictive validity. Results from random and fixed effects models (not reported here) for the 'moderate to high severity' algorithm suggest that the proportion of variance attributable to respondent-specific effects is approximately zero. Model fit and estimated coefficients for the 'moderate to high severity' algorithm derived in the NIHSS ≥ 6 subgroup and reported in Table 4 are therefore drawn from a group-average estimator. Table 5 summarises post-sample tests of predictive validity for 'severity-specific', item-based conversion algorithms. For the 'low severity' algorithm, respondent-specific fixed effects were significantly greater than zero (F = 2.05, df = (567,363), p < 0.001) and the Hausman test (χ2 = 46.64, df = 11, p < 0.001) suggested that the fixed effects model most appropriately characterised respondent-specific effects.

Comparison between mean predicted and mean observed AQoL utility scores by subgroup now suggests that the predictive validity of the item-based SF-36 algorithms is adequate for between-group comparisons when the 'low severity' algorithm is used to predict AQoL scores for patients in the NIHSS = 0 and NIHSS = 1–5 subgroups and the 'moderate to high severity' algorithm is used to predict AQoL scores for patients in the NIHSS ≥ 6 subgroup. Mean predicted AQoL utility scores were not significantly different from their corresponding mean observed scores in NIHSS = 0 (t = -0.185, p = 0.853), NIHSS = 1–5 (t = -0.325, p = 0.745) and NIHSS ≥ 6 (t = -0.084, p = 0.933) subgroups. The difference between mean predicted and mean observed scores was less than 0.01 on the AQoL scale for all subgroups – a magnitude of error that is unlikely to mask minimally important differences (MIDs) for between-group or pre-post treatment effects [26]. While the predictive validity of the item-based SF-36 to AQoL algorithm is now adequate for between-group comparisons, MADs in excess of 0.10 for NIHSS = 0 and NIHSS = 1–5 subgroups imply that partitioning the sample fails to remedy errors at the individual level. Item-based SF-36 algorithms therefore remain insufficiently precise for the purposes of predicting health state utilities or change scores for individual patients.

Conversion of NIHSS index and item scores to QALY-weights

The index-based NIHSS algorithm failed to reach statistical significance at the 0.05 level in the full study sample (F = 1.35, df = (2,595), p = 0.259). Partitioning the sample and running separate regressions for the NIHSS = 0–5 ('low severity') and NIHSS ≥ 6 ('moderate to high severity') subgroups produced an improvement in model fit and predictive validity for index-based NIHSS algorithms. Parameter estimates and model fit for the index-based NIHSS 'all stroke' and 'severity-specific' algorithms are given in Table 6. The Hausman test suggested that the fixed effects model most appropriately characterised respondent-specific effects in the NIHSS = 0 and NIHSS = 1–5 (χ2 = 49.53, df = 2, p < 0.001) subgroups whereas the additional assumptions required for the random effects model were met in the NIHSS ≥ 6 subgroup (χ2 = 0.83, df = 2, p = 0.660).

Table 6 Regression algorithms for converting NIHSS data into AQoL scores

For the item-based NIHSS algorithms, the Hausman test suggested that the fixed effects model most appropriately characterised respondent-specific effects for the all stroke (χ2 = 40.24, df = 2, p < 0.001), NIHSS = 0–5 (χ2 = 23.82, df = 2, p < 0.001) and NIHSS ≥ 6 (χ2 = 76.61, df = 9, p < 0.001) algorithms. With the exception of predictions for the NIHSS ≥ 6 subgroup from the 'moderate to high severity' algorithm, mean predicted AQoL utility scores from item- and index-based NIHSS algorithms were always significantly different from their corresponding mean observed scores. For example, predicted and observed AQoL scores from the index-based NIHSS algorithm were significantly different from one another for the NIHSS = 0 (t = 6.084, p < 0.001) and NIHSS = 1–5 (t = -5.732, p < 0.001) groups but not for the NIHSS ≥ 6 group (t = 1.018, p = 0.309). None of the NIHSS-based algorithms can therefore be said to predict AQoL group means with sufficient precision for the purposes of evaluating the effectiveness and cost-effectiveness of interventions. Moreover, MADs for the NIHSS algorithms reported in Table 7 are never lower than 0.120 and as high as 0.307 for some subgroups, nearly one third of the AQoL scale and considerably higher than the mean absolute deviations for the subscale- and item-based SF-36 algorithms reported in Tables 3 and 5. These results suggest that the NIHSS algorithms derived in the present study yielded predicted AQoL scores with such poor correspondence to observed scores that they should not be used for any purpose.

Table 7 Post-sample predictive validity for NIHSS 'all stroke' & 'severity-specific' algorithms

Conversion of Barthel index and item scores to QALY-weights

Parameter estimates and model fit for the index- and item-based 'all stroke' Barthel algorithms are given in Table 8. Post-sample tests of predictive validity for the index- and item-based 'all stroke' Barthel algorithms are reported in Table 9. Neither the index- nor the item-based 'all stroke' Barthel algorithm provided sufficient predictive power for the purposes of economic evaluation. Mean predicted AQoL utility scores from both the item- and index-based 'all stroke' Barthel algorithms were significantly different from their corresponding mean observed scores in at least one subgroup. Specifically, predicted and observed AQoL scores were significantly different for the index-based (t ≥ 3.063, p ≤ 0.002) and item-based (t ≥ 3.056, p ≤ 0.002) 'all stroke' Barthel algorithms in the NIHSS = 0 and NIHSS ≥ 6 subgroups.

Table 8 Regression algorithms for converting Barthel data to AQoL scores
Table 9 Post-sample predictive validity for Barthel 'all stroke' algorithms

Partitioning the sample and running separate regressions for the NIHSS = 0–5 ('low severity') and NIHSS ≥ 6 ('moderate to high severity') subgroups produced an improvement in model fit and predictive validity for both index- and item-based Barthel algorithms. Parameter estimates and model fit for the index- and item-based 'severity-specific' Barthel algorithms are given in Table 8. Post-sample tests of predictive validity for the index- and item-based 'severity-specific' Barthel algorithms are reported in Table 10. Despite these improvements, comparison between mean predicted and mean observed AQoL utility scores implies that the predictive validity of the index- and item-based Barthel algorithms remains inadequate for the purposes of economic evaluation across the full range of stroke severity. Even when the 'low severity' algorithm was used to predict AQoL scores for the NIHSS = 0 and NIHSS = 1–5 subgroups and the 'moderate to high severity' algorithm was used to predict AQoL scores for the NIHSS ≥ 6 subgroup, predicted and observed AQoL scores were significantly different for the item-based Barthel algorithm in the NIHSS = 0 (t = 2.040, p = 0.041) and NIHSS = 1–5 (t = -2.625, p = 0.009) subgroups but not in the NIHSS ≥ 6 subgroup (t = -0.360, p = 0.719).

Table 10 Post-sample predictive validity for Barthel 'severity-specific' algorithms

While mean predicted AQoL utility scores from the index-based severity-specific Barthel algorithms were not significantly different from their corresponding mean observed scores at the 0.05 level in the NIHSS = 0 (t = 1.578, p = 0.115), NIHSS = 1–5 (t = -1.840, p = 0.066) and NIHSS ≥ 6 (t = -0.360, p = 0.719) subgroups, differences approaching clinical significance were observed for the NIHSS = 1–5 subgroup. The difference between mean predicted and mean observed scores in the NIHSS = 1–5 subgroup approached 0.04 (95% CI: 0.00–0.08) – a magnitude of error that could potentially mask between-group or pre-post treatment effects. While there may be circumstances where the expected treatment effects from stroke interventions are detectable even in the presence of upper bound errors associated with predicted scores, the Barthel algorithm described above will not always produce 'conservative' estimates. Note, for example, that the item-based algorithms underestimate the mean observed AQoL score for the NIHSS = 0 subgroup but provide an overestimate for the NIHSS = 1–5 subgroup. Where conversion algorithms have the potential to make interventions appear more cost-effective and push borderline interventions under the funding threshold, the use of predicted scores from such algorithms is unlikely to be acceptable to decision-makers.

Discussion

Previous applications of the TTU approach have demonstrated the feasibility and value of regression-based transformations for deriving QALY-weights from generic descriptive measures of health and HRQoL [6–10]. For example, a number of generic to generic transformations from the SF-36/-12 to preference-based measures have recently been validated in a sample of patients at risk of stroke [27] and in post-stroke patients [28].

Pickard et al. [28] predicted QALY-weights by applying patient-level data to the Brazier et al. [29] SF-36-based SF6D algorithm, the Brazier and Roberts [30] SF-12-based SF6D algorithm and several of the SF-36/-12-based TTU regression algorithms [6–8, 31–34] reviewed elsewhere [9]. The study sample for the Pickard et al. [28] validation study included 81 of the 124 patients with confirmed ischaemic stroke enrolled in a longitudinal study of post-stroke HRQoL [35] for whom observations were available on the SF-36 at baseline and follow-up. While Pickard et al. did not provide a comparison between predicted and observed QALY-weights, their comparison of incremental cost-utility ratios (ICURs) derived using different conversion algorithms provided a test of convergent validity. Pickard et al. reported a three-fold difference in ICURs derived from different algorithms and concluded that "...the choice of algorithm could determine whether the intervention is considered cost-effective or unacceptable" (p6).

Kaplan et al. [27] derived QALY-weights from patient-level SF-36 data using the Brazier et al. [36] SF-36-based SF6D and the Fryback [7] and Nichol [33] SF-36/-12-based TTU regression algorithms. The study sample for the Kaplan et al. [27] validation study included 294 patients at risk of stroke from the Quality of Life in Stroke Prevention (QLASP) study. Kaplan et al. [27] reported a strong correlation between predicted QALY-weights from the Brazier [36], Fryback [7] and Nichol [33] algorithms but a sometimes modest correlation between predicted and observed QALY-weights. Kaplan et al. [27] concluded that conversion algorithms produced comparable, but not interchangeable results.

Against the background of this previous research, we have conducted the first study to derive and validate conversion algorithms in a sample of stroke patients for multiple stroke-relevant outcome measures. Our findings can be summarised as follows. For the item- and subscale-based SF-36 algorithms, differences between mean predicted and mean observed AQoL scores were neither clinically nor statistically significant when the 'low severity' algorithm was used to predict AQoL scores for patients in the NIHSS = 0 and NIHSS = 1–5 subgroups and the 'moderate to high severity' algorithm was used to predict AQoL scores for patients in the NIHSS ≥ 6 subgroup. Model fit and predictive power for our final generic (SF-36) to generic (AQoL) regression-based transformation were superior when compared to TTU regressions included in previous validation studies conducted in stroke patients [27, 28]. The superior explanatory power of our transformations may be attributable to a better correspondence between the coverage of the SF-36 and the AQoL than between the SF-36 and other preference-based measures such as the EQ5D, HUI2/3 or the QWB. Hawthorne, Richardson and Day [13] concluded that coverage of the HRQoL universe was poor for the QWB but good or very good for the HUI2 and AQoL. It might also be the case that a lower noise (random variation) to signal (systematic variation) ratio in the AQoL, as compared to the HUI2 or QWB, increases the share of variation that can be explained, simply because there is less random error to be discarded as a residual. Whatever the reason, our findings suggest that the predictive validity of our severity-specific item-based and subscale-based SF-36 to AQoL algorithms is more than adequate for evaluating the relative effectiveness and cost-effectiveness of stroke interventions.

With regards to our disease-specific to generic transformations, the difference between mean predicted and mean observed AQoL scores from the NIHSS algorithms reached clinical and statistical significance in at least one subgroup for all models. The relatively poor predictive power of our NIHSS to AQoL transformations is not surprising given the differences in sensitivity and coverage between the NIHSS and the AQoL. Transformation of the NIHSS scale to the AQoL requires mapping from a detailed description of a relatively narrow area of HRQoL space to a much more general description covering multiple dimensions of HRQoL. Variation in AQoL scores for stroke patients might arise due to variation in emotional well-being, physical senses, self-care, household tasks and/or mobility, such that it is difficult to see how the NIHSS could closely approximate stroke outcomes along the AQoL scale. For disease-specific measures that are designed to provide a detailed picture of only one of several potentially relevant dimensions, or that cover different dimensions than the preference-based 'target' instrument, TTU regression is unlikely to provide a satisfactory transformation.

For the 'moderate to severe' index- and item-based Barthel to AQoL algorithm, differences between mean predicted and mean observed AQoL scores were neither clinically nor statistically significant for patients in the NIHSS ≥ 6 subgroup. While the 'severity-specific' Barthel to AQoL algorithms therefore represent a substantial improvement on the NIHSS to AQoL algorithms, it remains the case that differences between predicted and observed AQoL scores from the Barthel algorithms reached levels that could potentially mask minimally important differences over some segments of the severity scale. When the low-severity index-based Barthel algorithm was used to predict AQoL scores for the NIHSS = 1–5 subgroup, the difference between mean predicted and mean observed scores approached 0.04 (95%CI:0.00–0.08) – a magnitude of error that could be considered clinically significant and potentially unacceptable to decision-makers. Analysts and policy-makers should therefore exercise caution when using predicted scores from our severity-specific Barthel to AQoL algorithms in samples that include low severity patients. The predictive validity of our moderate to severe Barthel to AQoL algorithm should, however, be adequate for the purposes of evaluating the relative effectiveness and cost-effectiveness of stroke interventions in patients with moderate to severe stroke severity.

While the predictive validity of several of the regression-based mappings described above appears to be acceptable for predicting between-group differences, our findings are subject to a number of limitations. It should, for example, be emphasised that none of our mappings were deemed suitable for the purposes of predicting health state utilities or change scores at the individual level. To the extent that the coverage and sensitivity of the descriptive and preference-based measures diverge, residual error (potentially precluding the sort of precision required for prediction at the individual level) is unavoidable in a 'self-contained' mapping that permits SF-36, Barthel or NIHSS data to be converted to AQoL utility scores without relying on additional data that may or may not be available in a particular application.

It should also be emphasised that use of our severity-specific algorithms requires some means of distinguishing 'low severity' patients (whose AQoL scores are most appropriately estimated using the 'low severity' algorithms) from 'moderate to high' severity patients (whose AQoL scores are most appropriately estimated using the 'moderate to high severity' algorithms). During estimation, we used the NIHSS to partition the sample into 'low' and 'moderate to high' severity subgroups and the end-user could make similar reference to NIHSS scores in assigning patients or samples to the appropriate algorithm, as illustrated in the sketch below. This is, of course, contingent on the availability of NIHSS data to the end-user in the relevant dataset. It could therefore be argued that using the SF-36 rather than the NIHSS to partition the sample during estimation would have made the severity-specific SF-36 to AQoL algorithms more useful and less reliant on additional data. Likewise, it could be argued that using the Barthel rather than the NIHSS to partition the relevant estimation sample would have made the severity-specific Barthel algorithms more 'self-contained'. Such arguments would carry particular weight where the derived transformation algorithms are intended for use across multiple conditions. This is not, however, the case in the present study, where the intention was to derive algorithms specifically designed for use in stroke. Given the available data, the NIHSS provided a convenient way of identifying clinically distinct groups of patients, but it should also be possible to identify low severity and moderate to high severity stroke patients based on clinical assessment (rather than relying on the availability of NIHSS data). Further validation studies will, however, be required to confirm that our 'severity-specific' algorithms are applicable in samples partitioned using clinical assessment.
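
As an illustration of how an end-user might route records to the appropriate severity-specific algorithm, the sketch below dispatches on the NIHSS score and truncates predictions to the AQoL domain. The coefficients are placeholders rather than the published values in Tables 4 to 8, and the inputs shown correspond to a scale-based SF-36 algorithm purely for illustration.

```python
# Sketch of NIHSS-based routing to a severity-specific algorithm.
# The coefficient dictionaries are PLACEHOLDERS, not the published
# coefficients; they only illustrate the dispatch logic described above.
import numpy as np

LOW_SEVERITY = {"intercept": 0.20, "pcs": 0.008, "mcs": 0.006}   # hypothetical
MOD_HIGH_SEVERITY = {"intercept": 0.05, "pcs": 0.007, "mcs": 0.004}   # hypothetical

def predict_aqol(pcs, mcs, nihss):
    """Route to the low-severity algorithm when NIHSS <= 5, otherwise to the
    moderate-to-high severity algorithm, and keep the prediction inside the
    AQoL domain of -0.04 to 1.00."""
    coef = LOW_SEVERITY if nihss <= 5 else MOD_HIGH_SEVERITY
    pred = coef["intercept"] + coef["pcs"] * pcs + coef["mcs"] * mcs
    return float(np.clip(pred, -0.04, 1.0))

print(predict_aqol(pcs=45, mcs=52, nihss=2))   # low severity routing
print(predict_aqol(pcs=30, mcs=40, nihss=9))   # moderate to high severity routing
```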

For the present study, we chose between fixed and random effects models using a Hausman specification test [[20], p576], with fixed effects frequently identified as our preferred model. However, it is sometimes argued that the random effects model is to be preferred whenever results will be used to draw inferences regarding the distribution of a wider population [37]. Greene [20] offers a different perspective, noting that arguments in favour of fixed or random effects frequently fail to provide unambiguous guidance, and concludes that the choice between fixed and random effects should instead be driven by the data. Specifically, the random effects model treats the cluster-specific effects as uncorrelated with other regressors and, where this assumption is not supported by the data, the random effects model will suffer from inconsistency due to omitted variables and should be rejected [20]. In this context, it is worth emphasising that interpretation of our findings should respect the assumptions and limitations of the models used in estimation.

Finally, it should be emphasised that the models estimated in the present study are not intended for application in non-stroke populations. The weight attached to each item, subscale or scale entering each of our conversion algorithms reflects the covariance in our data between AQoL health states and Barthel, NIHSS or SF-36 health states. Because this covariance is likely to be quite different in stroke than in other disease-areas or the general population, our conversion algorithms may not be applicable to non-stroke populations. More generally, our findings are contingent upon the characteristics of our study population and on the coverage and sensitivity of the descriptive and preference-based measures used to generate our conversion algorithms. Note, for example, that our findings regarding the feasibility and value of TTU regression in stroke-specific outcome measures might not be generalisable to all condition-specific measures in all disease-areas. Likewise, because our transformations were derived and validated in a sample of stroke patients with a mean age exceeding 70 years, they cannot be assumed valid for the purposes of predicting QALY-weights in children with stroke.

Despite these limitations, the conversion algorithms reported here represent an improvement on the regression-based conversion algorithms that have previously been validated for use in stroke [27, 28]. Moreover, our derivation of a Barthel to AQoL transformation for moderate to severe stroke widens the set of descriptive stroke-specific measures that can be transformed to obtain preference-based outcomes suitable for use in economic evaluation. The present study therefore adds additional tools to the analyst's tool-box, increasing the chances that an appropriate tool will be available for the job at hand. Findings from the present study also provide a unique insight into the feasibility and value of TTU regression in stroke-specific outcome measures such as the Barthel and NIHSS, highlighting the necessity of some minimal correspondence between the condition-specific 'base' measure and the preference-based 'target' with respect to coverage and sensitivity.

Conclusion

Our findings suggest that TTU regression can provide a useful second-best approach for deriving QALY-weights associated with stroke disease-states. While the NIHSS to AQoL transformations proved unsuitable for most applications, transformations from the SF-36 and Barthel to the AQoL provided sufficient predictive power to suggest that stroke-relevant outcomes can be transformed to preference-based measures for the purposes of economic evaluation. While a number of generic to generic transformations from the SF-36 to preference-based outcome measures are now available, the SF-36 to AQoL transformations reported here are the only published transformations to have been derived and validated in a sample of stroke patients [9]. Moreover, our derivation of a Barthel to AQoL transformation for moderate to severe stroke widens the set of descriptive stroke-specific measures that can be transformed for use in economic evaluation. Our findings also suggest that attempts to derive regression-based algorithms from stroke-specific descriptive measures such as the NIHSS to generic preference-based measures such as the AQoL will sometimes be frustrated by a lack of correspondence in the sensitivity and/or coverage of 'base' and 'target' instruments.

The implications for practice are two-fold. First, it is anticipated that our transformations will prove to be a valuable tool for analysts and should allow the best use to be made of the available data, improving the quality and policy-relevance of economic evaluations for stroke interventions. Second, improvements in the economic evaluation of stroke interventions should allow clinicians and policy-makers to make better decisions, potentially saving money and improving patient outcomes. Our findings also have a number of implications for research. First, researchers may wish to take account of the feasibility of TTU regression in certain condition-specific measures (but not in others) when selecting descriptive outcome measures for inclusion in clinical trials. Such considerations will be particularly important where resource constraints or patient burden preclude the direct observation of preference-based measures in the trial population. Second, researchers attempting to derive their own regression-based transformations for other descriptive measures should take particular note of the improvements in predictive validity that we were able to obtain by deriving separate transformations for clinically distinct subgroups of patients. Finally, our findings suggest that validity in predicting group-wise differences will not always translate to validity in predicting health state utilities or change scores for individual patients. Researchers responsible for the derivation of regression-based transformations might therefore wish to provide guidelines for end-users to ensure use consistent with the validation data.