Introduction

Patients with cancer often face invasive treatments that have a temporary or permanent impact on appearance. They may have to deal, for example, with scars or amputated body parts following surgery, skin burns due to radiation therapy, or hair loss due to chemotherapy. These appearance changes can negatively affect body image. Body image is a multi-dimensional construct that comprises cognitive, behavioral, and affective aspects of appearance [1]. For instance, altered body appearance after cancer treatment can be accompanied by feelings of shame, negative self-esteem, or social avoidance [2, 3]. For some patients, negative aspects of body image are persistent, remain prevalent years after treatment [4, 5], and can negatively impact quality of life. Therefore, body image is considered an essential factor of health-related quality of life (HRQOL) in cancer patients [6, 7]. Monitoring HRQOL (including body image) in clinical practice is important to identify patients who may benefit from supportive care, and patient-reported outcome measures (PROMs) are often used for that purpose [8, 9].

The Body Image Scale (BIS) is a PROM developed to measure body image in all types of cancer patients. This is in contrast to other PROMs that aim to measure body image in non-cancer populations (e.g., Appearance Schemas Inventory-Revised (ASI-R)) [10] or in cancer patients with specific types of cancer or treatment (e.g., Breast Impact of Treatment Scale (BITS) in breast cancer patients, Sexual Adjustment and Body Image Scale (SABIS-g) in gynecologic cancer patients, and Body Image Screener for Cancer Reconstruction (BICR) for patients after breast reconstruction) [11,12,13]. The initial development and validation study of the BIS showed good measurement properties concerning internal consistency, known-group comparison, and responsiveness among English-speaking breast cancer patients [14]. Since then, the BIS has been validated in several other languages such as Dutch, Greek, and Portuguese [15,16,17] and across diverse cancer populations, e.g., in advanced cancer patients and colorectal cancer patients [18, 19]. Recently, Muzzatti et al. (2017) presented a review of PROMs measuring body image in cancer patients, including the BIS, and concluded that the measurement properties of these PROMs require more thorough investigation [20]. With respect to the BIS specifically, they concluded that the measurement properties were adequate, except for inconsistent results regarding structural validity and a lack of evidence for criterion validity. However, not all measurement properties were taken into account (measurement error and responsiveness were not addressed). Moreover, no guideline was used to interpret results, and the methodological quality of the extracted studies was not assessed. Therefore, the aim of the current study was to conduct a systematic review specifically focusing on the measurement properties of the BIS in cancer patients, following the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology.

The COSMIN methodology is based on taxonomy and definitions of measurement properties for PROMs [21] including content validity, structural validity, internal consistency, cross-cultural validity, reliability, measurement error, criterion validity, hypotheses testing for construct validity, and responsiveness. The current study will add important information to the previous review [20], which is of high importance when considering the use of the BIS in clinical trials and practice as well as for interpretation of BIS outcomes.

Methods

The Body Image Scale

The 10-item Body Image Scale was developed by Hopwood et al. in 2001 to measure affective, behavioral, and cognitive body image symptoms. Patients can indicate body image symptoms on a 4-point scale (0 “not at all” to 3 “very much”). The total score ranges from 0 to 30 and can be calculated by summing up the 10 items. A higher score means a higher level of body image disturbance [14].
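The scoring rule described above can be sketched in a few lines of Python. The function name `bis_total_score` and the input checks are illustrative assumptions, not part of the instrument's documentation. Handling of missing items is not specified here, so this sketch simply rejects incomplete or out-of-range responses.

```python
def bis_total_score(item_scores):
    """Sum the 10 BIS items (each scored 0-3) into a total score of 0-30.

    Hypothetical helper for illustration; a higher total means a higher
    level of body image disturbance.
    """
    if len(item_scores) != 10:
        raise ValueError("the BIS has exactly 10 items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item must be scored 0, 1, 2, or 3")
    return sum(item_scores)
```

For example, a patient answering "not at all" (0) on every item scores 0, and a patient answering "very much" (3) on every item scores 30.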

Literature search strategy

This study was part of a larger systematic review (Prospero ID 42017057237) [22], investigating the validity of 39 PROMs measuring quality of life of cancer survivors included in an eHealth application called “Oncokompas” [23,24,25]. Before the actual search, a search for reviews and meta-analyses of the measurement properties of each of the 39 PROMs was performed. This search did not yield any relevant results for the BIS.

The databases Embase, MEDLINE, PsycINFO, and Web of Science were systematically searched for publications directly investigating aspects of measurement properties of the BIS. Search terms were the measurement instrument’s name and its acronym, combined with search terms (text words and key words) for cancer, and a precise filter for measurement properties (Appendix A) [26]. The search was performed in July 2016 and updated in July 2017 to identify new publications. Search results were checked for duplicates.

Inclusion and exclusion criteria

Studies were included if they reported original data on at least one measurement property of the BIS as defined in the COSMIN taxonomy [21]. Validation studies of other PROMs that reported original data on the BIS (as comparison instrument) were also included. The COSMIN taxonomy [21] distinguishes nine measurement properties for PROMs: (1) structural validity (degree to which scores of a PROM are an adequate reflection of the dimensionality of the construct to be measured), (2) internal consistency (degree of interrelatedness among items), (3) reliability (the extent to which scores for patients who have not changed are the same for repeated measurement under several conditions), (4) measurement error (systematic and random error of a patient’s score that is not attributed to true changes in the construct to be measured), (5) hypothesis testing for construct validity (degree to which the scores are consistent with hypotheses on known-groups comparison, and on relations to scores of other PROMs (convergent and divergent validity)), (6) criterion validity (degree to which the scores are an adequate reflection of a gold standard), (7) responsiveness (the ability of a PROM to detect change over time in the construct to be measured), (8) cross-cultural validity (degree to which the performance of the items on a translated or culturally adapted PROM is an adequate reflection of the performance of the items of the original version), and (9) content validity (degree to which the content of a PROM is an adequate reflection of the construct to be measured). In the present review, we did not evaluate content validity because no protocol existed to evaluate this measurement property.

We excluded conference proceedings, studies without full text available, publications in languages other than English, and studies that investigated populations without cancer. Full-text publications were reviewed by two independent raters (KN and FJ). Disagreements regarding inclusion and exclusion were discussed until consensus was reached.

Data extraction

Two independent extractors (KN and FJ) who identified eligible studies extracted information on each of the measurement properties defined in the COSMIN taxonomy [21]. Relevant data included the study population, sample size, the method, information on missing values, type of measurement property, and its outcome. Disagreements were discussed until consensus was reached.

Data analyses

Data analyses were performed in three steps to accomplish adequate interpretation of the results, following the COSMIN methodology [27].

First, we rated the methodological quality of the included studies, based on the COSMIN checklist for assessing the methodological quality of studies on measurement properties [28]. Methodological aspects regarding design requirements and preferred statistical methods, specific to the measurement properties under consideration were rated on a 4-point scale: “excellent,” “good,” “fair,” or “poor.” In accordance with COSMIN recommendations, overall methodological quality per measurement property of the BIS was obtained by taking the lowest rating of any of the methodological aspects assessed [29].
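The "lowest rating counts" rule from this first step can be illustrated with a minimal sketch. The names `RATING_ORDER` and `overall_quality` are hypothetical and chosen only for this example; the actual checklist items are defined in the COSMIN checklist [28].

```python
# Ratings ordered from worst to best, as used on the 4-point scale above.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def overall_quality(aspect_ratings):
    """Return the lowest rating among the methodological aspects assessed."""
    if not aspect_ratings:
        raise ValueError("at least one methodological aspect must be rated")
    # list.index raises ValueError for a rating outside the 4-point scale.
    return min(aspect_ratings, key=RATING_ORDER.index)
```

Under this rule, a study rated "excellent" on design and "good" on statistics but "fair" on handling of missing items would receive an overall rating of "fair" for that measurement property.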

Second, criteria for good measurement properties were applied to the results of the included studies, following the COSMIN guidelines for systematic reviews of PROMs [27, 30]. Each measurement property in each individual study was rated as “sufficient” (+), “insufficient” (−), or “indeterminate” (?). For example, hypothesis testing for construct validity is rated as “sufficient” if at least 75% of the results are in accordance with the hypotheses. These results were qualitatively summarized to obtain an overall rating of the measurement property across all included studies: sufficient (+), insufficient (−), “inconsistent” (±) or indeterminate (?). If all studies indicated sufficient or insufficient results, the overall rating was accordingly. If there were inconsistencies between studies, explanations were explored. If no explanations were found, the overall rating would be inconsistent. The overall rating would be indeterminate if not enough information was available [27].
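The second step can be sketched as two small functions, under stated assumptions. The function names are hypothetical, the ASCII strings "-" and "+/-" stand in for the symbols (−) and (±), and the handling of indeterminate studies (set aside before summarizing) is an assumption of this sketch rather than a rule quoted from the COSMIN guideline. The sketch also cannot reproduce the reviewers' qualitative search for explanations of inconsistent findings; it only flags where that judgment is needed.

```python
def rate_hypothesis_testing(n_confirmed, n_hypotheses):
    """Rate one study: '+' if at least 75% of results match the hypotheses."""
    if n_hypotheses == 0:
        return "?"  # assumption: nothing to test means indeterminate
    return "+" if n_confirmed / n_hypotheses >= 0.75 else "-"


def overall_rating(study_ratings):
    """Summarize per-study ratings ('+', '-', '?') into one overall rating."""
    # Assumption: indeterminate studies carry no information and are set aside.
    informative = {r for r in study_ratings if r in ("+", "-")}
    if not informative:
        return "?"
    if len(informative) == 1:
        return informative.pop()
    # Mixed results: inconsistent unless reviewers can explain the discrepancy.
    return "+/-"
```

A study confirming three of four hypotheses would be rated sufficient, and a set of studies with both sufficient and insufficient ratings would be flagged as inconsistent pending further exploration.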

In the third step, this overall rating of evidence was supplemented by a level of quality of the evidence, using a modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach from the COSMIN methodology to grade the confidence in the total body of evidence available for the measurement properties [27]. Quality of the evidence was graded as high, moderate, low, or very low. This grade was based on (i) risk of bias, (ii) indirectness, (iii) inconsistency of results, and (iv) imprecision of studies. Each study was rated by a single rater (HM), whose ratings were checked by a second independent rater (KN). Discrepancies in ratings were discussed until consensus was reached.

Results

Search results

In total, 980 non-duplicate abstracts were screened, of which 208 abstracts concerned the BIS. The 2017 search update resulted in 16 extra abstracts on the BIS. Having applied inclusion and exclusion criteria, 177 studies were excluded after title/abstract screening. Of the remaining 47 studies, 37 were excluded after full-text screening and one was excluded during data extraction. In total, we included nine studies that investigated measurement properties of the BIS in cancer patients (see Fig. 1).

Fig. 1 Flow diagram of the systematic search

Study characteristics

Table 1 summarizes the characteristics of the included studies. One study described the development and validation of the BIS in English [14]. Six studies examined validity of the translated BIS in other languages (Greek, Spanish, Korean, Portuguese, Dutch, and Turkish) [15,16,17, 31,32,33]. In one study, screening of body image in patients with advanced cancer (locally advanced, recurrent, or metastatic) was specifically the focus [18]. One study validated the BIS in colorectal cancer patients undergoing surgery [19]. The study populations were breast cancer patients [14, 17, 33], colorectal cancer patients [19], patients with an ostomy (included because 82% of the population were cancer patients) [32], or a mixed cancer population (including breast, gynecological, gastro-intestinal, genitourinary, head and neck, hematologic, and respiratory cancer) [18, 31]. We report on the results based on data extracted from nine studies addressing structural validity, internal consistency, reliability, hypothesis testing for construct validity, and responsiveness. Although none of the studies reported on measurement error, this could be calculated for three studies. None of the studies presented results on cross-cultural validity or criterion validity.

Table 1 Characteristics of included studies

Measurement properties

Structural validity

In total, seven studies examined structural validity using exploratory factor analyses (EFA) [14, 16, 17, 19, 31,32,33] and three studies performed an additional confirmatory factor analysis (CFA) [16, 31, 32] (Table 2).

Table 2 Structural validity of the BIS

Two studies of excellent [14] and good [33] quality concluded that, over the total study sample, the BIS has a one-factor solution. In subgroup analyses, a two-factor structure was found among breast cancer patients after mastectomy [14] and breast cancer patients after surgery with immediate breast reconstruction [33]. Three fair quality studies also reported a one-factor solution [17, 31, 32] and one fair quality study reported a two-factor solution [16] among breast cancer patients after breast-conserving surgery (BCS) or mastectomy. In the poor quality study [19], a multi-trait item analysis was performed.

Based on these findings, structural validity of the BIS overall was rated sufficient (+) because two studies of at least good quality and three studies of fair quality support unidimensionality of the scale. It should be noted that in some studies, a two-factor solution was also found. The quality of evidence of structural validity was graded as moderate due to inconsistent findings.

Internal consistency

All nine included studies reported on internal consistency using Cronbach’s alpha (α) (Table 3). In the excellent and good quality studies, values ranged from α = 0.86 to 0.96 [14, 15, 19, 33]. These results are sufficient for internal consistency (α ≥ 0.70 and ≤ 0.95) [27], although in one mastectomy subgroup, a value of α = 0.96 was presented, which might reflect overlap of items within the scale. Five studies had fair methodological quality since missing items were not described. Of these studies, four showed sufficient internal consistency [16,17,18, 32] and one [31] showed insufficient results because of values of α = 0.97 in all subgroups.

Table 3 Internal consistency (Cronbach’s α) of the BIS

Based on these findings, internal consistency of the BIS overall was rated as sufficient (+) and the quality of evidence of internal consistency was graded as moderate because there is moderate evidence for the unidimensionality of the scale.
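For readers less familiar with the statistic, Cronbach’s α can be computed from item-level responses as sketched below. This is the standard textbook formula, not code from any of the included studies; the function names and the data layout (one list of item scores per respondent) are assumptions made for illustration. The range check simply restates the criterion quoted above (α ≥ 0.70 and ≤ 0.95).

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    item_columns = list(zip(*responses))  # transpose to one tuple per item
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


def alpha_is_sufficient(alpha):
    """Apply the criterion used in this review: 0.70 <= alpha <= 0.95."""
    return 0.70 <= alpha <= 0.95
```

Values above the upper bound are not counted as sufficient under this criterion because they may indicate redundant items.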

Reliability

Four studies examined test-retest reliability. The good and fair quality studies reported values of r = 0.92 [15] and r = 0.85 [32], indicating sufficient results. Two studies had poor quality and therefore indeterminate results because the time interval was considered too long (6 months compared to 2 weeks in the other studies) [33] and because of a small sample size (n = 19) [19], reporting values of ICC = 0.67 and r = 0.89, respectively. The low value of 0.67 may be an underestimation of the true reliability because of the long time interval. Hence, reliability of the BIS overall was rated as sufficient (+). The quality of evidence of reliability was graded as moderate because three out of four studies reported Pearson/Spearman’s correlation coefficients [15, 32, 33], while an intraclass correlation coefficient (ICC) would have been more appropriate.

Measurement error

Although measurement error was not reported in the included studies, we were able to calculate the standard error of measurement (SEM) and the smallest detectable change (SDC) in three studies reporting reliability data and standard deviations. Two studies of good [15] and fair quality (n = 40) [32] had an SDC of 4.7 (SEM = 1.7) and 9.1 (SEM = 3.3), respectively. The study rated as poor quality because of the large time interval between the measurements had an SDC of 11.1 (SEM = 4.0) [33]. Interpretation of measurement error is only possible if an SDC score is compared with data on minimal important change (MIC), but this was not reported. Based on these findings, measurement error of the BIS overall was rated as indeterminate (?).
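The calculation can be sketched as follows, assuming the conventional formulas SEM = SD × √(1 − reliability) and SDC = 1.96 × √2 × SEM. The review does not spell out the formulas used, so this is an assumption, although the reported SEM and SDC pairs are consistent with the second formula. The function names are hypothetical.

```python
import math


def sem_from_reliability(sd, reliability):
    """Standard error of measurement from a score SD and a reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def smallest_detectable_change(sem):
    """Smallest change in an individual score beyond measurement error (95% level)."""
    return 1.96 * math.sqrt(2.0) * sem
```

With the SEM values reported above, `smallest_detectable_change(1.7)`, `smallest_detectable_change(3.3)`, and `smallest_detectable_change(4.0)` round to 4.7, 9.1, and 11.1, matching the SDC values in the text.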

Hypothesis testing for construct validity

Known-groups comparison

Eight studies performed known-group comparisons (Table 4). No a priori hypotheses were formulated in four studies [15, 17, 18, 33], and in those cases, we assumed the hypothesis would be that BIS scores are higher (worse) (1) in patients who were treated with a mastectomy compared to patients treated with BCS [34] or breast reconstruction [35], (2) in younger patients compared to older patients [36], (3) in patients with a longer time since treatment [37], and (4) in patients with a stoma vs. without a stoma [38]. Two studies with good quality confirmed their hypotheses [14, 19]. Out of five studies with fair quality [15,16,17, 31, 33], two studies confirmed the hypotheses [15, 16]. One study had a poor quality [18] because no a priori hypotheses were formulated.

Table 4 Known-group comparison and convergent validity of the BIS

Convergent and divergent validity

Six studies reported on convergent validity with other body image-related instruments, psychological function, or HRQOL scales (Table 4). One good quality study [19] showed moderate correlation (r = 0.40 to 0.60) with a related construct but failed to confirm their hypotheses on three other constructs, indicating insufficient convergent validity. One study of fair quality [31] found moderate and high correlations (r > 0.60) with related constructs, indicating sufficient convergent validity. However, three other fair quality studies [16, 17, 33] presented low correlations (r < 0.40) with most of the related constructs, indicating insufficient convergent validity. The poor quality study did not formulate a hypothesis a priori [18]. None of the studies in this review examined divergent validity.

Based on these findings, hypothesis testing for construct validity was rated as inconsistent (±): although three studies showed sufficient evidence (> 75% of the hypotheses on known-groups and/or convergent validity confirmed) [14, 15, 31], this was contradicted by four studies showing insufficient evidence [16, 17, 19, 33]. Moreover, studies reported inconsistent results in comparisons with the same instruments (ASI-R and RSES) [17, 18, 33]. For this reason, and due to the lack of clearly stated a priori hypotheses, quality of evidence of construct validity was graded as low.

Responsiveness

Two studies reported on responsiveness. One study of good quality [14] found a significant increase in body image disturbance for the overall sample (n = 55) and for the BCS and mastectomy subgroups 2 weeks to 4 months postoperatively, indicating sufficient responsiveness. The other study had poor quality [19] because of a small sample size (n = 17) and found no change in BIS scores from before to after surgical treatment. Based on these findings, responsiveness of the BIS was rated as indeterminate (?). An overall summary of the results for every measurement property of the BIS is shown in Table 5.

Table 5 Overall rating of the results and levels of evidence of the BIS

Discussion

This systematic review evaluated the measurement properties of the BIS among nine studies identified in a literature search up to July 2017. In summary, evidence on structural validity, internal consistency, and reliability of the BIS was rated as sufficient, and the quality of evidence was moderate. Measurement error and responsiveness were rated as indeterminate, and hypothesis testing for construct validity was rated as inconsistent with a low quality of evidence. None of the studies reported on criterion validity and cross-cultural validity.

For structural validity, a one-factor solution was found and evidence was rated as sufficient. However, one fair quality study and subgroup analyses in two studies of at least good quality showed a two-factor structure [14, 16, 33]. Hopwood et al. [14] found a two-factor structure among breast cancer patients after mastectomy, and Khang et al. [33] found one after surgery with immediate breast reconstruction. These two factors were labeled as “attractiveness” and “satisfaction with body” [14, 16]. However, there was no agreement on precisely which items belonged to which factor. In addition, the findings were inconsistent, and those of Khang et al. [33] were based on a relatively small study sample (subgroup n < 50). Further research is therefore needed to investigate whether the BIS is a unidimensional construct in all breast cancer patients, regardless of treatment modality.

Evidence on reliability was sufficient because it met the criterion of 0.70 in three out of four studies (range 0.67–0.92). The one study that found a correlation < 0.70 had a large time interval (6 months) between the two measurements and was therefore judged as having a poor methodological quality. It is known that body image symptoms can change in the first few months after cancer treatment [14], with patients reporting high deterioration and recovery trajectories [39]. Moreover, body changes (e.g., weight fluctuations or healing of wounds) can occur within half a year. A 7–14-day interval for test-retest reliability is in general considered most appropriate [30].

Measurement error was not reported in any of the included studies, but the SDC could be calculated in three studies. When only good and fair quality studies are taken into account, the smallest change in score that can be detected beyond measurement error ranges from 4.7 to 9.1 [15, 32], on a total BIS range of 0–30. However, these data are difficult to interpret since no information is available on the anchor points minimal important change (MIC) or minimal important difference (MID). Therefore, further research is needed to establish these anchor points for changes that are important.

Evidence on hypothesis testing for construct validity was inconsistent since findings for known-group comparisons and convergent validity were inconsistent. Known-group comparisons in most studies focused on body image issues related to surgical treatment (comparing breast cancer patients treated with mastectomy versus BCS). It is known that other types of treatment may also impact body appearance. For example, cancer survivors who received chemotherapy reported that hair loss and weight gain disrupted their body image [40, 41]. In addition to recommendations to include cancer populations other than breast cancer patients [20], we also recommend studying construct validity of the BIS while taking into account the impact of various cancer treatments on body image.

With respect to convergent validity, correlations with other body image scales were inconsistent. There were indications that consciousness of appearance (DAS24) and shame (ESS) are related to body image, with moderate to high correlations [17]. However, correlation with investment in appearance (ASI-R) was low [17, 18]. Moreover, the relation with self-esteem (RSES) was inconsistent, with only one of two studies finding a high correlation [31, 33]. Given these contradictory findings and the fair quality of these studies, no firm conclusions can be drawn about convergent validity of the BIS. This contradicts the conclusion of Muzzatti et al., who reported adequate convergent validity [20].

Evidence for responsiveness was indeterminate. Only one study of good methodological quality reported a change in BIS scores postoperatively [14], but no hypotheses were formulated on the expected magnitude of change and no comparison with another instrument was made. More research is needed about the ability of the BIS to detect change in body image symptoms over time.

A limitation of this review is that content validity was not investigated because at the time we conducted our data extraction, no protocol existed to investigate content validity through a systematic review. Recently, this protocol has become available [42]. Another limitation is that a precise filter instead of a sensitive filter was used. The precise filter was a pragmatic choice because a sensitive filter would provide too many hits to feasibly screen since the overall search encompassed 39 PROMs (Prospero ID 42017057237) [22]. There is a small possibility that validation studies of the BIS may have been missed. Lastly, the assessment of quality ratings was performed by one rater. This rating was then checked by a second independent rater, and discussed until consensus was reached. The gold standard practice is to have the assessment done by two raters independently because raters initially may have different opinions and consensus is needed.

This systematic review provides in-depth insight into the current evidence on the BIS as an instrument to measure body image in cancer patients and complements a recent review [20]. For researchers who want to further study the psychometric properties of the BIS, this paper points out future directions. With respect to reliability, this includes examining measurement error and research on minimal important change. Regarding validity, existing evidence on content validity should be summarized and new evidence is needed for cross-cultural validity. Criterion validity is impossible to assess, since a “gold standard” for assessing body image is not available. Efforts are therefore needed to reach consensus on a measure that could serve as second best. This may comprise body image scores by proxies such as health care providers with vast experience in the targeted study population. Furthermore, it would be valuable to examine more thoroughly whether a two-factor structure exists among cancer subgroups (patients who had reconstructive surgery or amputation of a body part). High-quality studies exploring convergent validity with investment in appearance (ASI-R) and self-esteem (RSES) are recommended. Finally, responsiveness should be more thoroughly investigated by formulating hypotheses for change scores in the BIS compared to change scores in other instruments. The BIS has mainly been tested in patients who were surgically treated for breast cancer. Further research including a wider variety of cancer patients and treatment modalities is recommended. New validation studies with good methodological quality can further strengthen the evidence regarding the measurement properties of the BIS.