Introduction

Regular physical activity reduces the risk of premature death, cardio- and cerebrovascular disease, metabolic disorders and some forms of cancer [1, 2]. Based on the overwhelming evidence, the World Health Organization recommends that adults perform ≥150 min of moderate-intensity aerobic physical activity, or ≥75 min of vigorous-intensity aerobic physical activity, per week [3]. More recently, the importance of sedentary behaviour (SB) for health has emerged. High levels of SB are associated with an increased risk of premature death, cardiovascular disease, metabolic disorders and cancer [4,5,6], with especially strong associations in those who are physically inactive. These observations highlight the importance of accurately measuring physical activity and SB in order to understand their respective roles in health outcomes.

Various devices [7] and questionnaires [8] are available to assess physical activity. Since SB is a distinct behavioural entity and not simply reflective of the lack of sufficient physical activity, these measures may not directly assess SB [9]. Furthermore, in contrast with structured exercise, SB occurs habitually throughout the day, making valid assessment of SB challenging. SB is defined as any activity during awake time with an energy expenditure ≤1.5 METs (i.e. sitting or activities in reclining posture) [9, 10]. Patterns and total volume of SB can be assessed using objective measures such as thigh-worn accelerometers combining acceleration and posture, which are currently regarded as the gold standard to quantify free-living SB and to distinguish between sitting or lying, standing and physical activity [11]. Nonetheless, used in isolation, these objective measures do not distinguish between different domains (e.g. occupation, transportation and leisure time) and settings (e.g. TV viewing, car driving and sitting while reading) of SB. This is important since some settings of sitting, e.g. TV viewing and screen time, are more strongly associated with poor health outcomes than total sedentary time [12,13,14] and may serve as useful intervention targets. These observations emphasise the need for valid subjective measures to assess SB within the various domains and settings in which it occurs. Ideally, these measures should be taken in combination with objective assessments [15]. However, given that this is not always possible or feasible, it is also important to understand the measurement properties of self-report methods when they are used in isolation.

Several self-reported tools (i.e. questionnaires, logs and diaries) have been developed recently to measure SB. These tools vary from single-item questions to extensive questionnaires about SB considering various domains. To date, some reviews have compared the validity and reliability of these tools [15, 16]. However, previous reviews did not take the risk of bias across studies into account and did not combine the results into a meta-analysis. Knowledge about the validity, reliability and the quality of the studies performed is essential to plan, perform and correctly interpret results in this field of research, because measurement error may seriously impact study results. The aim of this systematic review and meta-analysis was to identify subjective methods to assess SB and, subsequently, to examine their validity and reliability to assess SB in adults, where the sedentary time measured by subjective methods was compared to objective and other subjective methods. This overview will contribute to improved selection of appropriate subjective measures of SB (in relation to the research question) and help to identify gaps in knowledge within this area of research.

Methods

Data sources and literature search

A literature search was performed in the MEDLINE, EMBASE and SPORTDiscus databases. The search strategy combined three main search terms: sedentary behaviour, self-reported measures, and validity/reproducibility. The complete search strategy is shown in Additional Table 1. The last search was performed on March 11th, 2020. All citations were imported into the bibliographic database of EndNote, version X7 (Thomson Reuters, New York City, NY). This review was registered in PROSPERO (number CRD42018105994) and the ‘Preferred Reporting Items for Systematic Reviews and Meta-Analyses’ (PRISMA) [98] guidelines were used to perform the systematic review and meta-analyses.

Table 1 Description of measurement tools to determine sedentary behaviour

Selection of papers

After importing all citations into EndNote, duplicates were removed, and title, abstract and full text were independently screened by two reviewers (EB, YH). In case of disagreement, a third reviewer (TE) was consulted. Inclusion criteria were: 1) assessment of SB, 2) evaluation of subjective measurement tools, 3) being performed in healthy adults, 4) manuscript written in English, and 5) paper was peer-reviewed. Papers were excluded if the study did not aim to determine any construct of SB, if the study did not investigate the validity or reliability of the tool and/or if the aim was to cross-culturally validate the subjective tool in different languages. A flowchart of the search strategy and the inclusion of manuscripts is presented in Fig. 1.

Fig. 1 Flowchart of the inclusion of studies

Data extraction, synthesis and analysis

Study characteristics were extracted using an extraction form including: 1) study population, 2) number of participants, 3) gender and age, 4) the construct of SB measured (domain, setting, recall period, number of questions), 5) measurement outcomes (e.g. total sedentary time, breaks in sitting time, bouts), 6) comparison measure when validity was assessed, 7) interval between first and second measure when reliability was assessed, and 8) results of the measurement properties (e.g. intraclass correlation coefficients [ICC], correlations, mean bias with limits of agreement, kappa values and sensitivity/specificity). The extraction form was created by one reviewer (EB) and piloted by both reviewers (EB, YH). The pilot was performed using 10 randomly selected studies and changes were made to improve the extraction form. The quality of the studies was determined using the checklist with 4-point scale of the COSMIN (Consensus-based Standards for the selection of health Measurement Instruments) criteria [99,100,101]. The COSMIN checklist contained items about criterion validity (Additional Table 2) and reliability (Additional Table 3). For each item, different design requirements and statistical methods were rated on quality using a 4-point scale. A methodological quality score per item was obtained by taking the lowest rating of any score per item (‘worst score counts’) [101].

Table 2 Construct validity of subjective sedentary behaviour measurement tools
Table 3 Reliability of subjective sedentary behaviour measurement tools

Assessment of construct validity and reliability

Criterion validity was defined as the degree to which the outcome measure measures the construct it purports to measure [103]. Thigh-worn accelerometry (e.g. activPAL) was considered the gold standard for total sedentary time, as it can more accurately distinguish between sitting and standing [11]. Hip-, waist- and wrist-worn accelerometers are frequently used as criterion measure. However, these accelerometers are not sensitive enough to distinguish between stationary standing and sitting [104]. On these grounds, studies using only hip-, waist- and wrist-worn accelerometers as criterion measure were graded with a lower level of evidence. In addition, if validity results of both thigh-worn accelerometers and hip-, waist- or wrist-worn accelerometers were included in the study, only the results of the thigh-worn accelerometers were reported in this review.

Reliability was defined as the degree of consistency and reproducibility of a measurement tool. Test-retest reliability is often assessed using an ICC [103]. Since Pearson and Spearman correlation coefficients neglect systematic errors, the use of Pearson and Spearman correlation coefficients was considered inadequate and these studies were graded with a lower level of evidence. In addition, if studies provided both ICCs and correlation coefficients, only ICCs were reported in this review. An ICC > 0.90 was considered excellent, an ICC between 0.75 and 0.90 good, an ICC between 0.50 and 0.75 moderate and an ICC < 0.50 poor [105].
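To illustrate why correlation coefficients can overstate test-retest reliability, the following minimal Python sketch compares a Pearson correlation with an absolute-agreement ICC on invented data in which the second measurement is systematically 2 h/day higher than the first. The data, the function names and the choice of a two-way, absolute-agreement, single-measure ICC are assumptions made for illustration only; the included studies may have used other ICC forms. The handling of values exactly on a category boundary in `classify_icc` is likewise an assumption.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_absolute_agreement(x, y):
    """Two-way, absolute-agreement, single-measure ICC for two sessions.

    Assumed ICC form for this sketch; computed from the mean squares
    of a two-way ANOVA without replication.
    """
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = mean(list(x) + list(y))
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_cols = n * ((mean(x) - grand) ** 2 + (mean(y) - grand) ** 2)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)              # between-participant mean square
    msc = ss_cols / (k - 1)              # between-session mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k / n * (msc - mse))


def classify_icc(icc):
    """Categories used in this review; boundary handling is an assumption."""
    if icc > 0.90:
        return "excellent"
    if icc >= 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"


# Hypothetical sitting time (h/day); the retest is shifted by a constant 2 h.
test = [6.0, 7.5, 8.0, 9.5, 10.0]
retest = [v + 2.0 for v in test]

r = pearson_r(test, retest)
icc = icc_absolute_agreement(test, retest)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f} ({classify_icc(icc)})")
```

In this invented example the Pearson correlation is 1, whereas the ICC is about 0.56 (moderate), because the constant 2 h shift counts as disagreement for the ICC but is ignored by the correlation.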

Data analyses

A meta-analysis using random effects [106] was performed to assess the pooled validity of the 1-item questionnaires, 2 to 9-item questionnaires, ≥10-item questionnaires and logs/diaries. A random-effects model was used because it was unlikely that the included studies were functionally equivalent and the results of the included studies showed large heterogeneity. Only studies expressing validity as Pearson or Spearman correlation coefficients were included in this analysis. When no correlation coefficient was provided for total sedentary time, an (unweighted) mean was calculated based on the correlation coefficients of all settings and domains. Finally, I² was calculated, which describes the proportion of total variation in effect size that was due to systematic differences between effect sizes rather than chance [106]. Stratified analyses including only studies examining questionnaires with a good-to-excellent quality were performed to investigate whether the quality of the study affected the pooled validity. Meta-analyses were performed using R with the ‘Meta-Analysis with Correlations’ (MAc) package, version 1.1.1.
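The general idea of random-effects pooling and of I² can be sketched as follows. This is a minimal Python illustration and not a reproduction of the MAc-based analysis: it assumes a Fisher z transformation of the correlations and a DerSimonian-Laird estimate of the between-study variance, neither of which is confirmed by the text above, and the correlations and sample sizes are invented.

```python
import math


def random_effects_pool(rs, ns):
    """Pool correlations with a DerSimonian-Laird random-effects model.

    rs: correlation coefficients per study; ns: sample sizes per study.
    Fisher z scale and DerSimonian-Laird are assumptions of this sketch.
    """
    zs = [math.atanh(r) for r in rs]        # Fisher z transformation
    vs = [1.0 / (n - 3) for n in ns]        # sampling variance of z
    w = [1.0 / v for v in vs]               # fixed-effect weights
    z_fixed = sum(wi * zi for wi, zi in zip(w, zs)) / sum(w)

    # Cochran's Q and the between-study variance tau^2.
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, zs))
    df = len(rs) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # I^2: share of total variation attributed to between-study differences.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each study's variance.
    w_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(wi * zi for wi, zi in zip(w_re, zs)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    ci = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
    return {"r": math.tanh(z_re), "ci": ci, "tau2": tau2, "i2": i2}


# Hypothetical validity correlations from three studies of 100 participants.
pooled = random_effects_pool([0.10, 0.50, 0.80], [100, 100, 100])
print(f"pooled r = {pooled['r']:.2f}, "
      f"95% CI {pooled['ci'][0]:.2f} to {pooled['ci'][1]:.2f}, "
      f"I2 = {pooled['i2']:.0f}%")
```

When all studies report the same correlation, the between-study variance and I² collapse to zero and the pooled estimate equals that common value; widely differing correlations, as in the invented example, give a high I².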

Results

Search results

The literature search resulted in 2423 hits (Fig. 1). After excluding duplicates, 1272 studies were screened for title and abstract. Most papers were not eligible for this review because: i. the articles did not aim to determine SB, ii. no measurement properties were assessed, and/or iii. the study was performed in children or diseased populations. In total, 82 studies and 75 self-reported measurement tools were included (Table 1).

Attributes of the questionnaires, logs and diaries

The majority of the subjective measures were questionnaires and contained different domains and settings of SB (Table 2). Measurement tools differed regarding the timing (week vs weekend), recall period and number of questions. Nearly all self-reported measurement tools expressed SB in total sitting time (hrs/day or hrs/week). The PASB-Q, SITBRQ, SIT-Q, SIT-Q-7d, TASST and several other questionnaires [31, 54, 61, 67, 69, 71, 78, 79] included total sitting time, but also information about sitting bout duration or breaks in sitting time.

Validity

A total of 80 studies examined the validity of one or more methods to assess SB, resulting in a comparison of 96 unique methods (Table 2). Of the 96 results, 5 were rated as excellent quality, 7 as good quality, 9 as fair quality and 75 as poor quality. The most important shortcoming of the validation studies was the use of an accelerometer (n = 62) to examine criterion validity of the method to assess SB. A total of 29 studies used the gold standard approach (thigh-worn accelerometer), three studies used diaries/logs and one used direct observation to assess construct validity. Most studies calculated correlation coefficients between the criterion measure and the self-reported questionnaire, which ranged from −0.01 to 0.90 for total sedentary time and from 0.02 to 0.39 for number of sedentary bouts or breaks (Table 3). Other studies used ICCs (n = 8), kappa values (n = 2), and sensitivity and specificity outcomes (n = 1) to determine the validity, and some added Bland-Altman plots with a mean difference and limits of agreement to examine the accuracy of the method to assess SB (n = 48). Figure 2a provides an overview of the correlation coefficients of all individual studies combined with the quality of the study.
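For readers less familiar with the Bland-Altman approach mentioned above, the mean difference (bias) and 95% limits of agreement can be computed as in the minimal Python sketch below. The sitting times are invented for illustration and do not come from any included study; the function name is hypothetical.

```python
from statistics import mean, stdev


def bland_altman(self_report, criterion):
    """Return the mean bias and 95% limits of agreement.

    Differences are self-report minus criterion, so a negative bias
    means the self-report tool underestimates the criterion measure.
    """
    diffs = [s - c for s, c in zip(self_report, criterion)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical sitting time (h/day): questionnaire vs. thigh-worn device.
questionnaire = [6.0, 7.0, 5.5, 8.0, 6.5]
device = [8.0, 8.5, 8.0, 9.0, 9.0]

bias, lower, upper = bland_altman(questionnaire, device)
print(f"bias = {bias:.2f} h/day, limits of agreement {lower:.2f} to {upper:.2f}")
```

A bias far from zero indicates systematic over- or underestimation, while wide limits of agreement indicate poor agreement at the individual level even when the average bias is small.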

Fig. 2

Overview of construct validity (a) and test-retest reliability (b). 1 EEPAQ, Lopez-Rodriguez et al. 2017; 2 GPAQ, Chu et al. 2018; 3 GPAQ, Cleland et al. 2014; 4 GPAQ, Kastelic et al. 2019; 5 GPAQ, Laeremans et al. 2017; 6 GPAQ, Metcalf et al. 2018; 7 GPAQ, Rudolf et al. 2020; 8 GPAQ, Wanner et al. 2017; 9 IPAQ (short), Craig et al. 2003; 10 IPAQ (short), Prince et al. 2018; 11 IPAQ (short), Rosenberg et al. 2008; 12 Modified MOSPA-Q, Chau et al. 2012; 13 PPAQ, Simpson et al. 2015; 14 SED-GIH,; 15 SQ, Aguilar-Farias et al. 2015; 16 SQ, Clemes et al. 2012; 17 TASST Single item total times, Dontje et al. 2018; 18 TASST TV time, Dontje et al. 2019; 19 TASST Single item total times, Chastin et al. 2018; 20 TASST Single item proportion, Chastin et al. 2018; 21 TASST TV time, Chastin et al. 2018; 22 T-SQ, Kozey-Keadle et al. 2012; 23 TV-Q, Kozey-Keadle et al. 2012; 24 YPAS, Gennuso et al. 2015; 25 Single item proportion (3 months), Gao et al. 2017; 26 Single item proportion (1 day), Gao et al. 2017; 27 Gupta et al. 2017 [29]; 28 AQuAA, Chinapaw et al. 2009; 29 Cancer Prevention Study-3 Sedentary Time Survey, Rees-Punia et al. 2018; 30 CHAMPS, Hekler et al. 2012; 31 CHAMPS, Gennuso et al. 2017; 32 FPACQ, Matton et al. 2007; 33 FPACQ, Scheers et al. 2012; 34 IPAQ (long), Chastin et al. 2014; 35 IPAQ (long), Chau et al. 2011; 36 IPAQ (long), Cleland et al. 2018; 37 IPAQ (long), Craig et al. 2003; 38 IPAQ (long), Rosenberg et al. 2008; 39 IPAQ (long), Ruan et al. 2018; 40 IPAQ (long), Wanner et al. 2016; 41 OPAQ, Reis et al. 2005; 42 OSPAQ, Chau et al. 2012; 43 OSPAQ, Jancey et al. 2014; 44 OSPAQ, Pedersen et al. 2016; 45 OSPAQ, van Nassau et al. 2015; 46 PAS2, Pedersen et al. 2017; 47 PASBAQ, Scholes et al. 2014; 48 PASB-Q total SB, Fowles et al. 2017; 49 PASB-Q breaks, Fowles et al. 2017; 50 PAST-U, Clark et al. 2016; 51 PAT Survey, Yi et al. 2015; 52 RPAQ, Besson et al. 2010; 53 RPAQ, Golubic et al. 2014; 54 Regicor Short Physical Activity Questionnaire [47] Molina et al. 2017; 55 SCCS PAQ, Buchowski et al. 2012; 56 SITBRQ bout frequency, Pedisic et al. 2014; 57 SITBRQ bout duration, Pedisic et al. 2014; 58 Stand Up For Your Health Questionnaire, Gardiner et al. 2011; 59 STAQ, Mensah et al. 2016; 60 TASST, Sum of domains, Dontje et al. 2018; 61 TASST Sum of domains, Chastin et al. 2018; 62 TASST Patterns, Chastin et al. 2018; 63 Survey of older adults’ sedentary time, Gennuso et al. 2016; 64 Web-based physical activity questionnaire Active-Q, Bonn et al. 2015; 65 WSWQ Time method, Matsuo et al. 2016; 66 WSWQ Percentage method, Matsuo et al. 2016; 67 Sedentary time, Clark et al. 2011; 68 Sedentary breaks, Clark et al. 2011; 69 Jefferis et al. 2016; 70 Lagersted-Olsen et al. 2014; 71 Mielke et al. 2020; 72 Sitting time, Sudholz et al. 2017; 73 Sitting breaks, Sudholz et al. 2017; 74 ASBQ, Chu et al. 2018; 75 D-SQ, Kozey-Keadle et al. 2012; 76 MPAQ, Anjana et al. 2015; 77 MSTQ, Whitfield et al. 2013; 78 PAFQ sitting time, Verhoog et al. 2019; 79 PAFQ sitting proportion, Verhoog et al. 2019; 80 PAST-WEEK-U, Moulin et al. 2020; 81 NIGHTLY-WEEK-U, Moulin et al. 2020; 82 SBQ, Kastelic et al. 2019; 83 SBQ, Prince et al. 2018; 84 SBQ, Rosenberg et al. 2010; 85 SIT-Q, Lynch et al. 2014; 86 SIT-Q-7d, Busschaert et al. 2015; 87 SIT-Q-7d, Wijndaele et al. 2014; 88 STAR-Q, Csizmadi et al. 2014; 89 TASST Chastin et al. 2018; 90 WSQ, Chau et al. 2011; 91 WSQ, van Nassau et al. 2015; 92 WSQ, Toledo et al. 2019; 93 Clark et al. 2015; 94 Clemes et al. 2012; 95 Ishii et al. 2018; 96 Marshall et al. 2010; 97 Van Cauwenberg et al. 2014; 98 Visser et al. 2013 [64]; 99 7-day SLIPA Log, Barwais et al. 2014; 100 BAR, Hart et al. 2011; 101 BeWell24 Self-Monitoring App, Toledo et al. 2017; 102 cpar24, Kohler et al. 2017; 103 EMA, Knell et al. 2017; 104 MARCA, Aguilar-Farias et al. 2015; 105 MARCA, Gomersall et al. 2015; 106 PAMS, Kim et al. 2017; 107 Time Use Survey, van der Ploeg et al. 2014; 108 Updated PDR, Matthews et al. 2013.
The studies within each category are placed randomly to avoid overlap when they are aligned. An ICC > 0.90 was considered excellent, an ICC between 0.75 and 0.90 good, an ICC between 0.50 and 0.75 moderate and an ICC < 0.50 poor.

Meta-analyses

The correlation coefficients of logs and diaries (correlation coefficient estimate [R] = 0.63 [95% CI 0.48–0.78], I²: 95%) were substantially higher than the coefficients of the questionnaires (R = 0.35 [95% CI 0.32–0.39], I²: 90%). Furthermore, correlation coefficient estimates of the questionnaires with ≥10 items (R = 0.37 [95% CI 0.30–0.43], I²: 86%) did not differ much from those of the questionnaires with fewer items (1-item questionnaires R = 0.34 [95% CI 0.30–0.39], I²: 68%; 2 to 9-item questionnaires R = 0.35 [95% CI 0.29–0.41], I²: 93%) (Fig. 2a). Stratified analyses, including only studies examining questionnaires with a good-to-excellent quality, revealed similar results (R questionnaires = 0.35 [95% CI 0.28–0.42], I²: 87%).

Reliability

Reliability for total sitting time and number of breaks in sitting time was determined in 44 studies. One study was rated with excellent quality; other studies were rated with good (n = 27), fair (n = 16), and poor (n = 8) quality. Most lower-quality studies were limited by a small sample size and the calculation of correlation coefficients instead of ICCs. The time interval between the first and second assessment ranged between 0.5 h and 15 months, but most studies had an interval of 1–2 weeks (n = 40, Table 3). The majority of the studies calculated the ICC to examine the test-retest reliability of SB, but some studies used correlation coefficients (n = 6), Bland-Altman plots with mean difference and limits of agreement (n = 2), and kappa values (n = 2). The ICC of the test-retest reliability of the subjective measures of SB ranged between 0.44 and 0.91 (Table 3, Fig. 2b). The ICC estimates were comparable between the logs and diaries, ≥10-item questionnaires, 2 to 9-item questionnaires, and 1-item questionnaires.

Discussion

Time spent in SB has markedly increased over the last few decades and is expected to continue to increase even further [107]. Since SB is associated with many adverse health outcomes [4,5,6], exposure to excessive levels of SB represents an emerging health threat, particularly in the least physically active [108]. To improve quality and guide future studies in this rapidly expanding area of research, this systematic review assessed the validity and reliability of subjective measures of SB, taking the methodological quality into account. We present the following observations. First, despite the presence of several measures to assess SB, significant variability in measurement properties and quality of the studies is present. Second, criterion validity of the subjective measures ranged from poor to excellent (R range −0.01 to 0.90), while the quality of most studies (i.e. level of evidence) was poor. Third, the validity of the logs/diaries was more favourable than the validity of questionnaires, with little improvement in validity of questionnaires when including multiple questions. Fourth, a moderate-to-good reliability was found for questionnaires and logs/diaries, with the quality of these studies being largely fair-to-good. Taken together, logs and diaries are recommended to validly and reliably assess SB when only self-report measures are available. However, considering limitations pertaining to logs and diaries (e.g. time constraints, resources), one may prefer using questionnaires in larger-scale observational studies.

Validity of measures of SB

This meta-analysis showed that the overall validity of instruments to assess SB characteristics was moderate to low. These observations raise the question whether these results relate to the poor validity of methods to assess SB per se or to the poor quality of the studies that were included. Excluding studies with lower quality from our meta-analyses reinforced the poor-to-moderate validity of the various methods, suggesting that measures of SB possess poor validity. It is important to indicate that questionnaires examining physical activity show a similarly poor level of validity [8]. This highlights the difficulty of examining subjective physical (in)activity behaviours with questionnaires, a finding that seems present across the whole physical activity spectrum: from SB to exercise. Due to the low validity and the large variation in quality, the results of different studies are difficult to compare or harmonise. More importantly, the large variety in validity and questionnaire characteristics (i.e. type and context of SBs) prevents the identification of one (or a few) questionnaire(s) that can be recommended for all types of future research that aims to examine SB.

Factors explaining the poorer validity of the questionnaires versus diaries/logs may relate to differences in qualitative attributes (e.g. recall period and questions/formats). For example, diaries/logs typically adopt a short recall period (e.g. every 15–30 min), whilst questionnaires are often filled in covering a longer recall period (i.e. day, week, and/or month). Consequently, diaries and logs are less reliant on long-term recall and can more accurately capture sporadic and intermittent behaviours. This fits with the higher validity of diaries/logs versus questionnaires. Unfortunately, this approach of using diaries/logs comes at the cost of high participant burden (in time), which subsequently may limit the response and compliance rate and introduce reporting bias. Another potential limitation of logs/diaries is that repeatedly filling in SBs may influence participants’ behaviour and cause (unwanted) adjustment of SB. These factors should be considered when deciding on the preferred way to assess SB in a future study.

Previous work related the poor validity of questionnaires to systematic and random error, specifically reporting and recall bias, which may lead to low agreement with over- and underestimation (Table 2). For example, a potential underestimation of SB in single-item questionnaires was suggested [15, 104], whereas wider limits of agreement are present in questionnaires with multiple items [104]. Another factor contributing to validity of questionnaires may relate to the number of questions, and therefore detail of information, with more questions on SB potentially improving the criterion validity of the measurement tool. In contrast to this hypothesis, our analysis revealed no substantial differences between the criterion validity of the 1-item, 2-to-9-item and ≥10-item questionnaires. One possible explanation is that participants find it difficult to recall SB, with multiple-item questionnaires making it even more complicated to replicate detailed and domain-specific patterns of SB [31]. Furthermore, some behaviours are easier to remember because these are more habitual and restricted to certain periods during the day, e.g. TV viewing, computer use or sitting at work [15, 31, 86]. Finally, multiple-item questionnaires may over-report SB because subjects may report sedentary activities twice when using sub-scales (e.g. driving while listening to music). Although more questions may cover multiple domains and provide more detailed information, the complexity of these questionnaires may contribute to the negligible improvement in criterion validity of multiple-item questionnaires for total sedentary time. Nonetheless, exploring multiple domains of sitting may still be relevant. For example, some domains are more strongly associated with poor health outcomes [12,13,14], whilst detailed information about domains may provide insight for intervention development.

Reliability of subjective measures of SB

Despite the significant heterogeneity in validity of the various measures to assess SB, the reliability of the questionnaires and diaries or logs was moderate-to-good. Importantly, these conclusions are based on studies with a fair-to-good quality. A central question pertaining to the reliability of questionnaires is whether differences are present in reliability for weekdays versus weekend days or for workdays versus non-workdays, especially given the marked differences in (sedentary) behaviour that exist between these days [104]. Indeed, our study found that approximately 50% of included studies reported a ≥10% better reliability to assess SB during weekdays versus weekend days or during workdays versus non-workdays (Table 3). These observations support a previous review, which reported higher reliability for weekdays compared to weekend days [104]. Moreover, we found that reliability was better for specific behaviours, such as TV viewing, compared to more general categories, such as ‘other leisure time activities’. An explanation for this finding is that more specific and regularly performed behaviours have a higher reliability [15].

Choosing an appropriate measurement tool

Logs and diaries have a higher validity compared to the questionnaires, are less reliant on long-term recall and can more accurately capture sporadic and intermittent behaviours. Therefore, we recommend logs and diaries as self-reported measurement tools. However, important limitations, such as time constraints, lack of resources and the potential to influence participants’ behaviour, make them less useful for large-scale observational studies and/or intervention studies. Within the spectrum of questionnaires, there is no obvious preference for a single questionnaire. In fact, the most appropriate tool seems to depend on the nature of the study, especially since this review showed large variety in both validity and questionnaire characteristics (i.e. type and context of SBs). Therefore, some studies will benefit from questionnaires focusing on specific domains of SB, whilst others will benefit from a reliable estimate of total sedentary time or distribution of SB. Furthermore, when performing an intervention study, measures will benefit from the ability to measure changes across time. Since this ability was not examined within this review, we cannot make specific recommendations related to this type of study. Nonetheless, these characteristics should be taken into account when planning such studies. Ultimately, and when feasible, a combination of objective and subjective assessments is preferred to provide valid and reliable insight into SB.

Conclusions

This review identified the widespread (and rapidly growing) use of a large range of self-reported measures of SB, which differ significantly in type, extensiveness, complexity and duration. Our results indicated that the criterion validity of subjective measures ranged between poor and excellent, whereas the quality of most studies was poor. The validity of the logs/diaries was significantly higher compared to the questionnaires, with little improvement in criterion validity of questionnaires when increasing the number of items to assess SB. Therefore, when only self-report measures are feasible, logs and diaries are recommended to validly and reliably assess SB. However, given the time constraints and resources related to logs and diaries, 1-item questionnaires may be preferred in large-scale studies, as they show similar validity and reliability compared to longer questionnaires. Whenever feasible, the combination of objective and subjective assessments will provide the most valid and reliable method to assess SB.