Introduction

Mitochondrial diseases (MD) are the most prevalent inherited metabolic diseases, with an incidence of ∼1:5000 live births (Schaefer et al. 2004). Since mitochondria are present in almost all cells, theoretically, symptoms can arise from every organ. The most commonly affected organs and tissues include the brain, eye, heart, and skeletal muscle (Koopman et al. 2012). There is enormous variability in the pattern of affected organs and degree of disability experienced. Whereas some children with MD thrive in mainstream school and live well into adult life, others follow a more rapidly progressive course and die in the neonatal period or function at a low level, barely interacting with their environment.

Currently, there is no cure for MD, but there are some promising results of pharmacological interventions in cells and animals, and the prospects for randomised clinical trials of novel and repurposed pharmaceuticals are increasing (Koene and Smeitink 2009; Wenz 2009; Viscomi et al. 2011, 2015; Koopman et al. 2012; Blanchet et al. 2015; Peng et al. 2015). Outcome measures that are valid, reliable, sensitive and clinically relevant are critical to the success of such trials, but the heterogeneity and multisystemic nature of MD pose significant challenges in choosing an appropriate, universally applicable outcome measure (Koene et al. 2013a). To be able to measure disease severity and progression within the full range of the phenotypic spectrum, a combination of objective, subjective, functional and biochemical end points will probably be necessary.

An MD-specific follow-up tool for use with children already exists: the Newcastle Paediatric Mitochondrial Disease Scale (NPMDS) (Phoenix et al. 2006). This scale was originally designed to be a concise and pragmatic clinical tool to monitor the biophysical markers of disease progression. Although the NPMDS fits this purpose from a natural disease course perspective, it was not designed as an end-point instrument and probably lacks the level of detail required for this purpose in clinical trials. Moreover, the scale was not developed to measure the clinically relevant concept of functional disability (Phoenix et al. 2006).

In this study, we aimed to adapt the NPMDS to a more clinically relevant and detailed scoring system for future clinical trials in paediatric patients with MD, thus creating the International Paediatric Mitochondrial Disease Scale (IPMDS), which should cover more symptoms indicated by patients and parents as “burdensome,” such as tiredness and lack of energy, behavioural problems and depression (Koene et al. 2013b). We also aimed to include a functional domain to quantify changes in a child’s motor abilities, since clinically relevant changes in motor function are not always equally reflected by changes in muscle power or tone and vice versa (Abel et al. 2003; Beenakker et al. 2005; Parreira et al. 2010). After a Delphi-based development process, we tested the construct validity and reliability (interrater, intrarater and test–retest) by field testing in several expert international centres.

Methods

The IPMDS was developed during a Delphi-based process by consulting patients, parents and MD experts. After the pilot reliability test, the scale was further optimised for subsequent testing in five expert centres. All raters received both written and video instructions. At each centre, two to four randomly selected patients were each assessed in succession by three to four physicians to evaluate interrater reliability. Construct validity was tested using factor analysis and hypothesis testing. In a subset of patients within these studies, test–retest reliability and intrarater reliability were tested. For a more detailed description of the methods, see supplementary file 1.

This study was conducted in all indicated countries after approval from the regional Medical Research Ethics Committee (MREC NL.44833.091.13). In accordance with the Declaration of Helsinki, written informed consent was obtained from each participant or his/her legal guardian(s).

Statistics

Because of the relatively small number of patients in our study, we used nonparametric tests for analyses and report medians and ranges. Parent and patient experiences were assessed for each patient individually. From a rater perspective, the number of blank items was counted to test feasibility. Factor analysis was used to explain the variance–covariance matrix of the data in terms of relationships between a much smaller number of unobserved variables, called factors. We used data from the rater with the least missing data, and missing items were replaced using the mean of the other three raters. We removed items for which <80 % was completed, including items that could be assessed only when the child was >6 years old (hopping, running and rotating) or only when the child was able to report complaints (such as headache, gastroesophageal reflux, muscle pain, and vibration or subtle touch at physical examination). Items with little variance (less than two items score 1 or more) were also removed for the factor analysis. Principal component analysis (PCA) was used as the extraction method for factors. Orthogonal rotation (varimax) with Kaiser’s normalisation was used to simplify the interpretation of factors. The number of factors extracted was based on eigenvalues >1. Sampling adequacy was determined using the Kaiser-Meyer-Olkin (KMO) measure. In addition, Bartlett’s test of sphericity was applied to test whether correlations between items were sufficiently large for PCA. The percentage of variance explained by each factor is also presented. Sum scores for the factors were calculated using clinically suitable items. Cronbach’s alpha was calculated as a measure of internal consistency of each constructed factor. The difference between patients with mild and with severe disease (median value rated by physicians) for these sum scores was tested using the Mann–Whitney U test.
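
As an illustration of the internal-consistency measure mentioned above, the following is a minimal Python sketch of Cronbach’s alpha, computed as k/(k − 1) × (1 − Σ item variances / variance of the sum score) for k items. It is not the SPSS procedure used in this study; the function name `cronbach_alpha` and the example scores are hypothetical, and complete data (no missing items) are assumed.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a patients-by-items table of scores.

    `scores` is a list of rows, one row per patient, each row holding that
    patient's score on every item of one factor (assumed complete).
    """
    n_patients = len(scores)
    n_items = len(scores[0])
    if n_patients < 2 or n_items < 2:
        raise ValueError("need at least two patients and two items")

    # Sample variance of each item (column) across patients.
    item_variances = [
        variance([row[j] for row in scores]) for j in range(n_items)
    ]
    # Sample variance of each patient's sum score across the items.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("sum scores have zero variance; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical example: three patients scored on two items of one factor.
example_scores = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(example_scores)
```

Alpha approaches 1 when the items of a factor rise and fall together across patients and drops towards 0 (or below) when they vary independently, which is why a low value suggests that the items of a factor do not measure a single construct.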

We used hypothesis testing to assess construct validity. Since all mentioned hypotheses involved two measurements of the same construct, we aimed at moderate to good correlations (ρ = 0.4–0.79). We used Spearman’s correlation coefficients to assess correlations between continuous or interval variables (IPMDS and its subdomains and functional parameters). The mean raters’ score in each patient was used to calculate the correlation. Interrater reliability between physicians within one centre seeing one patient was calculated using the intraclass correlation coefficient for agreement between raters (ICCagreement). Intrarater reliability for two physicians was calculated using ICCagreement between each rater’s scores. Test–retest reliability was calculated using ICCagreement between scores. An ICCagreement ≥ 0.7 was considered “acceptable” (Vet et al. 2011).
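
To make the notion of agreement concrete, the sketch below computes an intraclass correlation coefficient for absolute agreement from a patients-by-raters table. It assumes the two-way random-effects, single-measure form, ICC = (MSR − MSE) / (MSR + (k − 1)·MSE + k·(MSC − MSE)/n), where MSR, MSC and MSE are the mean squares for patients, raters and residual error, n is the number of patients and k the number of raters. The exact ICC model selected in SPSS is not stated here, so this choice, the function name `icc_agreement` and the example ratings are assumptions for illustration only.

```python
def icc_agreement(ratings):
    """ICC for absolute agreement (two-way random effects, single measure).

    `ratings` is a list of rows, one per patient; each row holds the score
    given to that patient by every rater (assumed complete).
    """
    n = len(ratings)        # number of patients
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least two patients and two raters")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    patient_means = [sum(row) / k for row in ratings]
    rater_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_patients = k * sum((m - grand_mean) ** 2 for m in patient_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_patients - ss_raters

    ms_patients = ss_patients / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = (
        ms_patients + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
    return (ms_patients - ms_error) / denominator


# Hypothetical example: rater 2 always scores one point higher than rater 1.
offset_ratings = [[1, 2], [2, 3], [3, 4]]
icc_offset = icc_agreement(offset_ratings)
```

In the hypothetical example the two raters rank the patients identically, yet the coefficient stays below 1 because an agreement ICC also penalises a systematic difference between raters, which a consistency ICC would ignore.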

We used a p value of 0.05 for statistical significance. Correlation coefficients were interpreted in accordance with the guidelines provided at the BMJ website (http://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression). All analyses were performed using IBM’s SPSS Statistics 22.0.0.1.
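
Although all analyses were run in SPSS, the rank correlation used for hypothesis testing can be illustrated with a short Python sketch: Spearman’s ρ is the Pearson correlation of the ranks, with tied values receiving the average of the ranks they span. The function names `average_ranks` and `spearman_rho` and the example values are hypothetical, and the sketch computes only the coefficient, not its p value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return covariance / (spread_x * spread_y)


# Hypothetical example: two scores measured in five patients.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

Because only ranks enter the calculation, the coefficient is unaffected by monotonic rescaling of either score, which suits ordinal scale items and small samples.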

Results

Cohort description

A clinically and genetically heterogeneous cohort of 17 children aged 1.6–16 years from five expert centres participated in this study (supplementary Table 4A). Rater details are presented in supplementary document 4B.

Scale description

After our Delphi-based process, the IPMDS consists of 61 items in three domains: 23 in Domain 1 (subjective complaints and symptoms; obtained by interviewing parents); 25 in Domain 2 (physical examination; obtained by physical examination) and 13 in Domain 3 (functional assessment; obtained by physical/motor function evaluation). The final scale and the manual are presented in supplementary documents 2 and 3.

Feasibility

The average time to complete the IPMDS was 35 min; 96 % of all items (85–100 %) relevant to the patient based on age and/or mental capacities were completed. Sixteen (patients and their) parents filled out the feasibility questionnaire. All patients and parents indicated that the number of questions, the burden of the physical examination to the child, the duration of the interview, physical examination and the total time burden were just right or could be (much) longer. The parents of a 2-year-old patient experienced difficulties in translating the questions to the situation of a toddler. This was not reflected in the percentage of completed items for toddlers (94 %; 91–97 %) or in the interrater reliability for the first and the third domain (ICCagreement 0.86 versus 0.74 and 0.96 versus 0.97, respectively, for toddlers versus all children). The interrater reliability of the second domain was low in toddlers compared with the whole cohort (ICCagreement 0.39 versus 0.81).

Factor analysis

A total of 44 of 61 items was included in the factor analysis. Based on factor loadings and communalities, we were able to include 32 of the 49 items in one of the factors (Table 1). Items with little variance were removed (1.17 diarrhoea, 2.03 alertness, 2.04 breathing, 2.09 nystagmus). Sample adequacy of the factor analysis was sufficient for all items within the IPMDS (KMO 0.51). Bartlett’s test of sphericity indicated that correlation between items was sufficiently large for PCA (p < 0.001). We identified three factors (Tables 1 and 2), naming them based on the items they contained: 1, basic functioning (31.1 % of explained variance); 2, eating and digesting (14.1 % of explained variance); 3, abnormalities at neurological examination (12.7 % of explained variance). The internal consistency of these factors was good for factor 1 (Cronbach’s alpha = 0.97), acceptable for factor 2 (Cronbach’s alpha = 0.76) and poor for factor 3 (Cronbach’s alpha = 0.16). The total percentage of variance explained by the three factors was 57.9 %. Spearman’s correlation coefficients between factors (Table 3) indicate that each factor represents a unique entity.

Table 1 Factor analysis: factor loadings for individual items. Based on clinical communalities between items, we composed three factors (in bold) including 32 of the 49 items
Table 2 Factor analysis: factors and their characteristics resulting from factor analysis of items within the IPMDS
Table 3 Factor analysis: Spearman’s correlation coefficients between sum scores of factors

Construct validity

We hypothesised that patients with a severe general MD severity would have a higher median score on extracted factors compared with patients with mild disease (Table 4). This hypothesis was confirmed for factor 1 (basic functioning; p = 0.01; n = 13) but not for factors 2 (eating and digesting) and 3 (abnormalities at neurological examination) (p = 0.07 and 0.28, respectively). Table 5 presents correlation coefficients between factors, IPMDStotal, IPMDS subdomains, severity scores, the Pediatric Evaluation of Disability Inventory (PEDI) and the NPMDS. Correlation between the basic functioning factor and PEDI, rating the functional performance and abilities of the child, was excellent (ρ = 0.90; p < 0.001); correlation between the factor abnormalities at neurological examination and rater-rated severity of abnormalities at general physical examination was weak (ρ = 0.23; p < 0.05). All predefined hypotheses used for construct validity testing indeed had good to excellent correlations (ρ = 0.64–0.90; Table 5).

Table 4 Factor analysis: median and IQR of factor sum scores for patient groups within different disease stages and difference between groups
Table 5 Construct validity of the IPMDS: correlations between tests

Reliability

Table 6 shows the interrater reliability of the IPMDS. Interrater reliability of sum scores of the three factors was good (ICCagreement = 0.98, 0.94 and 0.78 for basic functioning, eating and digesting, and abnormalities at neurological examination, respectively). The median ICCagreement of all individual items within the IPMDS was 0.85 (0.23–0.99): 0.81 (0.44–0.98) in the first domain, 0.74 (0.23–0.93) in the second domain and 0.97 (0.93–0.99) in the third domain (Table 6).

Table 6 Interrater reliability of items on the International Paediatric Mitochondrial Disease Scale (IPMDS)

Small pilot studies on video-rated intrarater and test–retest reliability showed good intrarater reliability (median ICCagreement = 0.87; n = 4). Test–retest reliability by telephone interview of the first domain after 1 week was excellent when performed by the same rater (median ICCagreement = 1.0 in 15 of 16 items with variance; n = 4) but inconsistent when performed by a different rater (median ICCagreement = 0.73; n = 4).

Discussion

The IPMDS, a multidimensional scale rating clinically relevant aspects of MD in children, was developed during a Delphi-based process by an international expert team in close dialogue with patients and their parents. Designing a scale for general MD severity is challenging because of the wide variability in symptoms and (dis)abilities seen in children with MD. This complexity is reflected in the high number of items included in the IPMDS and the breadth of items reflecting the multisystemic nature of the disease. Critically evaluating psychometric properties in 17 children in five international expert centres, we found good feasibility and an acceptable construct validity and reliability.

We found a suboptimal interrater reliability for some IPMDS items, mainly within the physical examination domain. This is in agreement with literature reports in which low interrater agreement in gait analysis, neurological reflexes and classification of movement disorders has been reported (Miller and Johnston 2005; Singerman and Lee 2008; Bella et al. 2012; Beghi et al. 2014). Lower agreement was observed in patients with developmental delay (Hsue et al. 2014) and in general neurologists compared with residents and experts (Beghi et al. 2014). Our teams of raters reported difficulties in agreeing on hypertonia and rigidity. This might be explained by the mixed picture of pyramidal and extrapyramidal syndromes in patients with MD (and more specifically in patients with Leigh syndrome, which represents about half of the cases included in this study). Moreover, tone is highly dependent on the child’s level of alertness, emotional state, fatigue level and posture (Sanger et al. 2003), and changes in tone during the day were frequently reported by our raters.

Although the IPMDS was adapted from the NPMDS, there are important differences. First, the NPMDS consists of 26 items with mostly a 0–3 scale, whereas the IPMDS consists of 61 items, mostly on a 0–5 scale. For example, the “feeding” item in the NPMDS was replaced by items on chewing (including ability to chew, e.g. bread crusts and meat), swallowing (including difficulties/choking with dry food or fluids) and vomiting. The IPMDS therefore takes longer to execute (on average: 35 min) but is considerably more detailed. Although the burden to the patient does not seem to be unacceptably high (as indicated by the positive evaluation of the IPMDS by patients and parents), the feasibility of performing the IPMDS as part of daily care at the outpatient clinic, a central tenet of developing the NPMDS, is questionable. Secondly, since behavioural problems and speech and language problems were among the most burdensome complaints experienced by patients and their parents (Koene et al. 2013b), we included these items in the complaints and symptoms domain and in the functional (communication) domain of the IPMDS. Thirdly, the IPMDS has a functional domain that is expected to be the most objective, responsive and relevant component for natural history and intervention studies. Lastly, we use the same scoring system for all ages (with the exception of some functional items). Although this complicates the analysis of very young children (which was also difficult in the NPMDS for 0- to 24-month-olds, since newborns differ from toddlers), longitudinal analysis is more meaningful when using the same scoring system. These differences are illustrated by the lack of a statistically significant correlation between the IPMDS and the NPMDS.

Strengths of our study include adherence to the applicable guidelines of the US Food and Drug Administration (Administration 2009) in designing the IPMDS, the input of an expert international team to IPMDS content, and validation in five expert international centres in a heterogeneous group of children with MD. However, since we aimed at a scale covering the whole phenotypic spectrum of MD, the IPMDS may contain many irrelevant items for individual patients. Disadvantages of the study include the undetermined influence of normal development on the score and the relatively small study population per centre. The small population also affects the validity of the factor analysis, so exploratory factor analysis should be repeated when more data are obtained. Lastly, eight out of 17 patients in this study had Leigh syndrome, limiting the generalisability of results.

Based on our data, we suggest using the IPMDS for assessing natural history in a larger and more heterogeneous population of children with MD. Since the IPMDS also includes subjective and functional parameters, these natural history data will not only provide clinicians with relevant information about which patients are at risk of developing, for example, cardiomyopathy or renal failure but will also provide relevant prognostic information to patients and their parents. Data collected in these natural history studies should undergo another exploratory factor analysis to obtain further evidence for construct validity of the IPMDS. (We kindly invite you to upload your IPMDS scoring sheets anonymously on our website http://www.rcmm.info/ipmds.) In addition, such data will facilitate setting age limits for the IPMDS, since both the number of missing items and the interrater reliability indicate that adaptations are necessary for babies and toddlers.

The aim of this study was to adapt the NPMDS to a more clinically relevant and detailed scoring system for future clinical trials in paediatric patients with MD. This is the first report on the rationale for IPMDS development, including a first impression of the feasibility, reliability and validity of the current scale. Using the anonymously uploaded data from international collaborators, as well as more detailed studies on the psychometric properties in larger and more heterogeneous populations with a larger age range, we aim to optimise the scale further. Examples of future research questions include: Is the interrater reliability of Domain 2 acceptable after optimising the instructions in the manual? What is the minimal clinically important difference (Wright et al. 2012)? What is the influence of normal development and growth on the reliability and construct validity of the IPMDS, including the possible requirement of age-specific scales?

In conclusion, the IPMDS seems to be a robust tool for the follow-up of children with MD. Data obtained in larger and more heterogeneous populations included in natural history studies, in combination with a close dialogue with parents regarding the minimal clinically important difference, will further substantiate the instrument for clinical trials.