Introduction

A clinical hallmark of motor neuron diseases is a progressive loss of strength. This loss underlies much of the disability that patients encounter, and is a major driver of healthcare costs associated with this constellation of diseases. Although other factors, such as upper motor neuron burden, may alter function, eating, breathing, speaking, ambulation, and fine motor control are all dramatically affected by changes in muscle strength. A wide range of outcome measures have been employed in clinical trials, including survival, functional scales, and measures of specific functions such as vital capacity, sniff nasal inspiratory pressure, timed up and go test, walking distance for a defined period of time, and many others. However, as all of these are, in large part, a function of muscle strength, direct strength measurements have been a part of the vast majority of clinical trials of experimental therapies for motor neuron disease. It is critical, therefore, that strength measurements be sensitive, repeatable, and performed in the same manner across study centers in multicenter clinical trials. This review will discuss the ways that muscle strength has been measured in clinical trials, and address the positive attributes and limitations of the various methods. The relationships between strength measures and other outcome measures will also be addressed.

Methods of Strength Assessment

Manual Muscle Testing

Manual muscle testing (MMT) was first described in 1912 to assess the status of patients with poliomyelitis [1]. In modern clinical settings, strength is most often assessed using the MMT scale established by the Medical Research Council of the Royal College of Physicians and Surgeons [2]. In its original form, this scale grades strength of individual muscles on a scale from 0 to 5, with 0 representing no muscle function and 5 indicating normal strength. Grade 1 implies observation of muscle activation without movement, grade 2 requires the ability to move with gravity eliminated as a force, grade 3 means that a muscle can move a limb against gravity, and grade 4 requires good but not normal muscle power. In clinical trial settings, this scale has sometimes been expanded, either by adding pluses and minuses or, equivalently, by extending it to 10 points, with similar anchors but the ability to grade these subjective impressions in a somewhat finer manner.

MMT strength grading has been used in a number of clinical trial settings. In the phase III trial of riluzole versus placebo in patients with amyotrophic lateral sclerosis (ALS), a statistically significant benefit of riluzole was noted with respect to survival, with a suggestion of a dose-dependent trend toward increased efficacy with higher doses [3, 4]. However, MMT strength grading showed no difference from placebo, or any hint of a dose effect. In a later study comparing riluzole serum levels to both survival and MMT testing, no effect of riluzole concentration was found on either measure [5]. In a phase III study of minocycline in ALS, a statistically significant trend toward faster progression in the ALS Functional Rating Scale-revised (ALSFRS-R) was noted in patients treated with minocycline versus placebo; there was a trend in the same direction for MMT, but this was not significant [6]. A large, phase II trial of TCH346 in ALS showed a trend toward detriment on most outcome measures assessed but no effect on MMT [7]. Other studies have employed MMT testing in ALS, spinobulbar muscular atrophy, and spinal muscular atrophy (SMA) [8–12]. However, in most studies, no therapeutic benefit was noted with any outcome measure, so that the relative sensitivity of MMT testing compared with other measures could not be assessed. One study that suggested a therapeutic benefit in spinobulbar muscular atrophy has been recently reported; in this study, clenbuterol treatment was associated with a benefit as measured by the 6-min walk test, while no effect was noted in strength as measured with MMT testing [13].

To evaluate the properties of MMT testing as an outcome measure in ALS, the Great Lakes ALS Consortium performed a multicenter natural history study, evaluating patients with ALS longitudinally every 3 months for 1 year [14]. Eighteen muscle groups were evaluated bilaterally for a total of 36 muscles tested; evaluators all attended a training course to maximize consistency of measurement and technique. A 10-point scale was used. Coefficient of variation of rate of change [CoV(r)] was calculated for single muscles, as well as averages from 2 to 36 muscle groups. CoV(r) is a measure that incorporates variability from reliability of measurement, as well as intrinsic variability of progression from patient to patient. Not surprisingly, CoV(r) improved as more muscles were averaged together; with 36 muscle groups, CoV(r) was as good as or better than that of many other ALS outcome measures, including the commonly used ALSFRS-R and vital capacity.
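To make the CoV(r) idea concrete, the following is a minimal sketch rather than the consortium's actual procedure. It assumes that each patient's rate of change is the least-squares slope of score against time, and that CoV(r) is the SD of those slopes divided by the absolute mean slope. The function names and the patient data are hypothetical.

```python
from statistics import mean, stdev

def slope(times, scores):
    """Ordinary least-squares slope of scores against times."""
    t_bar, s_bar = mean(times), mean(scores)
    num = sum((t - t_bar) * (s - s_bar) for t, s in zip(times, scores))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den

def cov_of_rate(patients):
    """SD of per-patient slopes divided by the absolute mean slope (assumed definition)."""
    slopes = [slope(times, scores) for times, scores in patients]
    return stdev(slopes) / abs(mean(slopes))

# Hypothetical data: (months, averaged MMT score on a 10-point scale) per patient,
# with visits every 3 months for 1 year.
patients = [
    ([0, 3, 6, 9, 12], [9.0, 8.5, 8.0, 7.5, 7.0]),
    ([0, 3, 6, 9, 12], [8.0, 7.0, 6.0, 5.0, 4.0]),
    ([0, 3, 6, 9, 12], [9.5, 9.0, 8.0, 7.5, 6.5]),
]
cov_r = cov_of_rate(patients)
print(f"CoV(r) = {cov_r:.3f}")
```

Under this definition a smaller CoV(r) means that rates of decline are more consistent relative to their average, which is why averaging more muscles, and so reducing measurement noise in each slope, would be expected to lower it.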

Quantitative Muscle Testing

Although the abovementioned studies show that MMT testing can be performed reliably in patients with motor neuron diseases, the measure itself is subjective, and it is clear that the scaling does not meet the requirements of an interval scale. The fact that several studies suggesting therapeutic change using other outcome measures did not show effects on MMT muscle strength testing also raises questions regarding the sensitivity of this measure with respect to its ability to detect meaningful therapeutic benefit. For these reasons, a method to measure muscle strength quantitatively is potentially attractive.

The relationship between MMT strength testing and isometric strength as measured by a strain gauge was originally evaluated by van der Ploeg et al. [15]. In this study, classic MMT strength grading of the biceps was compared with isometric strength (Fig. 1); antigravity strength (MMT grade 3) only required 2 % of maximal biceps strength, with grades 4 and 5 spanning the remaining dynamic range of the muscle. It was noted that approximately 80 % of the strength range of the biceps was graded as 4. Thus, in a clinical trial, a patient could lose > 50 % of his/her muscle power without a change in MMT level. Other studies have shown that a given level of isometric strength is variably graded by different trained evaluators, and by the same evaluator over repeated testing sessions. Andres et al. [16] showed that left ankle dorsiflexion of 40 % of normal maximal force was graded between 3 and 8 by different evaluators using the modified 10-point MMT grading system. Given the fact that most of a muscle’s dynamic range can fall within 1 MMT grade, and the fact that different evaluators grade the same strength markedly differently, a quantitative muscle measurement system is clearly to be desired.

Fig. 1

A comparison between Medical Research Council strength grading and quantitative strength measurements for biceps femoris. Reprinted with permission from van der Ploeg et al. [15]. Copyright 1984 Journal of Neurology

A variety of quantitative measures have been employed, some using strain gauges and others with hand-held myometers [15, 17]. However, the first well-defined method to be used in clinical trials of motor neuron disease was developed by Munsat and Andres [16, 18–21]. Named the Tufts Quantitative Neuromuscular Evaluation (TQNE), the entire instrument included measurements of muscle strength, pulmonary function, and timed motor tasks. With respect to muscle strength, standardized patient positions were defined, and isometric strength was measured with a strain gauge that was moved around the patient so as to be orthogonal to specific muscle groups. The full battery included 9 muscle groups measured bilaterally, plus handgrip measured with a separate grip dynamometer.

Longitudinal studies of patients with ALS demonstrated a number of important findings. First, though careful patient positioning and rigorous evaluator training resulted in reproducible measurements for individual muscles over time, reliability was increased if certain muscle groups were considered together. To determine which muscle groups could most effectively be combined, Andres et al. [21] performed a factor analysis on strength data from single studies of 176 patients with ALS to determine how different muscle groups were intercorrelated. This analysis showed that muscles from the arms could effectively be combined, as could muscles from the legs. As absolute strength for different muscles can be vastly different, each muscle strength measurement was linearly transformed to a z score, using ALS population means and SDs for every muscle. Muscles in different body regions were then averaged to yield a more global value called a megascore. Declines in megascores for the arms and legs were, in general, very linear over time; however, significant differences were noted both from patient to patient and from one area of the body to another. Decline in leg strength was slightly slower than decline in arm strength in patients with ALS [22].
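The z-score and megascore steps described above can be sketched as follows. This is only an illustration under stated assumptions: the reference means and SDs, the muscle names, and the measured forces are all hypothetical and are not values from the TQNE studies.

```python
from statistics import mean

# Hypothetical ALS population reference values (kg): muscle -> (mean, SD).
REFERENCE = {
    "elbow_flexion": (10.0, 4.0),
    "elbow_extension": (8.0, 3.0),
    "shoulder_flexion": (9.0, 3.0),
}

def z_score(muscle, force_kg):
    """Linear transform of a raw force to SD units relative to the reference population."""
    ref_mean, ref_sd = REFERENCE[muscle]
    return (force_kg - ref_mean) / ref_sd

def megascore(measurements):
    """Average of z scores across the muscles of one body region."""
    return mean(z_score(muscle, force) for muscle, force in measurements.items())

# Hypothetical arm measurements for one patient at one visit (kg).
arm = {"elbow_flexion": 6.0, "elbow_extension": 5.0, "shoulder_flexion": 7.5}
arm_megascore = megascore(arm)
print(f"arm megascore = {arm_megascore:.2f}")
```

Because each muscle is first expressed in SD units, a strong muscle such as a knee extensor does not dominate the average simply because its absolute force is larger.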

Several studies have directly compared quantitative strength testing with MMT grading. Andres et al. [16] compared progression in patients with ALS using both TQNE and MMT grading, and noted that reproducibility and sensitivity to decline were both strikingly greater using quantitative evaluations. The Great Lakes ALS Consortium [14] evaluated both the effect of averaging strength in multiple muscles and the relative sensitivity to change with MMT testing and TQNE. Not surprisingly, the characteristics of the measure improved with number of muscles evaluated both for TQNE and MMT. However, for any given number of muscles averaged, the characteristics of TQNE were superior to those of MMT grading.

Quantitative isometric strength using the TQNE system has been used in several multicenter ALS trials. In a trial of celecoxib in ALS [23], no significant differences between celecoxib and placebo were found, including for muscle strength. As previously noted, leg strength declined slightly more slowly than arm strength. In a trial of topiramate in ALS [24], only arm strength was measured. A deleterious effect was found for topiramate on all measures; this effect was not statistically significant for the ALSFRS-R or for vital capacity, but was significant for the arm megascore. The percent difference over time between treatment groups was greatest for the arm megascore compared with other measures. A small, phase II study comparing talampanel 50 mg orally 3 times daily to placebo in 60 patients over 9 months showed a trend toward reduced strength loss in patients treated with talampanel; a subsequent phase III study using MMT testing did not confirm this finding [25].

Both natural history studies and clinical trial data show that quantitative strength testing using the TQNE apparatus provided high-quality, reproducible data on muscle strength. However, in the clinical trial setting, several aspects of testing proved problematic. First, the apparatus was quite large, requiring a full examination room. Many clinical trial sites found that committing a full room for testing that occurred occasionally at most was too great an investment. Second, to test all of the muscles suggested in the original evaluation required that patients assume a variety of positions on a physical therapy table, including fully supine and fully prone. Such position changes were fatiguing for many patients, and the appropriate positioning was not possible for patients who had orthopnea. Thus, as the trial progressed, an increasing number of patients were unable to complete the evaluation. In addition, a trained physical therapist or other clinician was required to perform the test.

To address these issues, the use of a hand-held dynamometer was proposed. Such dynamometers have been in frequent use in a variety of clinical situations, primarily to assess recovery after stroke or injury. In spinal cord-injured patients, use of a hand-held myometer was much more reliable than MMT testing [26]. In the clinic, quantitative strength measurements with a hand-held device have been useful in a variety of neuromuscular diseases [17, 27]. In general, however, reproducibility was not as good as for TQNE in most cases, and issues were raised about variability of both patient and evaluator positioning, as well as the fact that, for strong muscles, the strength of the evaluator might be less than that of the patient [21]. Another source of variability lay in the fact that some protocols required a “break” in position while performing testing; that is, in order to perform a test successfully, the evaluator must overcome the muscle force exerted by the patient. This requires more evaluator strength than the “make” maneuver, in which the evaluator simply matches the strength exerted by the patient. However, in normal volunteers, there is < 3 % difference between forces measured in the make versus break technique, suggesting that evaluator muscle strength may be less of a source of variability than originally proposed [28].

Despite the abovementioned considerations, quantitative strength testing using a hand-held dynamometer (HHD) has been implemented in a large number of clinical trials in ALS and SMA in recent years. To address issues of evaluator variability, a rigorous training and evaluation program was implemented, with standard patient and evaluator positions mandated and a prespecified level of test–retest performance required. In ALS, trials of lithium, ceftriaxone, dexpramipexole, and tirasemtiv have been successfully performed, with reliable data acquired that have characteristics suggesting that measurement of muscle strength with HHD is a sensitive measure of disease progression and therapeutic efficacy [29–32]. For lithium, ceftriaxone, and dexpramipexole, HHD data closely matched other outcomes, including ALSFRS-R, slow vital capacity, and survival. However, in a meta-analysis of 3 small, phase II studies of tirasemtiv, strength testing using HHD showed a statistically significant benefit of active treatment with tirasemtiv (Fig. 2) [33–35]. A subsequent large, phase II study of the same agent showed a significant amelioration of rate of progression of muscle weakness with tirasemtiv [32]. A significant benefit was also seen with slow vital capacity, which is actually another measure of maximal muscle strength. However, a signal was not seen in the ALSFRS-R, suggesting that quantitative tests of muscle strength may be more sensitive than a self-report rating scale.

Fig. 2

Quantitative muscle strength measured using hand-held dynamometry as a function of serum concentration of tirasemtiv from a meta-analysis of 3 small, phase II studies. The trend for increased strength as a function of serum concentration is statistically significant (p < 0.002). Reprinted with permission from Shefner et al. [34]. Copyright 2013 Amyotrophic Lateral Sclerosis & Frontotemporal Degeneration

In order to more fully understand the relationships between quantitative strength measurements using HHD and other measures of ALS progression, as well as how different muscle groups change with respect to each other, the placebo groups of 2 recent, large, phase III ALS trials were analyzed [36]. For both data sets, strength in individual muscles or expressed as megascores declined over time. CoV(r) was chosen as the best measure to incorporate both variability in rate of change and variability of measurement; using this measure, knee extension was the most variable measure. CoV(r) for ALSFRS-R suggested slightly less variability than for strength, but measurements of vital capacity were more variable than either of the other 2 measures. Rates of decline of these 3 measures correlated with each other, with Pearson r values ranging from 0.40 to 0.71. Rates of decline in muscle strength were, for the most part, highly correlated from side to side, with correlation coefficients ranging from 0.43 to 0.82 (Fig. 3). While outliers certainly exist in which side-to-side strength is quite different, the data suggest that, despite the view that ALS is a disease of focal onset and progression, by the time many patients participate in clinical trials, the disease has reached a more disseminated stage with muscle groups declining at similar rates.

Fig. 3

Side-to-side correlations between 3 muscle groups (elbow flexion, shoulder flexion, ankle dorsiflexion) in both the ceftriaxone and dexpramipexole clinical trials. Reprinted with permission from Shefner et al. [37]. Copyright 2004 Neurology
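The side-to-side comparison discussed above amounts to a Pearson correlation between left and right rates of decline for a given muscle group. A minimal sketch is shown below; the slope values are invented for illustration and are not taken from either trial.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-patient rates of decline (kg/month) for elbow flexion.
left_slopes = [-0.10, -0.25, -0.40, -0.15, -0.30]
right_slopes = [-0.12, -0.20, -0.35, -0.20, -0.33]

r = pearson_r(left_slopes, right_slopes)
print(f"side-to-side r = {r:.2f}")
```

A value of r near 1 would indicate that the two sides decline at similar rates across patients, whereas patients with markedly asymmetric progression would pull r downward.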

While studies of quantitative strength using HHD have provided important insights into the pattern of disease progression in ALS, and HHD has been used frequently in ALS clinical trials, strength measurement has been questioned as an important clinical trial endpoint for several reasons. First, rate of decline in extremity muscle strength does not strongly correlate with survival [37]. The reason for this poor correlation, however, is clear. Death in ALS is almost always due to respiratory failure, an aspect of loss of strength not captured in extremity measurements. Second, power analyses comparing sample sizes required to show meaningful effects in different outcome measures have suggested that the ALSFRS-R has the potential to show a statistically significant difference with fewer patients per group than other measures, including quantitative muscle strength and vital capacity [36, 38]. These estimates of power are derived primarily from 2 factors: rate of change in the measure over time, and variability across patients. While important factors, neither addresses the question of whether a measure is sensitive to change as a function of a specific therapeutic agent. For example, a recent phase II trial of tirasemtiv, an agent intended to influence muscle strength, showed a robust effect on muscle strength and no effect on ALSFRS-R [32]. Finally, a demonstration of a clinical effect on a functional rating scale provides little insight into what the effect actually is; the ALSFRS-R is a 12-item scale encompassing a range of functions so that an effect on this measure may or may not be clinically meaningful. For all of these reasons, it seems clear that assessment of muscle strength in a disease characterized by progressive weakness should be considered as an important component of clinical trial design.
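The dependence of such power estimates on rate of change and variability can be illustrated with the standard normal-approximation formula for comparing two group means, n per group = 2 (z_alpha + z_beta)^2 (SD / difference)^2. If the treatment effect is expressed as a fraction of the mean rate of decline, SD / difference becomes CoV / fraction. The sketch below assumes this simplified formula, which is not necessarily the method used in the cited analyses, and the CoV values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def patients_per_group(cov, effect_fraction, alpha=0.05, power=0.80):
    """Approximate n per arm to detect a given fractional slowing of the mean rate.

    cov: SD of the rate of change divided by its mean (hypothetical input).
    effect_fraction: treatment effect as a fraction of the mean rate (e.g. 0.25).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * (cov / effect_fraction) ** 2)

# Hypothetical CoV values for two outcome measures, each with a 25 % slowing.
n_low_cov = patients_per_group(cov=0.8, effect_fraction=0.25)
n_high_cov = patients_per_group(cov=1.2, effect_fraction=0.25)
print(n_low_cov, n_high_cov)
```

The calculation takes the effect size as a given input for every measure. It therefore cannot say whether a particular agent actually moves a particular measure, which is the limitation raised in the preceding paragraph.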

Quantitative strength measurement using HHD has also been used in studies of other motor neuron diseases. A small, open-label trial of valproic acid in patients with SMA types III/IV aged 17 years and older showed an increase in strength as measured by HHD [39]. However, a placebo-controlled trial of valproic acid in adults with SMA showed no effect [40]; quantitative strength testing was nonetheless found to be reliable and reproducible, suggesting that it would be a good measure in future trials. Similarly, a crossover study of growth hormone enrolled patients with type II/III SMA aged 6 to 36 years; HHD was again found to be reliable and reproducible, although the results suggested no effect [41].

Children as young as 5 years were enrolled in a natural history study of quantitative strength testing in SMA prior to the onset of a clinical trial [42]. Intrarater reliability was higher for upper than lower limbs but was quite good for both (intraclass correlation of 0.92–0.98 for upper limb muscles and > 0.85 for all lower limb muscles except ankle dorsiflexion). Interrater reliability was also very good, with an intraclass correlation of > 0.91 for all muscles.

Despite the clear advantages of quantitative muscle testing using HHD over MMT strength grading, concern is still expressed over the possibility that, for very strong muscles, the strength of the patient may be greater than that of the evaluator, such that the evaluator's strength is being measured rather than the patient's. To address this issue, Andres et al. [43] have recently developed a modification of the TQNE system, in which patients are seated in a chair and exert effort against an immobile strain gauge rather than an evaluator. Called ATLIS, the apparatus measures 6 muscle groups bilaterally (grip, elbow and knee flexion and extension, and ankle dorsiflexion). In a study of 432 normal subjects, ATLIS was highly reproducible, and evaluations could be rapidly obtained without the multiple different subject positions required for TQNE. Regression equations were established for males and females that described changes in muscle strength with age for the 6 muscle groups tested. Such datasets will be extremely valuable to scale values obtained from diseased patients over a wide range of neuromuscular disorders. Whether the restricted group of muscles reduces the overall quality of a combined muscle measure is yet to be determined.
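One way such normative regression equations might be used to scale patient data is to express a measured force as a percentage of the value predicted for a healthy person of the same age and sex. The sketch below assumes a simple linear dependence on age, and every coefficient in it is hypothetical; neither the form of the equations nor the numbers are taken from the ATLIS study.

```python
# Hypothetical coefficients: predicted force (kg) = intercept + slope_per_year * age.
NORMATIVE = {
    ("elbow_flexion", "female"): (22.0, -0.10),
    ("elbow_flexion", "male"): (38.0, -0.15),
}

def percent_predicted(muscle, sex, age_years, force_kg):
    """Measured force as a percentage of the age- and sex-predicted normal value."""
    intercept, slope_per_year = NORMATIVE[(muscle, sex)]
    predicted = intercept + slope_per_year * age_years
    return 100.0 * force_kg / predicted

# Hypothetical patient: 60-year-old man generating 14.5 kg of elbow flexion force.
pct = percent_predicted("elbow_flexion", "male", 60, 14.5)
print(f"{pct:.0f} % of predicted")
```

Expressing strength this way would place different muscles, ages, and sexes on a common scale, in a manner analogous to the z-score transform used for TQNE megascores.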

In summary, strength measures have been used over many years to assess therapeutic benefit in clinical trials for motor neuron diseases. Quantitative measurements have clear advantages over qualitative muscle testing, and a range of techniques are now being incorporated into clinical trials. The use of such measurements has led to increased understanding of common patterns of disease progression in ALS; quantitative strength testing should be incorporated into any trial evaluating an agent intended either to slow motor dysfunction or to cause improvement. Available tools are not perfect; for example, it is currently not possible to distinguish weakness caused by lesions at different sites in the neuraxis, from muscle to motor cortex. The ability to determine objectively the source of weakness would be of great value and should be a subject of future research.