1 Introduction

A large number of runners at every level of expertise incur injuries that may be linked to several different factors, such as age, body mass index, distance, experience, previous running injuries, incomplete healing, and faulty biomechanics. Of the different types of injuries among runners, back and lower extremity injury rates are high [1, 2]. Pre-participation screening tools, such as the Functional Movement Screen and Star Excursion Balance Test, can be used to identify current injury in many runners [3]. However, rather than screening only for current injuries, a gait screening tool that could predict whether an individual may incur a chronic running injury in the future would substantially reduce the financial, health, and psychological burden.

Analyses of human gait have been performed by health professionals using numerous techniques in diverse fields of healthcare for various reasons. In sports medicine, gait analysis is often performed on athletes to optimize their movements for energy use and to avoid dangerous patterns that may cause injuries both while conditioning and performing. Advances in motion capture and analysis technology have allowed for extremely accurate computerized 3-dimensional (3-D) human body models with six degrees of freedom to be built [4]. These models enable users to extract kinematic and kinetic data for analysis of the human gait. Although these types of data are useful for healthcare professionals observing an individual’s gait, the analysis of the data is still left in the hands of the healthcare provider. This type of analysis takes time, is prone to errors, and can be subjective because there is no set standard for a “normal” gait. Thus, a completely objective method that can predict whether an individual would incur a chronic running injury in the future based on the patterns found in kinematic and kinetic gait data collected from a motion capture system is necessary.

The aims of the present study were to determine the most important kinematic and kinetic parameters extracted during a running gait using a motion capture system and to develop a model that incorporated these parameters to predict a future running injury. The model proposed in the present study was developed in four steps: (1) extract the parameters, (2) select the significant parameters using a two-step selection method, (3) build three prediction models (logistic regression, random forest, and boosting) to predict whether the individual would incur chronic running injuries in the future based on their current running gait patterns, and (4) select the appropriate model (Fig. 1). The results of the present study will enable clinicians to integrate suggested interventions into an individual’s running program to prevent immediate injury and decrease the probability of future running injuries.

Fig. 1 Steps used to develop the proposed model

Section 2 of this article describes the methods used to develop the model, including participant recruitment, data collection, and data analysis. Section 3 provides the results and evaluates the performance of the model, and Section 4 offers a discussion of the model, including the strengths and limitations of the present study and future directions.

2 Methods

2.1 Definition of injury

Various definitions of injury have been used throughout the literature. In the present study, a running-related injury in recreational runners was defined as stated by Yamato et al.: “running-related (training or competition) musculoskeletal pain in the lower limbs that causes a restriction on or stoppage of running (distance, speed, duration, or training) for at least 7 days or 3 consecutive scheduled training sessions, or that requires the runner to consult a physician or other health professional” [5].

2.2 Instruments and procedures

The present study was approved by the Southern Illinois University, Edwardsville (SIUE) Institutional Review Board. Participants received verbal explanations about the purpose of the study and the methods before providing informed consent. All participants then signed informed consent forms.

The data for this research were collected in the Motion Capture and Analysis Laboratory on the SIUE campus. The laboratory houses a Vicon optical motion capture system (Oxford Metrics, Oxford, UK) and all the necessary software tools to extract kinematic and kinetic data from motion capture trials. The Vicon Vantage V5 standard camera is a 5-megapixel camera, and images were captured at 420 frames/s. The Vicon Vue video camera enabled the capture of a reference video. The Vicon Nexus 2.6 software package (Oxford Metrics, Oxford, UK) was used for system calibration, data capture, post-processing and analysis, and data export. Tracker 3.0 software (Oxford Metrics, Oxford, UK) allowed the integration of the data into a 3-D visualization application, Visual 3-D V5 Professional software (C-Motion, Germantown, MD, USA).

Data were collected from each participant in a systematic manner to ensure the most accurate representation of each person’s running gait cycle. First, each participant was asked to provide a static trial in which he or she stood in the middle of the laboratory with the arms in front of the body for a brief period. This trial was used to ensure that all the markers were in the correct locations and were visible to the cameras. It was also used later, during data analysis, to build a model based on specific anthropometric measurements, including mass, height, leg length, knee width, ankle width, shoulder offset, elbow width, wrist width, and hand thickness. After the static trial, each participant provided five dynamic trials, each consisting of the participant jogging on a treadmill (ProForm Performance 300i) for 45 s. Running velocity affects lower extremity kinematics [6]. Therefore, for an injured runner, the treadmill speed was matched to a speed similar to that at which the runner experienced symptoms. For an injury-free runner, the treadmill speed was set to match the running velocity of a “long run,” because if this group of runners demonstrated abnormal biomechanics during the longer runs, the faults would accumulate over the longer exercise period and might contribute to running injuries [7]. After these dynamic trials, Nexus, version 2.6, software was used to rebuild each trial; that is, the software recreated the set of markers on the monitor and played back the entire trial. The markers were then labeled with respect to their specific location.

2.3 Parameter extraction and significant parameter selection

After the continuous data were computed, a method in the software called “event creation” was used. This method allowed a user to create events in a trial based on specific parameters within the trial. Because force plates were not used in this study, the gait events were created manually by using a method commonly referred to as a coordinate-based algorithm [8]. The first step in this process was to transform the markers used for the heel and the toe on each foot into the coordinate system of the pelvis to create a parameter that computed the distance from the pelvis to the heel and toe. Once those parameters were computed, they were used to define a maximum and minimum distance of the markers with respect to the pelvis segment. Thus, the heel strike event of the gait cycle was defined as the maximum distance the heel marker translated forward (anteroposterior), and the toe-off event of the gait cycle was defined as the minimum distance of translation (anteroposterior) for the toe marker. Figure 2 depicts the coordinate system used in this study. The three key phases of running have been previously described [7] as (1) the end of the terminal swing, identified as when the foot remains elevated from the treadmill, just before initial contact, (2) initial contact, identified as when the foot hits the ground, and (3) the loading response, identified as when the runner’s weight is being transferred onto the lead leg and is characterized by the presence of shoe deformation.
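
The coordinate-based event detection described above can be illustrated with a minimal sketch. It assumes that the transformation into the pelvis coordinate system can be approximated by subtracting the pelvis anteroposterior position from the heel and toe marker positions (a full implementation would also apply the pelvis rotation), and all function names and the synthetic signals are hypothetical rather than part of the Vicon or Visual 3-D software.

```python
import math


def local_extrema(signal, kind):
    """Return the frame indices of local maxima (kind='max') or minima (kind='min')."""
    indices = []
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if kind == "max" and cur > prev and cur >= nxt:
            indices.append(i)
        elif kind == "min" and cur < prev and cur <= nxt:
            indices.append(i)
    return indices


def detect_gait_events(heel_ap, toe_ap, pelvis_ap):
    """Heel strike: maxima of heel position relative to the pelvis.
    Toe-off: minima of toe position relative to the pelvis.
    Inputs are anteroposterior coordinates, one value per frame (assumed layout)."""
    heel_rel = [h - p for h, p in zip(heel_ap, pelvis_ap)]
    toe_rel = [t - p for t, p in zip(toe_ap, pelvis_ap)]
    return local_extrema(heel_rel, "max"), local_extrema(toe_rel, "min")


# Synthetic example: a 20-frame gait cycle repeated three times.
frames = range(60)
pelvis_ap = [0.5 for _ in frames]
heel_ap = [0.5 + 0.3 * math.sin(2 * math.pi * t / 20) for t in frames]
toe_ap = [0.7 + 0.3 * math.sin(2 * math.pi * t / 20) for t in frames]

heel_strikes, toe_offs = detect_gait_events(heel_ap, toe_ap, pelvis_ap)
print("heel strike frames:", heel_strikes)
print("toe-off frames:", toe_offs)
```

Real marker trajectories are noisy, so in practice the signals would probably need low-pass filtering or a minimum spacing between events before the extrema are taken.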

Fig. 2 Coordinate system

After the events of the gait cycle were determined, it was then possible to extract data values at or between gait events. Kinematic patterns are very similar for treadmill running and over-ground running [9]. All the necessary kinematic gait parameters, that is, the absolute and relative angles at these events, were extracted from the pelvis, hip, knee, and ankle bilaterally based on the position of the reflective markers. The parameters were then transformed into a discrete form. In addition, because there were many gait cycles in each trial (36 ± 4.3 cycles), the means and standard deviations of those parameters were computed from each trial for the left and right sides of the pelvis, hip, knee, and ankle.

Assessment of ground reaction forces and moments is an important stage in the biomechanical analysis procedure [10]. Conventionally, these measures are recorded during running using a treadmill that has force plates as an integral part of it, which is very costly. As an alternative, kinetic parameters during the running gait can be estimated with inverse dynamics if anthropometric and kinematic data are known. Because a treadmill with integrated force plates was not used in the present study, ground reaction forces and moments data were estimated using a previously proposed accurate method [11]. To predict ground reaction forces without a force plate, the traditional method of Newtonian mechanics was used for the single support phase. An artificial neural network model was applied for the double support phase to solve statically indeterminate structure problems.
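
As a rough illustration of the Newtonian step for the single support phase, the vertical ground reaction force can be written as the sum over body segments of segment mass times the sum of its vertical centre-of-mass acceleration and gravitational acceleration. The sketch below shows only this force balance; it is not the method of [11], the neural network used for the double support phase is not covered, and the segment masses and accelerations are hypothetical values.

```python
G = 9.81  # gravitational acceleration in m/s^2


def vertical_grf(segment_masses, segment_vertical_accels):
    """Vertical ground reaction force in newtons during single support.

    Newton's second law for the whole body: F_grf = sum_i m_i * (a_i + g),
    where a_i is the vertical centre-of-mass acceleration of segment i
    (positive upward) and the ground is the only external contact.
    """
    return sum(m * (a + G) for m, a in zip(segment_masses, segment_vertical_accels))


# Hypothetical three-segment body (kg) and accelerations (m/s^2).
masses = [50.0, 10.0, 10.0]
standing = vertical_grf(masses, [0.0, 0.0, 0.0])   # equals body weight
pushing = vertical_grf(masses, [2.0, -1.0, 0.5])   # upward acceleration raises the force
print(round(standing, 1), round(pushing, 1))
```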

The proposed model computed kinetic and kinematic metrics using six degrees of freedom, thus allowing the degree of rotation about every axis to be computed for the pelvis, hip, knee, and ankle. However, not all the metrics concerning these axes might be of importance for the prediction model. Therefore, after kinematic and kinetic parameter extraction, a two-step selection of a significant parameter subset was performed. In the first step, irrelevant or redundant features were removed using the independent significance parameter selection method as previously described [12]. This step eliminated parameters with a significance value lower than 2 as calculated using the following equation:

$$significance_{i} = \frac{\left| \operatorname{mean}\left( Injured_{i} \right) - \operatorname{mean}\left( Normal_{i} \right) \right|}{\sqrt{\frac{\operatorname{var}\left( Injured_{i} \right) + \operatorname{var}\left( Normal_{i} \right)}{n_{1} + n_{2}}}}, \tag{1}$$

where \({{Injured}}_{i}\) indicates the ith parameter measured from the injury data set, \({{Normal}}_{i}\) represents the ith parameter measured from the normal data set, and \(n_{1}\) and \(n_{2}\) are the corresponding numbers of injured and uninjured participants, respectively. In the second step, a subset of parameters was selected for the prediction models using the sequential forward selection method, evaluated by fivefold cross-validation. In this fivefold cross-validation, the data were divided into five subsets of equal size. Each subset was held out in turn to test a model fitted on the remaining four subsets, and the mean squared error averaged over the folds served as the criterion to be minimized. Parameters were added one at a time until the addition of one more parameter did not decrease the criterion any further. After this step, the 10 most significant parameters were identified to build the prediction models.
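
The first selection step can be sketched directly from Eq. (1). The example below assumes that the variance is the sample variance (the equation does not say whether the sample or population form is meant) and that the value 2 is a cut-off below which a parameter is discarded; the parameter names and data values are hypothetical.

```python
import math
import statistics


def significance(injured, normal):
    """Significance of one parameter following Eq. (1)."""
    n1, n2 = len(injured), len(normal)
    mean_diff = abs(statistics.mean(injured) - statistics.mean(normal))
    # Assumption: sample variance (n - 1 denominator) for both groups.
    pooled = (statistics.variance(injured) + statistics.variance(normal)) / (n1 + n2)
    return mean_diff / math.sqrt(pooled)


def filter_parameters(injured_by_param, normal_by_param, threshold=2.0):
    """Keep the parameters whose significance reaches the threshold."""
    return [
        name
        for name in injured_by_param
        if significance(injured_by_param[name], normal_by_param[name]) >= threshold
    ]


# Hypothetical values for two parameters in each group.
injured = {"knee_moment_y": [10.0, 12.0, 11.0, 13.0], "hip_angle_x": [1.0, 2.0, 3.0, 4.0]}
normal = {"knee_moment_y": [5.0, 6.0, 7.0, 6.0], "hip_angle_x": [1.5, 2.5, 3.5, 2.5]}

print(filter_parameters(injured, normal))
```

In this toy data set the first parameter separates the groups clearly and is kept, whereas the second has identical group means and is removed. A parameter with zero variance in both groups would cause a division by zero and would need separate handling.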

2.4 Building prediction models

The first predictive model built was a logistic regression model [13] using the selected significant parameters. Bootstrapping methods, using 1000 iterations, were used to resample the full parameter set to reduce bias incurred by uneven numbers of observations originating from each participant, and to enable model development and validation using the same data set. For each iteration, the area under the receiver operating characteristic (ROC) curve (\(A_{UC}\)) was calculated using the selected significant parameters and models of interest; the mean \(A_{UC}\) and bootstrap 95% confidence intervals (95% CIs) were reported; differences between \(A_{UC}\) values were calculated, and the bootstrap \(p\) value was determined.
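
The bootstrap estimate of the mean \(A_{UC}\) and its 95% CI can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the study: the \(A_{UC}\) is computed with the rank-based (Mann-Whitney) formulation, the CI uses simple percentiles of the bootstrap distribution, resamples containing only one class are redrawn, and the scores and labels are hypothetical model outputs.

```python
import random


def auc(scores, labels):
    """AUC as the probability that an injured case (label 1) scores higher
    than an injury-free case (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc(scores, labels, n_iter=1000, seed=0):
    """Return (mean AUC, lower 95% bound, upper 95% bound) over n_iter resamples."""
    rng = random.Random(seed)
    n = len(scores)
    values = []
    while len(values) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes must be present
            values.append(auc([scores[i] for i in idx], ys))
    values.sort()
    mean = sum(values) / n_iter
    lower = values[int(0.025 * n_iter)]
    upper = values[int(0.975 * n_iter) - 1]
    return mean, lower, upper


# Hypothetical predicted probabilities and true labels.
scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.5]
labels = [1, 1, 1, 0, 0, 0]
mean_auc, ci_low, ci_high = bootstrap_auc(scores, labels)
print(round(auc(scores, labels), 4), round(mean_auc, 4), ci_low, ci_high)
```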

Two machine learning methods were also used, random forest [14] and boosting [15], to build the prediction models. Then, all three prediction models were compared, and the best predictive model was selected as a prediction model for running injuries.

A random forest is a tree-based learning model. The basic premise of the random forest is that a number of decision trees are built, and each individual tree is used to make a prediction. This process starts with bootstrapping. Bootstrapping is the process of taking multiple samples from a single set of training data. After bootstrapping, a method called bootstrap aggregation (bagging) is used to lower the variance within each individual tree that is built. Bagging works on the principle of averaging a set of observations to reduce variance. The prediction can be calculated as \(\hat{f}^{1} \left( x \right),\hat{f}^{2} \left( x \right), \ldots \hat{f}^{R} \left( x \right),\) with \(R\) number of training sets, and the predictions averaged using \(\hat{f}_{{avg}} \left( x \right) = \frac{1}{R}\sum\nolimits_{r = 1}^{R} {\hat{f}^{r} \left( x \right)}\). Each tree is then modelled on the rth bootstrapped training set to obtain \(\hat{f}^{*r} \left( x \right)\). The average of the predictions is obtained using \(\hat{f}_{{bag}} \left( x \right) = \frac{1}{R}\sum\nolimits_{r = 1}^{R} {\hat{f}^{*r} \left( x \right)}\). The random forest model is able to decorrelate the trees by considering only a subset of predictors in the data for each split. In other words, not every predicting feature is considered in each split. This ensures that a new sample of predictors is chosen at each split, thus decorrelating the predictions by ensuring that each tree is constructed differently. The number of \(m\) predictors considered for each split in this model was \(m = \sqrt p\), where p is the total number of predictors in the data set. After the training process was completed for the model, the final prediction was made on a majority vote basis. That is, each decision tree built made a prediction, and the most frequent prediction that was made was the final prediction of the model.
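
The three ingredients described above (bootstrap resampling, a random subset of \(m = \sqrt p\) predictors at each split, and aggregation of the tree outputs) can be sketched without fitting actual trees. The helper names are hypothetical, and rounding \(\sqrt p\) down to an integer is an assumption, because the text does not state how a non-integer value was handled.

```python
import math
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrapped training set)."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def candidate_features(p, rng):
    """Pick m = floor(sqrt(p)) distinct predictor indices to consider at a split."""
    m = max(1, int(math.sqrt(p)))
    return sorted(rng.sample(range(p), m))


def bagged_average(tree_outputs):
    """Average of the R tree outputs, as in the bagging formula."""
    return sum(tree_outputs) / len(tree_outputs)


def majority_vote(tree_predictions):
    """Most frequent class label among the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
print(bootstrap_sample(["a", "b", "c", "d"], rng))  # one resampled training set
print(candidate_features(10, rng))                  # 3 of the 10 selected parameters
print(bagged_average([0.2, 0.4, 0.9]))              # averaged tree outputs
print(majority_vote([1, 0, 1, 1, 0]))               # final class by majority vote
```

With the 10 selected parameters, this rounding gives three candidate predictors per split. Using an odd number of trees avoids tied votes in a two-class problem.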

Boosting is an iterative process that focuses on misclassified data: each tree is fitted to a weighted version of the data points, and the weights are calculated from the errors of the previous model in the iterative process. The random forest and boosting methods were validated in the present study using tenfold cross-validation repeated 100 times.
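
The reweighting idea can be made concrete with one round of an AdaBoost-style update. The text does not state which boosting algorithm was used, so AdaBoost is an assumption made here purely for illustration; the labels and predictions are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for binary labels.

    Returns (alpha, new_weights), where alpha is the weight of the current
    learner and new_weights are normalized to sum to 1.
    Requires a weighted error strictly between 0 and 1.
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified points are up-weighted, correctly classified ones down-weighted.
    raw = [
        w * math.exp(alpha if t != p else -alpha)
        for w, t, p in zip(weights, y_true, y_pred)
    ]
    norm = sum(raw)
    return alpha, [r / norm for r in raw]


# Four equally weighted points; the current tree misclassifies only the last one.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 0, 1, 0], [1, 0, 1, 1])
print(round(alpha, 4), [round(w, 4) for w in new_weights])
```

After the update, the single misclassified point carries half of the total weight, so the next tree is pushed to classify it correctly.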

3 Results

In total, 36 individuals participated in the present study: 14 participants (age, 21.2 ± 3.4 years; body mass, 71.9 ± 3.8 kg; and height, 185.4 ± 11.4 cm) had a previous injury or were currently experiencing a chronic injury from running, such as shin splints, knee pain, or lower back pain, and 22 participants (age, 23.6 ± 4.1 years; body mass, 75.9 ± 4.6 kg; and height, 192.7 ± 21.0 cm) were injury-free at the time of data collection. All participants were active recreational runners who completed a minimum of three training sessions of no less than 30 min each per week. Each participant provided five trials of running gait data. This sample size provided more than 80% statistical power.

Tables 1 and 2 give the kinematic and estimated kinetic parameters extracted from injury-free and injured participants.

Table 1 Kinematic parameters
Table 2 Estimated kinetic parameters

Figure 3 shows the relative and absolute angles of the pelvis, hip, knee, and ankle. The x-axis of all panels in Fig. 3 represents the gait cycle, which is the time frame between two consecutive heel strikes. A cycle consists of two main phases: the stance (support) and swing phases. The stance phase starts with the heel strike and ends with the toe-off. The swing phase starts with the toe-off and ends with the heel strike. This means that on the x-axis, from 0 to 50 represents the stance phase of the gait cycle, and from 50 to 100 represents the swing phase. Each colored line on the graph represents the data for a cycle, and the thicker black line represents the mean values for all the cycles from the trials.

Fig. 3 Descriptive kinematic parameters from left and right sides extracted using the motion capture system

The two-step significant parameter selection method was then implemented, and the selected parameters were used to build the prediction models. The significant features were angular velocities and moments along the y-axis and the z-axis. These parameters are important because they refer to the abduction or adduction (anteroposterior) of the joints, the internal rotation (vertical) of the joints, and the rate at which these rotations occur during the gait cycle. Angular velocity was calculated as the derivative of the joint angles with respect to time. Another important parameter used in the analysis was the internal reaction force at the knee and ankle joints. These forces were computed as a 3-D force vector. To avoid focusing on the reaction along one axis, the resultant force was computed and used in the prediction model. These parameters were first computed as continuous values for the entire gait cycle, and the data were normalized. Box 1 shows the most significant parameters used to build the prediction model.
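
The two derived quantities mentioned above can be sketched numerically. The example assumes that the time derivative is approximated by central finite differences (one-sided at the first and last frames) and that the resultant force is the Euclidean norm of the three force components; the function names and sample values are hypothetical.

```python
import math


def angular_velocity(angles_deg, frame_rate_hz):
    """Approximate d(angle)/dt in degrees per second from per-frame joint angles."""
    dt = 1.0 / frame_rate_hz
    n = len(angles_deg)
    velocity = []
    for i in range(n):
        if i == 0:  # forward difference at the first frame
            velocity.append((angles_deg[1] - angles_deg[0]) / dt)
        elif i == n - 1:  # backward difference at the last frame
            velocity.append((angles_deg[-1] - angles_deg[-2]) / dt)
        else:  # central difference elsewhere
            velocity.append((angles_deg[i + 1] - angles_deg[i - 1]) / (2.0 * dt))
    return velocity


def resultant_force(fx, fy, fz):
    """Magnitude of a 3-D joint reaction force vector."""
    return math.sqrt(fx * fx + fy * fy + fz * fz)


# A joint angle rising by 1 degree per frame, sampled at 100 frames/s.
print(angular_velocity([0.0, 1.0, 2.0, 3.0], 100.0))
print(resultant_force(3.0, 4.0, 12.0))
```

For the data described here, the frame rate would be the 420 frames/s reported for the cameras; the value of 100 is used only to keep the arithmetic easy to check.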

Box 1 The most significant features selected by the two-step technique

After selecting the best descriptive set of parameters, the predictive models were built using a logistic regression model, a random forest, and boosting. Mean areas under each ROC curve (\(A_{UC}\)) were calculated using bootstrapping techniques for data resampling and were used during model development and performance comparisons. The \(A_{UC}\)-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and \(A_{UC}\) represents the degree or measure of separability, indicating how well each model can distinguish between classes. The higher the \(A_{UC}\), the better the model is at predicting zeros as zeros and ones as ones. In the present context, the higher the \(A_{UC}\), the better the model is at distinguishing between participants with injury and those with no injury.
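
How an ROC curve arises from sweeping the classification threshold can be shown with a short sketch. Each distinct score is used as a threshold, the false-positive and true-positive rates are recorded, and the area is obtained with the trapezoidal rule. The scores and labels are hypothetical, and the function names are illustrative only.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs, starting at (0, 0),
    obtained by lowering the decision threshold through every distinct score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the ROC points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical predicted probabilities (1 = injured, 0 = injury-free).
scores = [0.9, 0.4, 0.7, 0.6, 0.2, 0.5]
labels = [1, 1, 1, 0, 0, 0]
curve = roc_points(scores, labels)
print(curve)
print(round(trapezoid_auc(curve), 4))
```

In this example, seven of the nine injured versus injury-free pairs are ranked correctly, which corresponds to an area of 7/9.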

The random forest model was built using the train-test split utility of the scikit-learn package in the Python, version 3.7, programming language. This utility randomly splits the data into training and testing sets (70% was used as training data and 30% as testing data). After the model was trained and tested, the accuracy score, confusion matrix, and variable importance were calculated to further understand the model that was built.
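
The split and the two evaluation measures can be sketched in plain Python to show what they compute. This is a dependency-free illustration and not the code used in the study; in scikit-learn the corresponding utilities are `train_test_split`, `accuracy_score`, and `confusion_matrix`. The function names, seed, and example labels below are hypothetical.

```python
import random


def split_70_30(rows, labels, seed=0):
    """Randomly assign 70% of the samples to training and 30% to testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(0.7 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return (
        [rows[i] for i in train],
        [labels[i] for i in train],
        [rows[i] for i in test],
        [labels[i] for i in test],
    )


def confusion_matrix(y_true, y_pred):
    """2 x 2 matrix: rows are true classes (0, 1), columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Ten hypothetical samples split 7 / 3, then hypothetical test predictions.
rows = [[float(i)] for i in range(10)]
labels = [0, 1] * 5
x_train, y_train, x_test, y_test = split_70_30(rows, labels)
print(len(x_train), len(x_test))

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
print(confusion_matrix(y_true, y_pred), accuracy(y_true, y_pred))
```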

Table 3 and Fig. 4 show the \(A_{UC}\)-ROC results for the three prediction models built to differentiate participants by their condition using three parameter sets: baseline data, parameters after the first parameter selection step, and parameters after the second parameter selection step.

Table 3 Accuracy comparisons of three different prediction models using different sets of parameters
Fig. 4 Receiver operating characteristic (ROC) curves for logistic regression, random forest, and boosting prediction models

Without parameter selection steps, the \(A_{UC}\) was 0.6339 (95% CI 0.5905–0.6598) using logistic regression, 0.6428 (95% CI 0.6212–0.6646) using a random forest, and 0.7168 (95% CI 0.7001–0.7443) using boosting (Fig. 4, left panel). The boosting-based classification model predicted better than the other two models, and the random forest was the second-best predictor. However, the prediction percentages were very low. After the first parameter selection step, the \(A_{UC}\) was 0.7440 (95% CI 0.7254–0.7786) using logistic regression, 0.7555 (95% CI 0.7412–0.8093) using a random forest, and 0.8677 (95% CI 0.8087–0.9104) using boosting (Fig. 4, middle panel). The boosting-based classification model again predicted better than the other two models, and the random forest was again the second-best predictor. However, the prediction percentages were still very low. After the second parameter selection step, the prediction results were markedly improved, with an \(A_{UC}\) of 0.9016 (95% CI 0.8808–0.9369) using logistic regression, 0.8892 (95% CI 0.8463–0.9152) using a random forest, and 0.8732 (95% CI 0.8401–0.9178) using boosting (Fig. 4, right panel). The logistic regression-based classification model was the best predictor, followed by the random forest and boosting models.

4 Discussion

In the present study, kinetic and kinematic parameters extracted using an optical motion capture system were combined to build a prediction model that accurately predicted whether a participant was injured. Of the three prediction models developed, the logistic regression model had the highest prediction accuracy. This model will be improved in future studies to create a model that accurately predicts the probability that a person will incur a chronic running injury based on their current running gait. To the best of our knowledge, no previous study has integrated a range of kinematic and kinetic parameters to predict future running injury using an optical motion capture system.

The present study has a few limitations that should be considered when interpreting the results. Because a force-sensing tandem treadmill was not available in our laboratory, kinetic parameters were estimated using a model, which might introduce error. For more accurate analyses, this type of treadmill should be integrated into the study. In addition, it was assumed that ground reaction forces (GRFs) and ground reaction moments (GRMs) were the only external forces applied to the individuals during running. However, secondary external forces might also be introduced during the activity. Finally, the proposed model depended on a set of 21 reflective markers. Future studies should investigate reducing the number of markers to make the prediction model more practical for clinical applications.

With additional data capture and analysis, more accurate and efficient prediction models can be constructed and applied. This type of predictive analysis pertains to more than just running gait analysis for sports medicine or physical therapy applications. The human gait holds much information about the neurological state of a person. Gait and the ability to balance have been studied in many cases for various neurological disorders, including Parkinson’s disease, autism, and multiple sclerosis. These are examples of neurological disorders in which gait deficiencies have been defined as symptoms and in some cases have been shown to be predictors. With the use of motion capture, either through a worn device or an optical motion capture system, extremely accurate data are obtained. Such data hold information that cannot be seen by the human eye and can be used to develop predictive models with the ability to recognize patterns involved with neurological disorders.

5 Conclusions

To summarize, running injuries are a common problem for runners. The purpose of the present study was to use a motion capture system to determine the most important running gait parameters and to develop a prediction model that incorporates these parameters to predict a likely future chronic running injury. The proposed model was developed using parameter extraction and two steps to select significant parameters to build three prediction models (logistic regression, random forest, and boosting). The proposed model showed potential for accurately predicting future injury by analyzing the running gait patterns of runners.