1 Introduction

1.1 Traditional kinematic analysis

Pose estimation has gained considerable traction in recent years, driven primarily by advances in computer vision. The technology estimates and analyzes human poses from images, including the identification of key joints and their positions. Kinematic analysis is one of its prominent use cases, because it depends on accurately tracking human movements and joint positions. This capability has proven valuable in various fields, including biomechanics, sports science, physical therapy, and robotics. Traditionally, kinematic analysis has been performed with marker-based motion capture systems (e.g. VICON (VICON Motion Systems Ltd., Oxford, UK) and OptiTrack (OptiTrack, Corvallis, USA)), inertial measurement unit-based systems (e.g. Xsens (Xsens, Enschede, the Netherlands) and Rokoko (Rokoko, Copenhagen, Denmark)), or both. However, marker-based motion capture is costly [1, 2], dependent on a laboratory environment [3, 4], and complicated to use [5]. Inertial measurement unit-based motion capture can be used outside the laboratory, but it is prone to drift, sensitive to system calibration [6], and unable to relate its measurements to world coordinates by itself [7].

1.2 Advantage of human pose estimation

Compared to traditional motion capture systems, pose estimation is freely available as open-source code, fast at processing data, and portable, since users capture motions with cameras and process the images or videos to obtain the positions of human joints. There are several well-known pose estimation models, such as OpenPose (CMU, Pittsburgh, USA), ARKit (Apple Inc., Cupertino, USA), and TensorFlow Pose Estimate (Google, Mountain View, USA). Among these, OpenPose is well known and often used for applications in the sports and exercise science domain [8,9,10,11].

1.3 The accuracy of marker-less motion analysis

Zago et al. [8] conducted a case study to evaluate the accuracy of human pose estimation-based gait analysis. They found mean errors between human pose estimation and a marker-based motion capture system of 20 mm, 0.03 s, 0.03 s, and 1.23 cm in 3D tracking trajectories, stance-phase duration, swing-phase duration, and step length, respectively. D’Antonio et al. [9] conducted gait analysis with human pose estimation and found that, although gait trajectories were accurately tracked, human pose estimation under- and over-estimated the minimum and maximum joint angles by up to 9.9 degrees. The inaccuracy is probably due to camera angles and locations: Zago et al. [8] investigated the tracking accuracy of human pose estimation in different settings and found that the result was optimal when the camera was 1.8 m away from the participant and perpendicular to the gait direction, whereas D’Antonio et al. [9] placed two cameras one meter away from a treadmill (one diagonally to the left and the other diagonally to the right) to capture gait. Ota et al. [10] conducted a reliability and validity study of human pose estimation in a squat motion and found the kinematic measurement to be reliable and valid: the intra-class correlation coefficients for human pose estimation measurements were 0.92–0.96, and those between human pose estimation and marker-based motion capture measurements were above 0.6. Nakano et al. [11] captured more rapid and complicated motions, including counter-movement jump and ball throwing, in addition to walking. They found that, although some joint positions were tracked with errors of more than 40 mm compared to a reference measurement, the error was less than 30 mm in 80% of the time series. Despite this applicability and usability, measurement quality depends on video quality; Zago et al. [8] found that camera settings heavily influenced measurement accuracy.
Compared to the errors of marker-based motion capture systems, which are below 1 cm [7], the studies mentioned above [8, 11] reported larger errors. In addition, at least two cameras are necessary to reconstruct 3D data from multiple 2D data, because each camera captures motion only in 2D.

1.4 Aim

Despite the extensive research on human pose estimation, there is a lack of reliability and validation studies on human pose estimation-based kinematic measurement in sports and athletic movements. Owing to its portability and ease of use in on-field settings, human pose estimation holds potential for capturing athletes' movements without disrupting their concentration or limiting their range of motion. This capability empowers researchers to analyze real athletic movements, paving the way for enhanced athletic performance and injury prevention strategies. Consequently, this study endeavors to validate human pose estimation measurement as a human motion capture system for kinematic analysis of sports and athletic movements.

2 Methods

2.1 Study overview

To evaluate the accuracy of the human pose estimation measurements, joint angles were calculated from 12 estimated key points: the right and left wrist, elbow, shoulder, hip, knee, and ankle joint points. The angles were then compared with the respective VICON (VICON Motion Systems Ltd., Oxford, UK) measurements. In total, eight athletic motions including counter-movement jump, squat, squat jump, standing, spreading arm, 360-degree turn while spreading arm, walk, and jog, and nine sports motions including football inside kick, basketball chest pass / free throw, volleyball receiving / overhead serving, and tennis forehand / backhand / overhead swing, were performed twice by each participant. For the tennis motions, participants were asked to simulate the motions without actual tennis balls. All participants were informed how to execute each movement correctly. The movements were captured by 12 Contemplas ab Baumer VLXT-31C cameras with undistorted lenses and an automatic synchronization system (CONTEMPLAS GmbH., Kempten, Germany), whose videos were processed with OpenPose [12], and by 10 infrared cameras, in order to extract joint positions. The extracted joint positions were further processed to calculate joint angles, which were then compared to evaluate the accuracy of the human pose estimation measurements. Data processing was done using Python 3.10.

2.2 Participants

In total, five male participants (Age (mean ± standard deviation): 30.2 ± 6.6 years old, Height (mean ± standard deviation): 176.2 ± 6.7 cm, Body mass (mean ± standard deviation): 74.2 ± 9.1 kg) participated in this study. All participants were in good physical condition and did not have any orthopedic or neurological impairments. Instruction on movements to be captured was given to all participants before the experiment. During the instruction, the ability of motion execution was checked by sports scientists. The study was conducted according to the ethical guidelines of the Technical University of Munich. All participants were informed about the process of the study upfront and written consent was obtained.

2.3 Data collection

The experiment was conducted in a sports hall with similar dimensions to a volleyball court. The infrared and RGB cameras were strategically positioned around the perimeter of the capturing area, encompassing a full 360-degree view. The capturing environment enclosed a volume of roughly 4 m³, and each camera stood at a height of about 2.5 m. To ensure time alignment between the two camera systems, the time instant at which a falling reflective marker touched the ground was captured.
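The time alignment described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: it assumes both streams are sampled at the same rate (100 Hz in Sections 2.4 and 2.5) and that the frame index of the marker touchdown has already been identified in each stream. The function name `align_by_event` and its arguments are hypothetical.

```python
import numpy as np

def align_by_event(series_a, series_b, event_a, event_b):
    """Trim two equally sampled series so the sync event shares one index.

    event_a and event_b are the (assumed known) frame indices at which the
    falling marker touches the ground in each recording.
    """
    a = np.asarray(series_a)
    b = np.asarray(series_b)
    # Number of frames available before and after the event in both streams.
    pre = min(event_a, event_b)
    post = min(len(a) - event_a, len(b) - event_b)
    return a[event_a - pre:event_a + post], b[event_b - pre:event_b + post]
```

After trimming, frame `i` of one stream corresponds to frame `i` of the other, which is what a frame-by-frame comparison requires.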

2.4 Marker-based motion capture system setup

The VICON software (Nexus 2.8.2, Version 2.0; VICON Motion Systems Ltd., Oxford, UK) was used to configure and post-process the captured data. The sampling frequency was 100 Hz. Reflective markers were placed on the body landmarks according to the Full-Body Plug-in Gait marker placement model provided by VICON Motion Systems Ltd [13]. All infrared cameras were calibrated using an active wand with five LED lights. Static participant calibration was performed in T-pose, and the participant’s anthropometric measurements, including leg length, waist width, shoulder width, elbow width, ankle width, knee width, wrist width, and palm width, were collected beforehand using a measuring tape and a caliper. Estimated marker positions were filtered and fitted according to the anthropometric measurements using built-in VICON software functions. Then, the centers of the left and right wrist, elbow, shoulder, hip, knee, and ankle joints were estimated following the model specifications. Left and right elbow, shoulder, hip, and knee joint angles were calculated using the joint center positions.
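As an illustration of how a joint angle can be derived from joint center positions, the sketch below computes the included angle between the two segments that meet at a joint. This is one common definition and is an assumption here; the text does not state the exact angle convention used, and the function name `joint_angle_deg` is hypothetical.

```python
import numpy as np

def joint_angle_deg(proximal, joint, distal):
    """Included angle (degrees) at `joint` between the segments pointing
    to `proximal` and `distal`, e.g. shoulder-elbow-wrist for the elbow.

    Inputs are 3D points, or arrays of shape (n_frames, 3).
    """
    u = np.asarray(proximal, dtype=float) - np.asarray(joint, dtype=float)
    v = np.asarray(distal, dtype=float) - np.asarray(joint, dtype=float)
    cos_theta = np.sum(u * v, axis=-1) / (
        np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1)
    )
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
```

Under this convention a fully extended elbow gives an angle close to 180 degrees, and a right-angle flexion gives 90 degrees.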

2.5 RGB camera and human pose estimation setup

Each RGB camera was calibrated using a calibration cage with 12 reflective markers at known 3D positions. On each vertical pole, three markers were placed at equal spacing from ground level to a height of 100 cm. The horizontal distance between markers was 100 cm. Figure 1 shows all the camera views with the calibration cage. Given the known 3-dimensional location of each reflective marker, the corresponding 2-dimensional points were manually extracted from each view. Direct Linear Transformation (DLT) [14] was then used to compute a projection matrix, which was refined using a Bundle Adjustment method [15]. Human pose estimation was run on each frame from each RGB camera; the cameras were configured to record 100 frames per second at a resolution of 1920 by 1080 pixels, with undistorted lenses and without audio. Human pose estimation outputs, in JSON format, 25 key points for each detected person together with a confidence rate from 0 to 1 for each key point. Of these, 12 key points were used: the right and left wrist, elbow, shoulder, hip, knee, and ankle joint points. The key points from all camera views were triangulated to reconstruct 3D data [16]. In the triangulation process, the projection to each key point was weighted by its confidence rate [17]. The joint angles corresponding to those from the marker-based motion capture system were calculated from the triangulated key points. Afterwards, they were filtered using a 4th-order Butterworth low-pass filter. The cutoff frequency of the filter was determined using a residual method [18] within a frequency range of 1–20 Hz.
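The confidence-weighted triangulation can be sketched as a weighted linear (DLT-style) least-squares problem. This is a minimal sketch under stated assumptions: each camera contributes two linear equations derived from its 3×4 projection matrix, and each pair of equations is scaled by the key point's confidence rate. Whether the study used exactly this weighting scheme is not stated in the text, and the function name `triangulate_weighted` is hypothetical.

```python
import numpy as np

def triangulate_weighted(projections, points_2d, confidences):
    """Triangulate one 3D point from several views.

    projections: iterable of 3x4 projection matrices (one per camera)
    points_2d:   iterable of (u, v) pixel coordinates of the key point
    confidences: iterable of per-view weights in [0, 1]
    """
    rows = []
    for P, (u, v), w in zip(projections, points_2d, confidences):
        P = np.asarray(P, dtype=float)
        # Each view gives two homogeneous equations A @ X = 0;
        # scaling them by w lets low-confidence views count less.
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    # The least-squares solution is the right singular vector that
    # belongs to the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

A view with a confidence of zero contributes nothing to the solution, which is how an occluded key point with a low confidence rate is down-weighted.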

Fig. 1 Camera views with a calibration cage
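The filtering step in Section 2.5 can be illustrated with SciPy. The sketch below is an assumption-laden example rather than the study's code: it applies the filter forward and backward with `filtfilt` to avoid phase lag (the text does not state whether a zero-phase filter was used), and it implements a simple variant of residual analysis in which a line is fitted to the high-frequency tail of the residual curve and the first cutoff whose residual falls to the line's intercept is selected. The names `lowpass` and `residual_cutoff` and the `tail` parameter are hypothetical.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, cutoff_hz, fs_hz=100.0, order=4):
    """Butterworth low-pass filter applied forward and backward."""
    b, a = butter(order, cutoff_hz, btype="low", fs=fs_hz)
    return filtfilt(b, a, np.asarray(signal, dtype=float))

def residual_cutoff(signal, fs_hz=100.0,
                    candidates=np.arange(1.0, 20.5, 0.5), tail=10):
    """Pick a cutoff frequency in the 1-20 Hz range by residual analysis."""
    signal = np.asarray(signal, dtype=float)
    # RMS difference between raw and filtered signal for each cutoff.
    residuals = np.array([
        np.sqrt(np.mean((signal - lowpass(signal, fc, fs_hz)) ** 2))
        for fc in candidates
    ])
    # Fit a line to the tail, where the residual is assumed to be noise only.
    _, intercept = np.polyfit(candidates[-tail:], residuals[-tail:], 1)
    below = np.nonzero(residuals <= intercept)[0]
    idx = below[0] if below.size else len(candidates) - 1
    return candidates[idx], residuals
```

A joint-angle series sampled at 100 Hz could then be smoothed with `lowpass(angles, residual_cutoff(angles)[0])`.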

2.6 Data analysis

Mean absolute error was used to compare, frame by frame, the corresponding joint angles calculated from the two systems. The first and second trials of each movement were averaged, and the mean and standard deviation were calculated over all participants. In addition, a paired t-test (significance threshold 0.05) and Cohen’s d (small effect: < 0.2; medium effect: ≥ 0.2 and < 0.8; large effect: ≥ 0.8) were applied to the synchronized continuous measurements of the marker-based motion capture system and human pose estimation per joint angle, per movement, and per participant. All data processing and statistical analyses were performed in Python 3.10.
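The three measures can be computed for one joint angle in one trial roughly as follows. This is a sketch only: the text does not specify which variant of Cohen's d was used, so the version based on the average of the two sample variances is an assumption, and the function name `compare_angles` is hypothetical.

```python
import numpy as np
from scipy import stats

def compare_angles(reference_deg, estimate_deg):
    """Frame-by-frame comparison of two synchronized joint-angle series."""
    ref = np.asarray(reference_deg, dtype=float)
    est = np.asarray(estimate_deg, dtype=float)
    mae = np.mean(np.abs(est - ref))
    # Paired t-test on the synchronized frames.
    _, p_value = stats.ttest_rel(est, ref)
    # Cohen's d with a pooled standard deviation (assumed variant).
    pooled_sd = np.sqrt((np.var(ref, ddof=1) + np.var(est, ddof=1)) / 2.0)
    cohen_d = (np.mean(est) - np.mean(ref)) / pooled_sd
    return {"mae": mae, "p": p_value, "d": cohen_d}
```

With long continuous series, a small but consistent difference can yield a significant p value together with a small effect size, which is why both are reported.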

3 Results

3.1 Errors in athletic movements

Results for athletic movements are displayed in Fig. 2. The largest error was found in the right elbow joint angle during jogging (18.8 ± 12.3 degrees), whereas the smallest error was observed in arm spreading (2.5 ± 1.4 degrees). The largest error in each movement type was in an elbow angle, and the elbow joint angles were in general more erroneous than the other joint angles, except in the squat and squat jump movements. Even considering the complexity of the movement, the right and left elbow joint angles in the 360-degree turn while spreading the arm showed errors of 14.8 ± 2.7 and 14.4 ± 3.6 degrees, respectively. Arm spreading showed errors of 17.7 ± 3.7 and 16.6 ± 2.8 degrees in the right and left elbow joint angles, respectively. More interestingly, the elbow joint angle in standing showed errors of 18.5 ± 5.2 and 16.1 ± 7.6 degrees on the right and left sides, respectively. Thus, despite the absence of movement in the standing posture, the elbow angle exhibited greater error than in arm spreading and in the 360-degree turn with arm spreading. Additionally, simultaneous bilateral movements such as the squat, counter-movement jump, and squat jump demonstrated distinct error and standard deviation ranges for the left and right sides. Notably, the left side consistently displayed greater error than the right side. Figure 3 shows the p values of the t-tests for each participant and trial. There is no pattern regarding which movements, joint angles, or both display significant differences. However, the Cohen’s d effect sizes of the counter-movement jump were less than 1 in all trials and joint angles (Fig. 4), although most trials and joint angles in the counter-movement jump were statistically significant. In contrast, standing displayed the highest Cohen’s d, in the elbow joint angle.

Fig. 2 Mean and standard deviation of each joint angle in each athletic movement. CMJ Counter-movement jump. The results were a mean of all participants

Fig. 3 p value of t-test between marker-based motion capture system and human pose estimation measurements in each participant and athletic movement trial. Movement name with 0 is the first trial, and 1 is the second trial. CMJ Counter-movement jump

Fig. 4 Cohen’s d value between marker-based motion capture system and human pose estimation measurements in each participant and athletic movement trial. Movement name with 0 is the first trial, and 1 is the second trial. CMJ Counter-movement jump

3.2 Errors in sports movements

Results for sports movements are shown in Fig. 5. Among the sports movements, the largest error was found in the left elbow joint angle during the tennis backhand swing (18.2 ± 3.6 degrees), and the smallest in the right hip joint angle during the tennis forehand swing (4.3 ± 2.2 degrees). Elbow angles were the most erroneous among all joint angles except in volleyball receiving. Figure 6 shows the p values of the t-tests for each participant and trial. There is no pattern regarding which movements, joint angles, or both display significant differences, and the same holds for the Cohen’s d values (Fig. 7).

Fig. 5 Mean and standard deviation of each joint angle in each sports movement. The results were a mean of all participants

Fig. 6 p value of t-test between marker-based motion capture system and human pose estimation measurements in each participant and sports movement trial. Movement name with 0 is the first trial, and 1 is the second trial

Fig. 7 Cohen’s d value between marker-based motion capture system and human pose estimation measurements in each participant and sports movement trial. Movement name with 0 is the first trial, and 1 is the second trial

3.3 Post hoc analysis

Figure 8 illustrates the right and left elbow joint angles for a participant during standing. Since there was a clear consistent error throughout the trial (offset), adjustments of elbow joint angles based on the offset were applied. Figures 9 and 10 illustrate the errors of elbow joint angles before and after adjusting the offset in athletic and sports motions, respectively.
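A possible form of this offset adjustment is sketched below. The text does not describe the exact procedure, so the sketch makes two assumptions: the offset is taken as the mean difference between the two systems during the standing trial, and it is subtracted from the human pose estimation angles of the other trials. The function names `standing_offset` and `adjust_offset` are hypothetical.

```python
import numpy as np

def standing_offset(reference_deg, estimate_deg):
    """Mean difference (estimate minus reference) over a standing trial."""
    diff = np.asarray(estimate_deg, dtype=float) - np.asarray(reference_deg, dtype=float)
    return float(np.mean(diff))

def adjust_offset(estimate_deg, offset_deg):
    """Remove a constant offset from an estimated joint-angle series."""
    return np.asarray(estimate_deg, dtype=float) - offset_deg
```

The error of a movement trial can then be recomputed on the adjusted series and compared with the unadjusted error.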

Fig. 8 Right and left elbow joint angle in a participant during standing

Fig. 9 Errors of elbow joint angles before and after adjusting offset in athletic movements

Fig. 10 Errors of elbow joint angles before and after adjusting offset in sports movements

4 Discussion

4.1 Errors in general

Several potential factors could have contributed to the observed errors, including occlusion, mis-estimation, and an unsuitable capturing environment. Occlusion is an inherent challenge, as limbs may become obscured by the torso during certain movements, depending on the camera angle. In this study, elbow and wrist joints were often occluded, e.g., behind the trunk in the views of the back cameras during volleyball receiving. Human pose estimation usually assigns a low confidence rate to an occluded key point, and this study used the confidence rate to weight each projection line during triangulation. The error caused by occlusion should therefore have been reduced. Mis-estimation can be reduced by training the pose estimation model and by ensuring a proper capturing environment, which includes lighting, background color, and the removal of extra persons from the frame. This study was conducted in a controlled environment, so lighting and background color were adequate for seeing the person of interest clearly. However, extra persons who controlled the motion capture systems and helped to guide the participant were sometimes in the frame. They may have made it harder for the pose estimation model to identify the correct person and the correct joint locations. Based on the errors found in this study, knee and hip joint angles can be measured by human pose estimation and used in gait analysis and sports performance analysis, for example.

4.2 The error in the elbow joint angle

Among all the joint angles measured, the elbow joint angle exhibited the largest error. Occlusion, caused by the upper limbs frequently being obscured behind the torso, could be a contributing factor. However, even considering occlusion, the error in the elbow joint angle measurements appears to be excessively high. Interestingly, some human pose estimation measurements of elbow joint angles displayed a noticeable offset compared to the corresponding marker-based motion capture system measurements, as Fig. 8 illustrates. In a standing position, the elbow should be straight, implying that the elbow angle should approach 180 degrees. As evident from the graph, the elbow joint angles obtained using the marker-based motion capture system may be underestimated compared to the expected joint angles. This phenomenon could potentially explain the substantial error observed in elbow joint angles. The marker-based motion capture system relies on reflective markers attached to specific anatomical landmarks on tight underwear for its measurements. However, there is inherently an offset between the actual joint center and the marker position. Moreover, the markers themselves can become occluded by the human body. To mitigate these issues and estimate the true joint center accurately, the marker-based motion capture system utilizes anthropometric measurements and sophisticated post-processing techniques. Pose estimation, on the other hand, estimates key points on the human body’s surface, which are not susceptible to occlusion. Theoretically, this should enable pose estimation to provide more accurate joint center estimation than marker-based motion capture systems. In fact, when the offset calculated from the elbow joint angles in the standing position was adjusted, the error of the elbow joint angles decreased in most of the motions (Figs. 9 and 10).
Statistically, the right elbow joint angle in standing in participant 3 showed the highest effect size, but the effect sizes in general differed between participants, trials, and movement types. Therefore, it would be difficult to conclude statistically that elbow joint angles were more erroneous than other joint angles. The analysis of the offset is, however, beyond the scope of this study. Further investigation is needed to find the cause of and a potential solution for this phenomenon.

4.3 Possible ways to improve the accuracy of human pose estimation measurement

Avoiding occlusion as much as possible can be important for accurate human pose estimation. The camera height, angle, and position need to be adjusted to the movements to be captured. Regarding the capturing environment, the pose estimator may not be able to detect the person of interest when the environment is dark, because it extracts key features from the RGB values in a frame to look for human poses. A dark environment also causes motion blur, since the camera slows down the shutter speed to admit enough light. Extra persons can also confuse the pose estimator. In particular, human pose estimation uses a bottom-up approach that extracts body parts first and then associates them with a human pose. Therefore, when there are extra persons in a frame, the human pose estimation also detects those persons and may confuse the body parts. Removing the background may be a simple solution to this.

Nowadays, there are many selfie segmentation models that separate humans from the background. Alternatively, a frame without a human can be recorded before motion capture and used as a background reference, against which the frames containing the person of interest are compared by calculating RGB value differences. Anthropometric fitting can be another way to improve the accuracy. OpenCap [19] can be used to fit the pose estimation data to human anthropometry. OpenCap is a 3D motion capture application that can simulate kinematics and kinetics from pose estimation data. In the process of simulation, OpenCap calculates the kinematic and kinetic variables using the height and body mass of the person and a 3D human model from OpenSim [20]. The height and body mass of the person are the only anthropometric requirements in OpenCap, but if further anthropometric measurements are available, the 3D human pose data can be refined by minimizing the difference between the actual anthropometric measurements and those calculated from the 3D human pose data, using a least squares method, for example. This study can be extended to see whether the accuracy of the pose estimation measurement would improve with such anthropometric fitting methods. Another possibility to improve the accuracy is to train the human pose estimation model with a biomechanics-focused dataset. As one study [21] pointed out, the publicly available dataset was not prepared for the biomechanical use case, so the model should be trained with a dataset suited to the use case. For this study, the accuracy may improve if a dataset with athletic and sports movements were used to train the human pose estimation model. In fact, one study significantly improved the accuracy for extreme poses such as head-down poses by training the human pose estimation model with a dataset of these extreme poses [22].
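The background-reference idea mentioned above can be sketched with NumPy. This is a minimal illustration, assuming 8-bit RGB frames from a static camera and a per-pixel threshold on the largest channel difference; the threshold value of 30 and the function names `foreground_mask` and `remove_background` are hypothetical choices, not values from this study.

```python
import numpy as np

def foreground_mask(frame_rgb, background_rgb, threshold=30):
    """Boolean mask of pixels that differ from the empty reference frame."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(frame_rgb.astype(np.int16) - background_rgb.astype(np.int16))
    return diff.max(axis=-1) > threshold

def remove_background(frame_rgb, background_rgb, threshold=30):
    """Keep only the pixels that changed; set everything else to black."""
    mask = foreground_mask(frame_rgb, background_rgb, threshold)
    out = np.zeros_like(frame_rgb)
    out[mask] = frame_rgb[mask]
    return out
```

Such a simple difference is sensitive to lighting changes and shadows, and it removes only static background; extra persons who move through the frame would still remain.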

5 Conclusion

This study assessed the accuracy of human pose estimation-based kinematic measurements by comparing them to those of a marker-based motion capture system, a widely recognized type of motion capture system. The average errors for athletic and sports movements were 9.7 ± 4.7 degrees and 9.0 ± 3.3 degrees, respectively, and 7.8 ± 3.5 degrees and 7.4 ± 1.6 degrees when elbow joint angles were excluded. Pose estimators offer several advantages over traditional systems such as marker-based motion capture, but the accuracy of pose estimator-based kinematic measurements has not been thoroughly examined. The acceptable range of errors depends on the application. If human pose estimation is used in clinical settings that require precise measurements, the error found in this study may not be acceptable. In other fields, such as gait analysis, human pose estimation may have the potential to reduce the effort required to conduct biomechanical analysis. Potential sources of error include the capturing environment, occlusion, and mis-estimation. Considering these factors, the benefits of using pose estimators for kinematic analysis generally outweigh the errors when these fall within the acceptable range for the application. However, users of a pose estimator still need to pay attention to the above factors and make efforts to avoid the resulting errors as much as possible, although further investigation is needed to evaluate how much each factor influences the errors. In that sense, this study provides evidence of which kinematic measurements human pose estimation can measure better in different movements. This information should be valuable when users develop applications or apply kinematic analysis using human pose estimation.