1 Introduction

Monocular human motion capture is an active and substantial area of recent research. Its applications range from surveillance and animation to robotics and medical research. While a large number of commercial motion capture systems exists, monocular 3D reconstruction of human motion plays an important role where complex hardware arrangements are not feasible or are too costly.

Fig. 1.
figure 1

Mapping from a 3D point representation to the kinematic chain space. The vectors in the KCS correspond to directional vectors in the 3D point representation. The sphere shows the trajectories of the left and right lower arm in the KCS. Since both bones have the same length, their trajectories lie on the same sphere.

Recent approaches to the non-rigid structure from motion problem [1,2,3,4] achieve good results in laboratory settings. They are designed to work with tracked 2D points from arbitrary 3D point clouds. To resolve the ambiguity between camera and point motion they require sufficient camera motion in the observed sequence. On the other hand, in many applications (e.g. human motion capture, animal tracking or robotics) properties of the tracked objects are known. Exploiting known structural properties in non-rigid structure from motion is rarely considered, e.g. by example-based modeling as in [5] or by the constancy of bone lengths in [6]. Recently, linear subspace training approaches have been proposed [6,7,8,9,10,11]. They can efficiently represent human motion, even for 3D reconstruction from single images. However, they require extensive training on known motions, which restricts them to reconstructions of the same motion category. Furthermore, training-based approaches cannot sufficiently recover individual subtleties in the motion (e.g. limping instead of walking).

This paper closes the gap between non-rigid structure from motion and subspace-based human modeling. Similar to other approaches that build on the work of Bregler et al. [12], we decompose an observation matrix into three matrices corresponding to camera motion, transformation and basis shapes. Unlike other works that find a transformation which enforces properties of the camera matrices, we develop an algorithm that optimizes the transformation with respect to structural properties of the observed object. This reduces the amount of camera motion necessary for a good reconstruction. We experimentally found that even sequences without camera motion can be reconstructed. Unlike other works in the field of human modeling, we propose to first project the observations into a kinematic chain space (KCS) before optimizing a reprojection error with respect to our kinematic model. Figure 1 shows the mapping between the KCS and the representation based on 2D or 3D feature points. It is done by multiplication with matrices that implicitly encode a kinematic chain (cf. Sect. 3.1). This representation enables us to derive a nuclear norm optimization problem which can be solved efficiently. Imposing a low-rank constraint on a Gram matrix has been shown to improve 3D reconstructions [3]. However, the method of [3] constrains only the camera motion. Therefore, it requires sufficient camera motion. The KCS allows us to use a geometric constraint that is based on the topology of the underlying kinematic chain. Thus, the required amount of camera motion is much lower.

We evaluate our method on different standard databases (CMU MoCap [13], KTH [14], HumanEva [15], Human3.6M [16]) as well as on our own databases qualitatively and quantitatively. The proposed algorithm achieves state-of-the-art results and can handle problems like motion transfers and unseen motion. Due to the noise robustness of our method we can apply a CNN-based joint labeling algorithm [17, 18] for RGB images as input data which allows us to directly reconstruct human poses from unlabeled videos. Although this method is developed for human motion capture it is applicable to other kinematic chains such as animals or industrial robots as shown in the experiments in Sect. 4.3.

Summarizing, our contributions are:

  • We propose a method for 3D reconstruction of kinematic chains from monocular image sequences.

  • An objective function based on structural properties of kinematic chains is derived that not only imposes a low-rank assumption on the shape basis but also has a physical interpretation.

  • We propose using a nuclear norm optimization in a kinematic chain space.

  • In contrast to other works, our method is not limited to previously learned motion patterns and does not use strong anthropometric constraints such as a priori determined bone lengths.

2 Related Work

The idea of decomposing a set of 2D points tracked over a sequence into matrices whose entries are identified with the parameters of shape and motion was first proposed by Tomasi and Kanade [19]. A generalization of this algorithm to deforming shapes was proposed by Bregler et al. [12]. They assume that the observation matrix can be factorized into two matrices representing camera motion and multiple basis shapes. After an initial decomposition is found by singular value decomposition (SVD) of the observation matrix, they compute a transformation matrix by enforcing camera constraints. Xiao et al. [20] showed that the basis shapes of [12] are ambiguous. They solved this ambiguity by employing basis constraints on them. As shown by Akhter et al. [1], these basis constraints are still not sufficient to resolve the ambiguity. Therefore, they proposed to use an object-independent trajectory basis. Torresani et al. [21,22,23] proposed to use different priors on the transformation matrix, such as additional rank constraints and Gaussian priors. Gotardo and Martinez [24] built on the idea of [1] by applying the DCT representation to enforce a smooth 3D shape trajectory. In parallel, they proposed a solution that uses the kernel trick to also model nonlinear deformations [25], which cannot be represented by a linear combination of basis shapes. Hamsici et al. [2] also assume a smooth shape trajectory and apply the kernel trick to learn a mapping between the 3D shape and the 2D input data. Park et al. [26] introduced activity-independent spatial and temporal constraints. Inspired by [1] and [26], Valmadre et al. [27] proposed a dynamic programming approach combined with temporal filtering. Dai et al. [3] minimize the trace norm of the transformation matrix to impose a sparsity constraint. In contrast to [3], Lee et al. [28] define additional constraints on motion parameters to avoid the sparsity constraint.
Since all these methods are designed to work for arbitrary non-rigid 3D objects, none of them utilizes knowledge about the underlying kinematic structure. Rehan et al. [4] were the first to assume temporary rigidity of the reconstructed structures by factorizing a small number of consecutive frames. Thereby, they can reconstruct kinematic chains if the object does not deform much. Due to their sliding window assumption, the method remains restricted to scenes with sufficient camera motion.

Several works consider the special case of 3D reconstruction of human motion from monocular images. A common approach is to previously learn base poses of the same motion category. These are then linearly combined for the estimation of 3D poses. To avoid implausible poses, most authors utilize properties of human skeletons to constrain a reprojection error based optimization problem. However, anthropometric priors such as the sum of squared bone lengths [7], known limb proportions [8], known skeleton parameters [5], previously trained joint angle constraints [9] or strong physical constraints [29] all suffer from the fact that parameters have to be known a priori. Zhou et al. [10] propose a convex relaxation of the commonly used reprojection error formulation to avoid the alternating optimization of camera and object pose. While many approaches try to reconstruct human poses from a single image [30,31,32,33,34,35] using anthropometric priors, such constraints have rarely been used for 3D reconstruction from image sequences. Wandt et al. [6] constrain the temporal change of bone lengths without using a predefined skeleton. Zhou et al. [36] combined a deep neural network that estimates 2D landmarks with 3D reconstruction of the human pose. A different approach is to include sensors as an additional information source [37,38,39]. Other works use a trained mesh model, for instance SMPL [40], and project it to the image plane [41, 42]. The restriction to a trained subset of possible human motions is the major downside of these approaches.

In this paper we combine NR-SfM and human pose modeling without requiring previously learned motions. By using a representation that implicitly models the kinematic chain of a human skeleton, our algorithm is capable of reconstructing unknown motions from labeled image sequences.

3 Estimating Camera and Shape

The i-th joint of a kinematic chain is defined by a vector \(\varvec{x}_i \in \mathbb {R}^3\) containing the x, y, z-coordinates of the location of this joint. By concatenating j joint vectors we build a matrix representing the pose \(\varvec{X}\) of the kinematic chain

$$\begin{aligned} \varvec{X}=(\varvec{x}_1, \varvec{x}_2, \cdots , \varvec{x}_j). \end{aligned}$$
(1)

The pose \(\varvec{X}_k\) in frame k can be projected into the image plane by

$$\begin{aligned} \varvec{X}'_k=\varvec{P}_k \varvec{X}_k, \end{aligned}$$
(2)

where \(\varvec{P}_k\) is the projection matrix corresponding to a weak perspective camera. For a sequence of f frames, the pose matrices are stacked such that \(\varvec{W}=(\varvec{X}'_1, \varvec{X}'_2, \dots , \varvec{X}'_f)^T\) and \(\hat{\varvec{X}}=(\varvec{X}_{1}, \varvec{X}_{2}, \dots , \varvec{X}_{f})^T\). This implies

$$\begin{aligned} \varvec{W}=\varvec{P} \hat{\varvec{X}}, \end{aligned}$$
(3)

where \(\varvec{P}\) is a block diagonal matrix containing the camera matrices \(\varvec{P}_{1,\dots ,f}\) for the corresponding frame. After an initial camera estimation we subtract a matrix \(\varvec{X}_0\) from the measurement matrix by

$$\begin{aligned} \hat{\varvec{W}}=\varvec{W} - \varvec{P} \hat{\varvec{X}}_0, \end{aligned}$$
(4)

where \(\hat{\varvec{X}}_0\) is obtained by stacking \(\varvec{X}_0\) multiple times to obtain the same size as \(\varvec{W}\). Here, we take \(\varvec{X}_0\) to be a mean pose. We will provide experimental evidence that the algorithm proposed in the following is insensitive w.r.t. the choice of \(\varvec{X}_0\) as long as it represents a reasonable configuration of the kinematic chain. In all the experiments dealing with kinematic chains of humans, we take \(\varvec{X}_0\) to be the average of all poses in the CMU data set.
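The stacking and centering of Eqs. (3) and (4) can be sketched in numpy as follows. This is a purely illustrative toy setup: the sizes, random data and variable names are hypothetical, and each camera is simply an arbitrary \(2\times 3\) matrix.

```python
import numpy as np

# hypothetical toy sizes: f frames, j joints
f, j = 3, 5
rng = np.random.default_rng(5)
X0 = rng.standard_normal((3, j))                        # mean pose X_0
Ps = [rng.standard_normal((2, 3)) for _ in range(f)]    # per-frame cameras P_k
Xs = [rng.standard_normal((3, j)) for _ in range(f)]    # per-frame poses X_k

# stacked quantities of Eqs. (3) and (4)
W = np.vstack([Ps[k] @ Xs[k] for k in range(f)])        # observations, 2f x j
X0_stack = np.vstack([X0] * f)                          # stacked mean pose, 3f x j
P = np.zeros((2 * f, 3 * f))                            # block diagonal camera matrix
for k in range(f):
    P[2 * k:2 * k + 2, 3 * k:3 * k + 3] = Ps[k]
W_hat = W - P @ X0_stack                                # centered observations, Eq. (4)
```

The block diagonal structure of \(\varvec{P}\) guarantees that each camera only acts on the pose of its own frame.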

Following the approach of Bregler et al. [12] we decompose \(\hat{\varvec{W}}\) by Singular Value Decomposition to obtain a rank-3K pose basis \(\varvec{Q} \in \mathbb {R}^{3K\times j}\). While [12] and similar works then optimize a transformation matrix with respect to orthogonality constraints of camera matrices, we optimize the transformation matrix with respect to constraints based on a physical interpretation of the underlying structure. With \(\varvec{A}\) as transformation matrix for the pose basis we may write

$$\begin{aligned} \varvec{W}=\varvec{P} (\hat{\varvec{X}}_0 + \varvec{A}\varvec{Q}). \end{aligned}$$
(5)
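The initial decomposition of \(\hat{\varvec{W}}\) by SVD can be sketched in numpy as follows. This is a minimal illustration of the rank-3K factorization step; the function name and the toy data are hypothetical.

```python
import numpy as np

def factorize(W_hat, K):
    """Rank-3K factorization of the centered 2f x j measurement matrix
    via SVD; Q plays the role of the pose basis of Sect. 3."""
    U, s, Vt = np.linalg.svd(W_hat, full_matrices=False)
    r = 3 * K
    M = U[:, :r] * s[:r]   # camera-and-transformation part, 2f x 3K
    Q = Vt[:r, :]          # pose basis, 3K x j
    return M, Q

# toy example: 4 frames (8 rows), 5 joints, K = 1
rng = np.random.default_rng(0)
W_hat = rng.standard_normal((8, 5))
M, Q = factorize(W_hat, K=1)
```

By construction, \(\varvec{M}\varvec{Q}\) is the best rank-3K approximation of \(\hat{\varvec{W}}\) in the Frobenius norm.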

In the following sections we will present how poses can be projected into the kinematic chain space (Sect. 3.1) and how we derive an optimization problem from it (Sect. 3.2). Combined with the camera estimation (Sect. 3.3) an alternating algorithm is presented in Sect. 3.4.

3.1 Kinematic Chain Space

To define a bone \(\varvec{b}_k\), a vector between the r-th and t-th joint is computed by

$$\begin{aligned} \varvec{b}_k=\varvec{x}_r-\varvec{x}_t=\varvec{X}\varvec{c}, \end{aligned}$$
(6)

where

$$\begin{aligned} \varvec{c}=(0,\dots , 0, 1, 0, \dots , 0,-1,0,\dots ,0)^T, \end{aligned}$$
(7)

with 1 at position r and \(-1\) at position t. The vector \(\varvec{b}_k\) has the same direction and length as the corresponding bone. Similarly to Eq. (1), a matrix \(\varvec{B} \in \mathbb {R}^{3\times b}\) can be defined containing all b bones

$$\begin{aligned} \varvec{B}=(\varvec{b}_1, \varvec{b}_2, \dots , \varvec{b}_b). \end{aligned}$$
(8)

The matrix \(\varvec{B}\) is calculated by

$$\begin{aligned} \varvec{B}=\varvec{X}\varvec{C}, \end{aligned}$$
(9)

where \(\varvec{C}\in \mathbb {R}^{j\times b}\) is built by concatenating multiple vectors \(\varvec{c}\). Analogously to \(\varvec{C}\), a matrix \(\varvec{D}\in \mathbb {R}^{b\times j}\) can be defined that maps \(\varvec{B}\) back to \(\varvec{X}\):

$$\begin{aligned} \varvec{X}=\varvec{B}\varvec{D}. \end{aligned}$$
(10)

\(\varvec{D}\) is constructed similar to \(\varvec{C}\). Each column adds vectors in \(\varvec{B}\) to reconstruct the corresponding point coordinates. Note that \(\varvec{C}\) and \(\varvec{D}\) are a direct result of the underlying kinematic chain. Therefore, the matrices \(\varvec{C}\) and \(\varvec{D}\) perform the mapping from point representation into the kinematic chain space and vice versa.
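As an illustration, the matrices \(\varvec{C}\) and \(\varvec{D}\) can be constructed as follows for a hypothetical four-joint chain. This is a minimal sketch: the chain topology is made up, and the root joint is assumed to lie at the origin so that the mapping from bones back to points is well defined.

```python
import numpy as np

# hypothetical 4-joint chain rooted at joint 0: bones (0,1), (1,2), (2,3)
joints = 4
bones = [(0, 1), (1, 2), (2, 3)]

# C maps joint positions to bone vectors, B = X C (Eq. (9)):
# one column per bone, with +1 at joint r and -1 at joint t
C = np.zeros((joints, len(bones)))
for k, (r, t) in enumerate(bones):
    C[r, k], C[t, k] = 1.0, -1.0

# D maps bones back to joints, X = B D (Eq. (10)); for this simple chain,
# joint i is reached by accumulating bones 0..i-1 with sign -1, since
# b_k = x_r - x_t points from the child t to the parent r
D = np.zeros((len(bones), joints))
for i in range(1, joints):
    D[:i, i] = -1.0

# round trip on a random pose with the root pinned to the origin
rng = np.random.default_rng(1)
X = rng.standard_normal((3, joints))
X -= X[:, [0]]
B = X @ C
X_rec = B @ D
```

For branched skeletons each column of \(\varvec{D}\) accumulates the bones along the path from the root to the respective joint, exactly as described above.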

3.2 Trace Norm Constraint

One of the main properties of human skeletons is the fact that bone lengths do not change over time.

Let

$$\begin{aligned} \varvec{\varPsi }=\varvec{B}^T\varvec{B}= \begin{pmatrix} l_1^2 & \cdot & \cdot & \cdot \\ \cdot & l_2^2 & \cdot & \cdot \\ \cdot & \cdot & \ddots & \cdot \\ \cdot & \cdot & \cdot & l_b^2 \end{pmatrix}. \end{aligned}$$
(11)

be a matrix with the squared bone lengths on its diagonal. From \(\varvec{B}\in \mathbb {R}^{3\times b}\) it follows that \(rank(\varvec{B})\le 3\). Thus, \(\varvec{\varPsi }\) has at most rank 3. Note that if \(\varvec{\varPsi }\) is computed for every frame we can define a stronger constraint on \(\varvec{\varPsi }\): since bone lengths do not change for the same person, the diagonal of \(\varvec{\varPsi }\) remains constant over the sequence.

Proposition 1

The nuclear norm of \(\varvec{B}\) is invariant for any bone configuration of the same person.

Proof

The trace of \(\varvec{\varPsi }\) equals the sum of squared bone lengths (Eq. (11))

$$\begin{aligned} trace(\varvec{\varPsi })=\sum _{i=1}^b l_i^2. \end{aligned}$$
(12)

From the assumption that the bone lengths of humans are invariant during a captured image sequence, it follows that the trace of \(\varvec{\varPsi }\) is constant. The same argument holds for \(trace(\sqrt{\varvec{\varPsi }})\), with \(\sqrt{\cdot }\) denoting the matrix square root. Therefore, we have

$$\begin{aligned} \Vert \varvec{B}\Vert _*=trace(\sqrt{\varvec{\varPsi }})=const. \end{aligned}$$
(13)
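The identities used in the proof can be checked numerically. In the following sketch the bone matrix is random and purely illustrative; \(\sqrt{\varvec{\varPsi }}\) is evaluated through the eigenvalues of the Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 6))   # hypothetical bone matrix with b = 6 bones

Psi = B.T @ B                     # Gram matrix of Eq. (11)
lengths_sq = np.diag(Psi)         # squared bone lengths l_i^2 on the diagonal

# nuclear norm of B = sum of singular values
#                   = trace of the matrix square root of Psi
nuc = np.linalg.norm(B, 'nuc')
eigvals = np.clip(np.linalg.eigvalsh(Psi), 0.0, None)
trace_sqrt_Psi = np.sqrt(eigvals).sum()
```

The diagonal of the Gram matrix recovers the squared bone lengths, and the trace identity of Eq. (13) holds up to numerical precision.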

Since this constancy constraint is non-convex, we relax it to derive an easily solvable optimization problem. Using Eq. (9), we project Eq. (5) into the KCS, which gives

$$\begin{aligned} \varvec{W} \varvec{C}=\varvec{P} (\hat{\varvec{X}}_0 \varvec{C} + \varvec{A}\varvec{Q} \varvec{C}). \end{aligned}$$
(14)

The unknown is the transformation matrix \(\varvec{A}\). For better readability we define \(\varvec{B}_0=\varvec{X}_0 \varvec{C}\) and \(\varvec{S}=\varvec{Q} \varvec{C}\).

Proposition 2

The nuclear norm of the transformation matrix \(\varvec{A}\) for each frame is bounded below by a scalar c which is the same for every frame.

Proof

Let \(\varvec{B}=\varvec{B}_1+\varvec{B}_0\) be a decomposition of \(\varvec{B}\) into the initial bone configuration \(\varvec{B}_0\) and a difference to the observed pose \(\varvec{B}_1\). It follows that

$$\begin{aligned} \Vert \varvec{B}\Vert _*=\Vert \varvec{B}_1+\varvec{B}_0\Vert _*=c_1, \end{aligned}$$
(15)

where \(c_1\) is a constant. The triangle inequality for matrix norms gives

$$\begin{aligned} \Vert \varvec{B}_1\Vert _* +\Vert \varvec{B}_0\Vert _* \ge \Vert \varvec{B}_1+\varvec{B}_0\Vert _*=c_1. \end{aligned}$$
(16)

Since \(\varvec{B}_0\) is known, it follows

$$\begin{aligned} \Vert \varvec{B}_1 \Vert _* \ge c_1 - \Vert \varvec{B}_0\Vert _* = c, \end{aligned}$$
(17)

where c is constant. \(\varvec{B}_1\) can be represented in the shape basis \(\varvec{S}\) (cf. Sect. 3) by multiplying it with the transformation matrix \(\varvec{A}\)

$$\begin{aligned} \varvec{B}_1 = \varvec{AS}. \end{aligned}$$
(18)

Since the shape basis matrix \(\varvec{S}\) is a unitary matrix, the nuclear norm of \(\varvec{B}_1\) equals

$$\begin{aligned} \Vert \varvec{B}_1\Vert _* = \Vert \varvec{A}\Vert _*. \end{aligned}$$
(19)

From Eq. (17) it follows that

$$\begin{aligned} \Vert \varvec{A} \Vert _* \ge c. \end{aligned}$$
(20)

Proposition 2 also holds for a sequence of frames. Let \(\varvec{\hat{A}}\) be the matrix built by stacking \(\varvec{A}\) for each frame and let \(\varvec{\hat{B}}_0\) be defined analogously. We relax Eq. (20) and obtain the final formulation of our optimization problem

$$\begin{aligned} \min _{\hat{\varvec{A}}} \Vert \hat{\varvec{A}}\Vert _* \quad s.t. \quad \Vert \varvec{W}\varvec{C}-\varvec{P}(\hat{\varvec{A}}\varvec{S}+\hat{\varvec{B}}_0)\Vert _F < \epsilon . \end{aligned}$$
(21)

Equation (21) not only imposes a low-rank assumption on the transformation matrix. By the derivation above we showed that the nuclear norm is a reasonable choice because it has a concise physical interpretation. More intuitively, minimizing the nuclear norm gives solutions whose bone rotations are close to a mean configuration \(\varvec{B}_0\). The constraint in Eq. (21), which represents the reprojection error, prevents the optimization from converging to the trivial solution \(\Vert \varvec{A}\Vert _*=0\). This allows for the reconstruction of arbitrary poses and skeletons.

Moreover, Eq. (21) is a well-studied problem that can be solved efficiently by common optimization methods such as Singular Value Thresholding (SVT) [43].
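The core SVT step, soft-thresholding of the singular values, can be sketched as follows. This is a minimal illustration of the proximal operator of the nuclear norm, not the full constrained solver of [43]; the data are hypothetical.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear
    norm, shrinking every singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
A_shrunk = svt(A, tau=0.5)
```

Shrinking the singular values can only decrease the nuclear norm, which is what drives the solution toward low rank.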

3.3 Camera

The objective function in Eq. (21) can also be optimized for the camera matrix \(\varvec{P}\). Since \(\varvec{P}\) is a block diagonal matrix, Eq. (21) can be solved block-wise for each frame. With \(\varvec{X}'_i\) and \(\varvec{P}_i\) corresponding to the observation and camera at frame i the optimization problem can be written as

$$\begin{aligned} \min _{\varvec{P}_i} \Vert \varvec{X}'_i \varvec{C} - \varvec{P}_i(\varvec{AS}+\varvec{B}_0) \Vert _F. \end{aligned}$$
(22)

Considering the entries in

$$\begin{aligned} \varvec{P}_i = \begin{pmatrix} p_{11} & p_{12} & p_{13}\\ p_{21} & p_{22} & p_{23} \end{pmatrix} \end{aligned}$$
(23)

we can enforce a weak perspective camera by the constraints

$$\begin{aligned} p_{11}^2+p_{12}^2+p_{13}^2 -(p_{21}^2+p_{22}^2+p_{23}^2) = 0 \end{aligned}$$
(24)

and

$$\begin{aligned} p_{11} p_{21} + p_{12} p_{22} + p_{13} p_{23} = 0. \end{aligned}$$
(25)

3.4 Algorithm

In the previous sections we derived optimization problems that can be solved for the camera matrix \(\varvec{P}\) and the transformation matrix \(\varvec{A}\), respectively. As both are unknown, we propose Algorithm 1, which alternatingly solves for the two matrices. Initialization is done by setting all entries of the transformation matrix \(\varvec{A}\) to zero. Additionally, an initial bone configuration \(\varvec{B}_0\) is required. It has to roughly model a human skeleton but does not need to be the mean of the sequence.

figure a
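The alternating structure of Algorithm 1 can be sketched as follows. This is a strongly simplified, purely illustrative version: the camera step is an unconstrained least-squares fit (the weak perspective constraints of Sect. 3.3 are omitted), the shape step uses a plain least-squares solve followed by nuclear-norm shrinkage rather than the full constrained SVT solver, and all names and toy data are hypothetical.

```python
import numpy as np

def svt_step(M, tau):
    """Shrink the singular values of M by tau (nuclear norm proximal step)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def reconstruct(W_kcs, B0, S, tau=0.01, iters=3):
    """Illustrative alternation: W_kcs is a list of per-frame 2D bone
    matrices (2 x b), B0 the initial 3D bone configuration (3 x b),
    S a (3K x b) bone-space shape basis."""
    f, r = len(W_kcs), S.shape[0]
    A = [np.zeros((3, r)) for _ in range(f)]
    P = [None] * f
    for _ in range(iters):
        for i in range(f):                      # camera step (unconstrained)
            B = A[i] @ S + B0
            Pt, *_ = np.linalg.lstsq(B.T, W_kcs[i].T, rcond=None)
            P[i] = Pt.T
        for i in range(f):                      # shape step
            R = W_kcs[i] - P[i] @ B0
            # vec(P A S) = (S^T kron P) vec(A), column-major vectorization
            a, *_ = np.linalg.lstsq(np.kron(S.T, P[i]),
                                    R.flatten(order='F'), rcond=None)
            A[i] = svt_step(a.reshape(3, r, order='F'), tau)
    return P, A

# synthetic toy data: f frames, b bones, K basis shapes
rng = np.random.default_rng(4)
b, K, f = 8, 2, 5
S = rng.standard_normal((3 * K, b))
B0 = rng.standard_normal((3, b))
W_kcs = [rng.standard_normal((2, b)) for _ in range(f)]
P, A = reconstruct(W_kcs, B0, S, tau=0.0, iters=2)
```

The two inner loops correspond to the two subproblems of the alternation: each camera is fitted against the current 3D bones, and each per-frame transformation is refitted against the current cameras before its singular values are shrunk.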

4 Experiments

For the evaluation of our algorithm, different benchmark data sets (CMU MoCap [13], HumanEva [15], KTH [14], Human3.6M [16]) were used. As a measure for the quality of the 3D reconstructions we calculate the Mean Per Joint Position Error (MPJPE) [16], which is defined by

$$\begin{aligned} e = \frac{1}{j} \sum _{i=1}^{j} \Vert \varvec{x}_i - \hat{\varvec{x}}_i \Vert , \end{aligned}$$
(26)

where \(\varvec{x}_i\) and \(\hat{\varvec{x}}_i\) correspond to the ground truth and estimated positions of the i-th joint respectively. By rigidly aligning the 3D reconstruction to the ground truth we obtain the 3D positioning error (3DPE) as introduced by [44]. To compare sequences of different lengths the mean of the 3DPE over all frames is used. In the following it is referred to as 3D error.
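The MPJPE of Eq. (26) is straightforward to implement. The following is a minimal sketch with hypothetical data; the 3DPE additionally requires a rigid Procrustes alignment to the ground truth, which is omitted here.

```python
import numpy as np

def mpjpe(X_gt, X_est):
    """Mean per joint position error of Eq. (26): the mean Euclidean
    distance over the j joints (both arrays are 3 x j)."""
    return float(np.mean(np.linalg.norm(X_gt - X_est, axis=0)))

# toy check: shifting every joint by 10 mm along x gives an error of 10 mm
X = np.zeros((3, 5))
X_shifted = X.copy()
X_shifted[0, :] += 10.0
# mpjpe(X, X_shifted) -> 10.0
```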

Additional to this quantitative evaluation we perform reconstructions of different kinematic chains in Sect. 4.3 and on unlabeled image sequences in Sect. 4.4. All animated meshes in this section are created using SMPL [40]. The SMPL model is fitted to the reconstructed skeleton and is used solely for visualization.

Fig. 2.
figure 2

Reconstruction of the highly articulated directions sequence from the Human3.6M data set subject 1.

Fig. 3.
figure 3

Reconstruction of a running motion from the CMU database subject 35/17.

4.1 Evaluation on Benchmark Databases

To qualitatively show the drawbacks of learning-based approaches we reconstructed a sequence of a limping person. We use the method of [6] trained on walking patterns to reconstruct the 3D scene. Although the motions are very similar, the algorithm of [6] is not able to reconstruct the subtle motions of the limping leg. Figure 4 shows the knee angle of the respective leg. The learning-based method reconstructs a periodic walking motion and cannot recover the unknown asymmetric motion which makes it unusable for gait analysis applications. The proposed algorithm is able to recover the motion in more detail.

Fig. 4.
figure 4

Knee angle of reconstructions of a limping motion. The learning-based method [6] struggles to reconstruct minor differences from the motion patterns used for training whereas our learning-free approach recovers the knee angle in more detail.

We compare our method with the unsupervised works [1, 2] and the learning-based approach of [6]. The codes of [1] and [2] are freely available. Although there are slightly newer works, these two approaches show the inherent problem of these unsupervised methods (as also shown in [4]). We are not aware of any works that are able to reconstruct scenes with very limited or no camera motion without a model of the underlying structure. Rehan et al. [4] assume a local rigidity that allows for defining a kinematic chain model. This reduced the amount of necessary camera motion to 2 degrees per frame. However, due to their assumption that the observed object is approximately rigid in a small time window they are limited to a constantly moving camera.

For each sequence we created 20 random camera paths with little or no camera motion and compared our 3D reconstruction results with the other methods. Table 1 shows the 3D error in mm for different sequences and data sets. For the entry walk35 we calculated the mean of the 3D errors over all 23 walking sequences from subject 35 in the CMU database. The columns jump and limp show the 3D error of a single jumping and limping sequence. KTH refers to the football sequence of the KTH data set [14] and HE to the walking sequence of the HumanEva data set [15]. The last four columns are average errors over all subjects performing the respective motions of the Human3.6M data set [16]. Note that the highly articulated motions from the Human3.6M data set vary a lot within the same category and are therefore hard to learn by approaches like [6]. All these sequences are captured with little or no camera motion. The unsupervised methods of [1] and [2] require more camera motion and completely fail in these scenarios. The learning-based approach of [6] reconstructs plausible poses for all sequences. It even achieves a better result for the walking motions. However, motions with larger variations between persons and sequences (e.g. jumping and limping) are harder to reconstruct from the learned pose basis. Although the results look like plausible human motions, they lack the ability to reconstruct subtle motion variations. In contrast, the proposed method is able to reconstruct these variations and achieves a better result. Some of our reconstructions are shown in Figs. 2 and 3 for sequences of the Human3.6M and CMU data set, respectively.

Table 1. 3D error in mm for different sequences and data sets. The column walk35 shows the mean 3D error of all sequences containing walking motion from subject 35 in the CMU database. jump refers to the jumping motion of subject 13/11 of the CMU database and limp to the limping motion of subject 91/16. KTH means the football sequence of the KTH data set [14]. The column HE shows the 3D error for the HumanEva walking sequence [15]. The last four columns are average errors over all subjects performing the respective motions of the Human3.6M data set [16].

4.2 Convergence

We alternatingly optimize the transformation matrix (Eq. (21)) and the camera matrices (Eq. (22)). Since convergence of the algorithm cannot be guaranteed analytically, we show it by experiment. Figure 5 shows the convergence of the reprojection error in pixels for a sequence from the CMU MoCap database. However, the reprojection error only shows the convergence of the proposed algorithm and cannot prove that the 3D reconstruction improves in every iteration. We therefore additionally evaluated the convergence of the 3D error in Fig. 5. In most cases our algorithm converges to a good minimum in fewer than 3 iterations. Further iterations do not improve the visual quality and deform the 3D reconstruction by less than 1 mm. The 3D error remains constant during camera estimation, which causes the steps in the error plot.

Figure 6 shows the computation time over the number of frames for three different sequences. The computation time mostly depends on the number of frames and less on the observed motion. We use unoptimized Matlab code on a desktop PC for all computations.

Fig. 5.
figure 5

Reprojection error and 3D error with respect to number of iterations for subject35/sequence1 from the CMU MoCap data set. Even steps refer to camera estimation while odd steps correspond to shape estimation.

Fig. 6.
figure 6

Computation time for walking, running and jumping sequences of the CMU data set using unoptimized Matlab code. It mostly depends on the number of frames and less on the observed motion.

4.3 Other Kinematic Chains

Although our method was developed for the reconstruction of human motion, it generalizes to all kinematic chains that do not include translational joints. In this section we show reconstructions of other kinematic chains such as people holding objects, animals and industrial robots.

In situations where people hold objects with both hands the kinematic chain of the body can be extended by another rigid connection between the two hands. Figure 7 shows the reconstruction of the sword fighting sequence of the CMU data set. By simply adding another column to the kinematic chain space matrix \(\varvec{C}\) (cf. Sect. 3.1) the distance between the two hands is enforced to remain constant. The exact distance does not need to be known, however.
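Extending \(\varvec{C}\) by such a rigid link amounts to appending a single column, as in the following sketch (the joint indices and matrix sizes are hypothetical).

```python
import numpy as np

def add_rigid_link(C, r, t):
    """Append one column to the KCS matrix C (cf. Sect. 3.1), enforcing a
    rigid connection between joints r and t, e.g. the two hands."""
    col = np.zeros((C.shape[0], 1))
    col[r, 0], col[t, 0] = 1.0, -1.0
    return np.hstack([C, col])

# hypothetical chain with 5 joints and 4 bones; link joints 2 and 4
C = np.zeros((5, 4))
C_ext = add_rigid_link(C, 2, 4)
```

The new column is treated exactly like a bone, so its length is kept constant by the same nuclear norm constraint without being known in advance.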

Fig. 7.
figure 7

Reconstruction of the sword play sequence of the CMU database. The kinematic chain is extended such that the hands are rigidly connected.

Fig. 8.
figure 8

Reconstruction of a sequence of an industrial robot moving along a path. The reconstruction is shown as an augmented overlay over the images.

Fig. 9.
figure 9

Reconstruction of a horse riding sequence. Although we use a very rough model for the skeleton of the horse we obtain plausible reconstructions. The complete reconstruction including more views can be seen in the supplemental video.

Figure 8 shows a robot used for precision milling and the reconstructed 3D model as an overlay. The proposed method is able to correctly reconstruct the robot's motion. In Fig. 9 we reconstructed a more complex motion of a horse during show jumping. We used a simplified model of the bone structure of a horse. Moreover, in reality the shoulder joint is not completely rigid. Despite these limitations the algorithm achieves plausible results.

4.4 Image Sequences

The proposed method is designed to reconstruct a 3D object from labeled feature points. In the previous sections these were set and tracked semi-interactively. In this section we show that our method is also able to use the noisy output of a human joint detector. We use deeperCut [17, 18] to estimate the joints in the outdoor run and jump sequence from [45]. Figure 10 shows the joints estimated by deeperCut and our 3D reconstruction. As can be seen, we achieve plausible 3D reconstructions even with automatically labeled, noisy input data.

Fig. 10.
figure 10

Reconstruction of a running and jumping sequence from [45] automatically labeled by deeperCut [17, 18].

5 Conclusion

We developed a method for the 3D reconstruction of kinematic chains from monocular image sequences. By projecting into the kinematic chain space a constraint is derived that is based on the assumption that bone lengths are constant. This results in the formulation of an easy to solve nuclear norm optimization problem. It allows for reconstruction of scenes with little camera motion where other non-rigid structure from motion methods fail. Our method does not rely on previous training or predefined body measures such as known limb lengths. The proposed algorithm generalizes to the reconstruction of other kinematic chains and achieves state-of-the-art results on benchmark data sets.