Single view silhouette fitting techniques for estimating tennis racket position
Stereo camera systems have been used to track markers attached to a racket, allowing its position to be obtained in three-dimensional (3D) space. Typically, markers are manually selected on the image plane, but this can be time-consuming. A markerless system based on one stationary camera estimating 3D racket position data is desirable for research and play. The markerless method presented in this paper relies on a set of racket silhouette views in a common reference frame captured with a calibrated camera and a silhouette of a racket captured with a camera whose relative pose is outside the common reference frame. The aim of this paper is to provide validation of these single view fitting techniques to estimate the pose of a tennis racket. This includes the development of a calibration method to provide the relative pose of a stationary camera with respect to a racket. Mean static racket position was reconstructed to within ±2 mm. Computer generated camera poses and silhouette views of a full size racket model were used to demonstrate the potential of the method to estimate 3D racket position during a simplified serve scenario. From a camera distance of 14 m, 3D racket position was estimated providing a spatial accuracy of 1.9 ± 0.14 mm, similar to recent 3D video marker tracking studies of tennis.
Developments in science and technology can influence the game of tennis. The International Tennis Federation (ITF) regularly reviews the rules, balancing technological developments in equipment with preserving the nature of the sport. Ever-increasing advances in computing power, motion analysis software, sensors, and cameras, combined with decreasing technology costs, have also fostered the development of systems for measuring performance variables such as tennis ball velocity and spin. In 2014, the rules of tennis were updated to allow the use of player analysis technology (PAT) in competitive play. PAT has the potential to provide players and coaches with key information for competition and training, for research, and for sport broadcasting.
Obtaining racket motion during strokes is of interest, as racket speed and orientation at impact influence the rebound of the ball. Sensors can be attached to a racket to measure its motion, although these add mass, and elite players can distinguish differences in moment of inertia as small as 2.5%. Recent work evaluating wireless inertial measurement units for measuring baseball bat swings indicates that current sensors may not be accurate over the full range of speeds experienced in play. Stereo-calibrated cameras have been used to obtain racket motion in 3D, but most previous research, especially in field conditions, has relied on time-consuming manual digitisation of markers attached to the frame [7, 8, 9].
Markerless motion tracking systems were initially developed to track ball and player movement in tennis, although these are often limited to two-dimensional (2D) analysis. Pingali, Jean and Carlbom and Yan, Christmas and Kittler applied temporal differencing techniques to automatically identify ball and player positions from broadcast television footage. Pingali, Jean and Carlbom used positional data to provide a statistical analysis of matches, while Yan, Christmas and Kittler developed a system designed for robustness rather than accuracy. Kelley et al. developed a system to automatically measure ball spin rates and velocity from high-speed video footage. Dunn et al. developed a semi-automatic method for identifying foot–surface contacts during match-play using a range of image processing techniques. The techniques utilised in these studies [10, 11, 12, 13] cannot be applied directly to accurately track racket motion in 3D.
Marker-free techniques for analysis of movement in 3D often apply a Visual Hull, which uses silhouettes—typically extracted digitally—from a number of camera views to reconstruct a volume of interest [15, 16, 17, 18, 19, 20]. Corazza et al. tracked walking trials and gymnastic flips, and concluded that Visual Hull-based approaches for tracking human movements require at least eight cameras. Sheets et al. used a Visual Hull generated from the silhouette views of eight high-speed cameras to measure racket and player motion during a serve. Strategic placement and calibration of eight cameras in relatively close proximity to the court and player is not well suited to analysis of racket movement during competitive play. A non-invasive markerless motion capture system utilising one stationary camera is desirable for measuring 3D racket movement, to limit cost, improve portability, reduce setup times, and facilitate use in competitive play.
Price and Morrison applied silhouette fitting techniques to estimate the 3D position of particles using the view from one stationary camera. Initially, they calibrated six stationary cameras to obtain their poses within a common reference frame using a calibration object of known dimensions. The six calibrated cameras were then used to obtain six digital silhouette views of a stationary particle. The relative camera pose associated with each particle silhouette view was known from the calibration and, together, the group of silhouettes was defined as a calibrated set. Adopting methods from Forbes et al., and using the same six calibrated cameras, they merged calibrated sets—each with the particle oriented in a different stationary position—increasing the number of silhouette views in the calibrated set to up to sixty.
Price and Morrison then estimated the relative 3D position of the particle with respect to an additional stationary camera, which was not part of the calibrated set. They used the Levenberg–Marquardt routine to optimise the consistency between the calibrated set and a silhouette of the particle captured from the camera pose outside the calibrated set. Silhouette consistency was measured using the epipolar tangency constraint [21, 22, 23, 24]. By adjusting the candidate relative poses to minimise the epipolar tangency error (ETE), an accurate estimate (within 5° of the criterion orientation) of the pose of the particle with respect to the camera was made. The accuracy of a pose estimate increases with the number of silhouettes in a set and the irregularity of the particle shape. More regularly shaped particles can appear similar from different viewpoints, and therefore the optimisation routine is more likely to converge to a local minimum away from the solution. Elliott et al. adapted methods from other authors [26, 27] to produce a 44-view calibrated set of a 1:5 scale model tennis racket using one camera. Single view fitting techniques adopted from Price and Morrison were applied to estimate racket position to within ±2 mm. While these single view fitting techniques require prior construction of a calibrated set, they illustrate the potential for measuring racket motion on-court with one camera.
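The pose-refinement idea described above — repeatedly adjusting a candidate pose with a Levenberg–Marquardt step until the root-mean-squared residual falls below a threshold — can be sketched as follows. This is an illustration in Python (the study used MATLAB), and the residual here is a simple point-matching error standing in for the silhouette-based ETE; the shape, true pose, and threshold values are invented for the example.

```python
# Illustrative Levenberg-Marquardt pose refinement (not the authors' code).
# A candidate 2D pose (rotation angle, translation) is adjusted to minimise
# a residual vector, standing in for the epipolar tangency error (ETE).
import numpy as np

def transform(pose, pts):
    """Apply a 2D rigid transform (theta, tx, ty) to an (N, 2) point set."""
    theta, tx, ty = pose
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return pts @ R.T + np.array([tx, ty])

def residuals(pose, model_pts, observed_pts):
    """Stacked x/y errors between the transformed model and the observation."""
    return (transform(pose, model_pts) - observed_pts).ravel()

def levenberg_marquardt(model_pts, observed_pts, pose0,
                        rms_threshold=0.5e-3, max_iter=100, lam=1e-3):
    """Minimise the sum of squared residuals; terminate when the RMS error
    drops below a threshold, mirroring the paper's RMS ETE criterion."""
    pose = np.asarray(pose0, dtype=float)
    for _ in range(max_iter):
        r = residuals(pose, model_pts, observed_pts)
        if np.sqrt(np.mean(r ** 2)) < rms_threshold:
            break
        # Numerical Jacobian of the residual vector w.r.t. pose parameters.
        J = np.empty((r.size, pose.size))
        eps = 1e-6
        for j in range(pose.size):
            dp = np.zeros_like(pose)
            dp[j] = eps
            J[:, j] = (residuals(pose + dp, model_pts, observed_pts) - r) / eps
        # Damped normal equations: (J^T J + lam*I) delta = -J^T r
        A = J.T @ J + lam * np.eye(pose.size)
        pose = pose + np.linalg.solve(A, -J.T @ r)
    return pose

# Recover a known pose from noiseless synthetic "silhouette" points.
model = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 2.0], [0.3, 1.1]])
true_pose = np.array([0.4, 0.7, -0.2])
observed = transform(true_pose, model)
est = levenberg_marquardt(model, observed, pose0=[0.0, 0.0, 0.0])
```

Because the model shape is asymmetric, the routine converges to the true pose; with a more symmetric shape several poses would produce near-identical residuals, which is exactly the local-minimum risk noted for regularly shaped objects.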
The aim of this paper was to validate single view fitting techniques to estimate the pose of a tennis racket. Applying these techniques to a tennis racket is particularly challenging, as it is more regular in shape than the particles studied by Price and Morrison. An improved method is presented for constructing a more consistent calibrated set of tennis racket views in a laboratory. A view was removed from the calibrated set, and single view fitting techniques were applied to obtain the position of the racket using the silhouette associated with this view. The technique was then applied using computer generated silhouettes of a full size racket model in a simulated on-court scenario to estimate racket position during a simplified serve movement.
2.1 Construction of a calibrated set of silhouette views
The racket was painted matt black and white sheets formed a contrasting backdrop to aid digital silhouette extraction (Fig. 1b). The Perspex board formed the lower calibration plane (global X–Y plane) and control points 1–4 consisted of machined grooves filled with black paint to form the corners of a rectangle measuring ~360 × ~290 mm. The origin of the global coordinate system was set at control point 1, creating a left-hand reference frame (Fig. 1a). The racket formed the upper calibration plane and control points 5–7 were white marks painted on two black rods (diameter of 4 mm) fixed perpendicular to one another in the frame. A laser scanner (Metris ModelMaker D100) accurate to 0.050 mm was used to obtain the relative position of all control points, while also confirming that the calibration planes were orthogonal and the face of the racket was parallel to the global Y–Z plane, within practical limits (<0.5°).
The master camera was positioned ~2 m from the racket, with the optical axis forming an angle of ~10° with the global X–Z plane (Fig. 1c). The slave camera was positioned at angles from ~−60° to ~60° in ~20° increments at heights of ~1.15, ~1.55, and ~1.85 m to form three tiers (Fig. 1d). A suitable configuration for the calibrated set was found using simulations with computer generated silhouette views in Blender (v2.70), with modifications to incorporate the practicalities of positioning cameras and undertaking a calibration, as detailed in Elliott. The cameras were set at their maximum resolution of 1280 × 800 pixels, with an f-stop (aperture) of f/22 and a shutter speed of 1/8 s. The checkerboard (8 × 8 squares, each measuring 50 × 50 mm) was set in 50 orientations for each stereo calibration to maximise image coverage [28, 35].
Checkerboard images were passed to Bouguet's calibration Toolbox in MATLAB to obtain the intrinsic [focal length (fx, fy), principal point (cx, cy), and lens distortion] and extrinsic (rotation and translation in a common reference frame) camera parameters. Based on the findings of Elliott, the intrinsic parameters were estimated using a 4th order radial distortion model without the tangential component, and they were not recomputed when estimating the extrinsic parameters. The control points on the orthogonal calibration object were manually digitised ten times each (with short breaks to limit any learning effect), with the mean values passed to the Toolbox along with the corresponding world coordinates from the laser scan. The Toolbox used the control point coordinates from the manual digitisation and the laser scan, along with the intrinsic parameters, to compute the relative pose of the master camera with respect to the origin (control point 1), in a common reference frame with the racket. Averaged across all control points, the mean standard deviation from manual digitisation was 0.1 pixels, which equates to a relative calibrated camera pose error of less than 1 mm in translation and below 0.1° in rotation.
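The intrinsic model used above (focal lengths, principal point, and a 4th order radial distortion with no tangential term) amounts to the following forward projection. This is a Python sketch of the standard Bouguet-style model, not code from the study: the focal length and principal point values are of the magnitude reported for the master camera, while the distortion coefficients k1 and k2 are illustrative assumptions.

```python
# Pinhole projection with 4th-order radial distortion (no tangential term),
# following the convention of Bouguet's toolbox. Distortion coefficients
# k1, k2 below are assumed values for illustration only.

def project(point_cam, fx, fy, cx, cy, k1, k2):
    """Project a 3D point in camera coordinates to distorted pixel coords."""
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z                # normalised image coordinates
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2  # radial factor: 1 + k1*r^2 + k2*r^4
    u = fx * (x * d) + cx
    v = fy * (y * d) + cy
    return u, v

# A point 2 m in front of the camera, slightly off the optical axis.
u, v = project((0.1, 0.05, 2.0),
               fx=1804.3, fy=1802.2, cx=642.64, cy=440.76,
               k1=-0.1, k2=0.05)
```

With a negative k1 (barrel distortion) the projected point lands slightly closer to the principal point than the undistorted pinhole model would place it.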
Using the extrinsic parameters from the stereo calibrations, rigid body transformations were applied to obtain the slave cameras in a common reference frame with the master and racket. The calibration with the orthogonal object was performed once, as it remained stationary along with the master camera, reducing uncertainty from manual digitisation in comparison to digitising the control points in an image from each slave camera position. Using an orthogonal object improved camera pose accuracy, in comparison to simply using the Perspex board or racket as a planar calibration object, as detailed by Elliott. MATLAB's image processing toolbox was used to perform thresholding to digitally extract racket silhouettes from the slave camera's images and to segment polygonal silhouette boundaries [37, 38, 39, 40, 41]. The extracted boundary was plotted on the original image and its quality assessed visually. The threshold value was manually adjusted to ensure the extracted boundary provided an accurate representation of the racket in the original image.
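The rigid body transformation step above can be sketched as a composition of two world-to-camera transforms. Given the master's pose from the single world calibration (R_wm, t_wm) and the stereo extrinsics from master to slave (R_ms, t_ms), the slave's pose in the common frame is R_ws = R_ms R_wm and t_ws = R_ms t_wm + t_ms. The rotation angles and translations below are illustrative, not the study's calibration values; the sketch is in Python rather than MATLAB.

```python
# Chaining rigid-body transforms to place a slave camera in the common
# (world) reference frame. All poses here are invented for illustration.
import numpy as np

def rot_y(deg):
    """Rotation matrix about the Y axis."""
    a = np.radians(deg)
    return np.array([[ np.cos(a), 0.0, np.sin(a)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(a), 0.0, np.cos(a)]])

def compose(R_ab, t_ab, R_bc, t_bc):
    """Collapse x_c = R_bc(R_ab x_a + t_ab) + t_bc into one transform."""
    return R_bc @ R_ab, R_bc @ t_ab + t_bc

R_wm, t_wm = rot_y(10.0), np.array([0.1, 0.0, 2.0])   # world -> master
R_ms, t_ms = rot_y(20.0), np.array([0.5, 0.2, 0.3])   # master -> slave
R_ws, t_ws = compose(R_wm, t_wm, R_ms, t_ms)          # world -> slave

# A world point must map identically via the chain and the composed pose.
p_w = np.array([0.3, 1.2, -0.4])
p_s_chain = R_ms @ (R_wm @ p_w + t_wm) + t_ms
p_s_direct = R_ws @ p_w + t_ws
```

Because only one world calibration is needed, each new slave position requires just a stereo calibration against the stationary master, which is the uncertainty reduction the paragraph describes.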
2.2 Estimating racket position from candidate relative camera poses
Each silhouette was removed from the calibrated set and its camera pose was estimated using an initial candidate relative pose. Estimates were then compared with a criterion, which was the pose of the camera (obtained from calibration) removed from the calibrated set. Tests with computer generated camera poses confirmed that reducing the number of silhouette views in the calibrated set by one did not influence the results.
A maximum of 100 candidate relative poses were used, and searches were terminated when the root-mean-squared (RMS) value of the ETE vector reached a threshold of 0.5 pixels. A threshold was required because the RMS ETE would not converge to zero, due to inherent inconsistency in the set arising from small errors associated with calibration and silhouette extraction. If the threshold was not reached for any of the candidate relative poses, the solution corresponding to the lowest RMS ETE was used. A threshold of 1 pixel did not always allow the optimisation to fully converge, and reducing the value below 0.5 pixels did not affect the solution.
Each camera pose estimate obtained with the view fitting techniques was used to reconstruct 106 3D coordinates on the racket face plane, using a camera-plane model [42, 43]. The reconstructed coordinates were compared with corresponding points on the racket frame surface obtained from the laser scan (criterion). As the coordinates extracted from the laser scan were not on the racket face plane, stereo triangulation was used for the reconstruction. Pixel projections of the coordinates were obtained using the calibration parameters, allowing for triangulation using the master camera (criterion) and each slave camera pose estimate from the view fitting techniques. The ability of the view fitting method to accurately reconstruct these coordinates was taken as the measure of how well racket position could be estimated.
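The two-view reconstruction used for validation can be sketched with linear (DLT) triangulation: each pixel observation contributes two homogeneous constraints on the 3D point, and the stacked system is solved by SVD. The camera matrices and point below are illustrative, not the study's calibration; the sketch is in Python rather than MATLAB.

```python
# Linear (DLT) two-view triangulation, as used conceptually to reconstruct
# points from the master camera and a slave pose estimate. The intrinsics
# and baseline below are invented for illustration.
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Triangulate one 3D point from two projections P1, P2 (3x4 matrices).
    Each pixel coordinate contributes a row u*(p3.X) - (p1.X) = 0, etc."""
    A = np.vstack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                      # de-homogenise

def project(P, X):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

K = np.array([[1800.0,    0.0, 640.0],
              [   0.0, 1800.0, 400.0],
              [   0.0,    0.0,   1.0]])
# Master camera at the origin; second camera offset 0.5 m along X.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
X_rec = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noiseless projections the reconstruction is exact to machine precision; in practice, calibration and digitisation errors in P1, P2, and the pixel coordinates set the reconstruction accuracy reported in the paper.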
2.3 Proof of concept for application to tennis
To simulate motion during a simplified serve, the racket model was rotated about an axis 10.16 cm (4 inches) from the butt, aligned with the global Y axis. This is the location of the axis of rotation used to define the swing weight of a racket. The racket was rotated about the Y axis between −40° and 30° in 2° increments, with 0° corresponding to upright. Silhouette images of the racket model were rendered every 2°; to capture images at this angular spacing for typical racket head speeds during a serve [18, 45, 46, 47, 48], a high-speed camera would need to operate at 200 frames per second (fps), so that sufficient silhouette images could be obtained. The algorithm was instructed to perform two optimisations: the first worked backwards from the racket oriented at 0° to −40°; the second worked forwards from 0° to 30°. An orientation of 0° was used as the starting point because this position was found to provide a more accurate pose initialisation. Thus, for the first optimisation, with the racket oriented at 0°, the candidate relative pose was obtained using the method described in Sect. 2.2. This scenario requires the operator to provide the algorithm with an initial approximate distance between the camera and the racket; for example, 14 m should be sufficient for baseline shots (Fig. 3). Subsequent optimisations were then initialised using the camera pose estimate from the previous solution. The 3D racket positions were obtained by using the camera pose estimates to reconstruct the 130 coordinates on the face plane in the Y, Z, and resultant dimensions for each angle. Reconstruction results were validated against known 3D coordinates obtained from the racket model mesh.
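Generating the simulated sweep amounts to rotating the racket-tip position about a Y-parallel axis through a pivot 10.16 cm from the butt, for every angle from −40° to 30° in 2° steps. The sketch below assumes a typical full-size racket length of 0.686 m (27 in), which is not taken from the paper's model, and is in Python rather than the Blender scripting actually used.

```python
# Sketch of the simulated serve sweep: the racket tip rotates about a
# Y-parallel axis through a pivot 10.16 cm (4 in) from the butt.
# RACKET_LENGTH is an assumed typical value, not the study's model.
import math

RACKET_LENGTH = 0.686           # m, assumed full-size racket length
PIVOT_OFFSET = 0.1016           # m, rotation axis 4 in from the butt

def rotate_about_y(point, pivot, deg):
    """Rotate a 3D point about a Y-parallel axis through `pivot`."""
    a = math.radians(deg)
    x, z = point[0] - pivot[0], point[2] - pivot[2]
    return (pivot[0] + x * math.cos(a) + z * math.sin(a),
            point[1],
            pivot[2] - x * math.sin(a) + z * math.cos(a))

pivot = (0.0, 0.0, 0.0)
r = RACKET_LENGTH - PIVOT_OFFSET      # tip's radius about the axis
tip_upright = (0.0, 0.0, r)           # tip at 0 deg, in the rotation plane

angles = list(range(-40, 31, 2))      # 36 orientations, 0 deg = upright
sweep = [rotate_about_y(tip_upright, pivot, a) for a in angles]
```

Each element of `sweep` corresponds to one rendered silhouette frame; the tip stays at a fixed radius from the axis, so consecutive frames differ only by the 2° arc.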
Mean ± standard deviation for the intrinsic parameters (fx, fy, cx, and cy) (pixels) for the master and slave cameras over the 21 stereo calibrations

Parameter   Master camera     Slave camera
fx          1804.30 ± 1.72    1781.60 ± 1.30
fy          1802.20 ± 2.31    1779.50 ± 1.26
cx          642.64 ± 0.75     651.53 ± 0.72
cy          440.76 ± 0.84     399.30 ± 0.93
Single view silhouette fitting techniques were able to accurately reconstruct points on the surface of a racket frame, with a mean reconstruction error of <1.5 mm in each of the three principal directions, equating to a mean resultant reconstruction error of ±2 mm. Unlike previous markerless motion capture methods applied to tennis [18, 20], the work presented here lays the foundations for a method which requires an accurate calibrated set rather than a Visual Hull. An RMS ETE value of 0.41 pixels was measured for the calibrated set, consistent with values reported by other authors for different objects and calibration methodologies [41, 49]. The consistency of a calibrated set depends on the accuracy of camera calibration and the quality of silhouette extraction. Future work will, therefore, look to improve the calibration process and silhouette extraction techniques to increase the accuracy of racket pose estimates.
The largest resultant reconstruction errors were found for views from the front (4.63, 0.66, and 2.46 mm in the X, Y, and Z directions, respectively, for camera pose estimate 11, see Fig. 5c) and side (2.96, 1.28, and 4.05 mm in the X, Y, and Z directions, respectively, for camera pose estimate 1, see Fig. 5c) of the racket, which may be due to calibration accuracy and the range of poses in the calibrated set. Using stereo calibration methods, the range of views in the calibrated set was constrained by the position of the master camera, as it was not possible to achieve adequate checkerboard coverage when the convergence angle between the two cameras was large (>~60°). The convergence angle was low (~10°) when obtaining frontal silhouette views, which can result in inaccurate estimations of depth in stereo calibrations [35, 50]. Estimating the pose of a racket from a frontal silhouette view is particularly challenging due to its reflective symmetry, which can be accounted for by having a wide range of views in the calibrated set. While future work could focus on how best to position the master camera to achieve a consistent calibrated set with a wider range of views, the best option may be to merge a number of sets [21, 22, 41, 49] produced with the master in different locations. Simulations utilising computer generated views could assist in determining the most suitable camera poses in the calibrated set, as per Elliott, taking into consideration the practicalities of physically reproducing the setup.
Choppin, Goodwill, and Haake reported an accuracy of ±2.5 mm when reconstructing marker positions on a tennis racket using two stereo-calibrated video cameras. In relation to a tennis stroke, these reconstruction errors corresponded to a mean angular error of ±1° and a velocity error of ±0.5 m s−1 for elite tennis players measured during practice. While the initial results presented here for the silhouette fitting techniques are promising, it is not possible to compare them directly with the values reported by Choppin and colleagues. They reconstructed markers on the racket in a much less constrained scenario, during a practice session at the Wimbledon qualifying tournament, with the racket up to 14 m from the camera. In the current study, reconstruction was performed in the laboratory with the camera ~1.5 m from the racket, and position estimation uncertainty may increase as the camera moves further away.
To trial the method before application to real-play conditions, the 3D position of a full size model racket was estimated from computer generated silhouette views in Blender (v2.70), captured from a camera distance of 14 m. On average, 3D racket position was reconstructed to within 1.96 ± 0.14 mm during a simplified simulated serve, comparable with measurements obtained in the laboratory, although the computer generated set contained no calibration error. All reconstruction errors were lower than the accuracy criterion of 15 ± 10 mm achieved by the markerless method developed by Corazza et al., which was used by Sheets et al. and Abrams et al. to measure tennis serve kinematics in practice conditions.
Previous studies have reported racket head velocities of 34.8, 38.6, 43.2, 46, and 26.1 m s−1 during a flat serve. The lowest of these velocities corresponds to tracking the centre of volume rather than the tip of the racket. A spacing of 0.005 s between frames, coupled with special data smoothing procedures [48, 51], is often used to track racket velocity around impact [18, 45, 46]. In the current study, the tip of the racket model was displaced by a resultant distance of 4 mm between each 2° rotation about the axis 10.16 cm from the butt (Fig. 6). This level of accuracy can be expected to translate to a real serve, with a racket tip velocity ranging between 35 and 46 m s−1, captured with a camera operating at 200 fps or more. The results of the current study regarding application of a view fitting method to estimate 3D racket position are, however, limited to a simplified simulated serve movement, computer generated data, and silhouettes which were not extracted from noisy backgrounds in the field.
As robust silhouette extraction is important for obtaining accurate pose estimates with single view fitting techniques, images should be obtained at the highest possible resolution when generating a calibrated set [34, 41]. In the current study, racket silhouette images for the calibrated set were obtained using a white background to simplify boundary extraction. A robust method for digitally extracting racket silhouettes during tennis strokes is now required for view fitting against a calibrated set generated in a controlled environment. Extracting racket silhouettes from a tennis stroke poses a particular challenge due to occlusion of the handle by the hand and the noisy background.
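The controlled-background extraction used for the calibrated set can be sketched as a global grey-level threshold followed by boundary identification. This toy Python example (the study used MATLAB's image processing toolbox) segments a dark object on a light backdrop and takes the boundary as foreground pixels with at least one background 4-neighbour; the image and threshold are invented for illustration.

```python
# Minimal threshold-based silhouette extraction on a toy grey-level image:
# a dark "racket" on a bright backdrop, as in the laboratory setup.

def extract_silhouette(image, threshold):
    """Binary mask: True where the pixel is darker than `threshold`."""
    return [[pix < threshold for pix in row] for row in image]

def boundary_pixels(mask):
    """Foreground pixels touching the background (4-connectivity)."""
    h, w = len(mask), len(mask[0])
    edge = []
    for i in range(h):
        for j in range(w):
            if not mask[i][j]:
                continue
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if ni < 0 or ni >= h or nj < 0 or nj >= w or not mask[ni][nj]:
                    edge.append((i, j))
                    break
    return edge

# 5x5 toy image: a dark (20) 3x3 square on a bright (230) background.
img = [[230] * 5 for _ in range(5)]
for i in range(1, 4):
    for j in range(1, 4):
        img[i][j] = 20

mask = extract_silhouette(img, threshold=128)
edge = boundary_pixels(mask)
```

On-court footage would defeat a single global threshold, which is why the paragraph above calls for a more robust extraction method for play conditions.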
A markerless method capable of accurately estimating 3D racket position with one camera under controlled conditions has been presented. Development of a calibration method provided the relative pose of a camera with respect to a racket, which was used to create a laboratory-based calibrated silhouette set. The set consisted of 21 camera poses in a hemispherical configuration, and its inconsistency was less than 0.1% of the mean length of the racket in a silhouette image. Using this set, mean racket position was reconstructed to within ±2 mm. Tests with computer generated camera poses and silhouette views allowed 3D racket position to be estimated during a simplified serve scenario. From a camera distance of 14 m, 3D racket position was estimated with a spatial accuracy of 1.9 ± 0.14 mm. Further work will focus on developing the techniques presented here to measure 3D racket movements during play conditions. Thereafter, combining them with single camera ball tracking software could provide a useful tool for tracking racket motion, for performance evaluations in research and for application with coaches and players.
- 1. Haake SJ, Allen TB, Choppin S, Goodwill SR (2007) The evolution of the tennis racket and its effect on serve speed. In: Miller S, Capel-Davies J (eds) Tennis science and technology 3. International Tennis Federation, London, pp 257–271
- 2. ITF (2016) 2016 ITF rules of tennis. http://www.itftennis.com/officiating/rulebooks/rules-of-tennis.aspx. Accessed 22 June 2016
- 8. Knudson D, Bahamonde R (1999) Trunk and racket kinematics at impact in the open and square stance tennis forehand. Biol Sport 16(1):3–10
- 10. Pingali GS, Jean Y, Carlbom I (1998) Real time tracking for enhanced tennis broadcasts. In: Computer vision and pattern recognition, pp 260–265
- 11. Yan F, Christmas W, Kittler J (2005) A tennis ball tracking algorithm for automatic annotation of a tennis match. In: British machine vision conference, pp 619–628
- 12. Kelley J, Choppin SB, Goodwill SR, Haake SJ (2010) Validation of a live, automatic ball velocity and spin rate finder in tennis. In: Sabo A, Kafka P, Litzenberger S, Sabo C (eds) The engineering of sport 8. Procedia Eng 2(2):2967–2972
- 22. Forbes K, Voigt A, Bodika N (2003) Using silhouette consistency constraints to build 3D models. In: Proceedings of the fourteenth annual symposium of the Pattern Recognition Association of South Africa (PRASA)
- 23. Wong K-YK (2001) Structure and motion from silhouettes. PhD thesis, University of Cambridge
- 24. Bradski G, Kaehler A (2008) Learning OpenCV: computer vision with the OpenCV library. O'Reilly Media, San Mateo
- 25. Elliott N, Choppin S, Goodwill S, Allen T (2014) Markerless tracking of tennis racket motion using a camera. In: James D, Choppin S, Allen T, Wheat J, Fleming P (eds) The engineering of sport 10. Procedia Eng 72:344–349
- 27. Tavares JMR, Azevedo TC, Vaz MA (2008) 3D reconstruction of external anatomical structures using computer vision. In: Proceedings of the 8th international symposium on computer methods in biomechanics and biomedical engineering (CMBBE 2008), Porto, Portugal
- 28. Zhang Z (1999) Flexible camera calibration by viewing a plane from unknown orientations. In: Proceedings of the 7th international conference on computer vision, Corfu, Greece, pp 666–673
- 30. Choppin S, Goodwill S, Haake S (2010) Investigations into the effect of grip tightness on off-centre forehand strikes in tennis. Proc Inst Mech Eng Part P J Sports Eng Technol 224(4):249–257
- 32. MathWorks (2014) MATLAB 2014b. http://uk.mathworks.com/help/optim/ug/lsqnonlin.html. Accessed 29 June 2016
- 33. Blender v2.70 (2014). https://www.blender.org. Accessed 30 June 2016
- 34. Elliott N (2015) Camera calibration and configuration for estimation of tennis racket position in 3D. PhD thesis, Sheffield Hallam University
- 35. Kelley J (2014) A camera calibration method for a hammer throw analysis tool. In: James D, Choppin S, Allen T, Wheat J, Fleming P (eds) The engineering of sport 10. Procedia Eng 72:74–79
- 36. Bouguet JY (2010) Camera calibration toolbox for MATLAB. http://www.vision.caltech.edu/bouguetj/calib_doc/. Accessed 10 Nov 2011
- 37. Matusik W, Buehler C, McMillan L (2001) Polyhedral visual hulls for real-time rendering. In: Proceedings of the 12th Eurographics workshop on rendering, June 2001, pp 115–125
- 38. Lazebnik S (2002) Projective visual hulls. Master's thesis, University of Illinois at Urbana-Champaign
- 39. Lazebnik S, Boyer E, Ponce J (2001) On computing exact visual hulls of solids bounded by smooth surfaces. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 1:1–156
- 41. Forbes K (2007) Calibration, recognition, and shape from silhouettes of stones. PhD thesis, University of Cape Town
- 42. Dunn M, Wheat J, Miller S, Haake S (2012) Reconstructing 2D planar coordinates using linear and non-linear techniques. In: Proceedings of the 30th international conference on biomechanics in sports, pp 381–383
- 43. Dunn M (2014) Video-based step measurement in sport and daily living. PhD thesis, Sheffield Hallam University
- 50. Choppin S (2008) Modelling of tennis racket impacts in 3D using elite players. PhD thesis, University of Sheffield