
1 Introduction

It has become quite usual for us to use multiple information devices, such as mobile or tablet PCs, smartphones, and PDAs, at the same time. Aiming to take full advantage of those devices, many previous works have proposed various styles of cooperation among them, such as transmission and sharing of selected data among the devices [1–4] and operation of the contents displayed on their screens [5, 6]. Some other previous works propose to coordinate the screens so that they share the role of displaying contents that are, for example, larger than each individual screen.

In order to coordinate several screens with each other for this purpose, we need to measure their arrangement in advance. The previous works described above obtain a tiled 2D arrangement of the screens based on their adjacency, which is detected by physical contact between the screens using sensing hardware equipped on their edges [7, 8]. However, as we experience when manually setting the screen arrangement of a PC and its external display, we often prefer a wider variety of arrangements, such as screens placed slightly apart or in contact with each other only at their corners. It is also useful to display a 3D virtual space by specifying a 3D screen arrangement in which we are surrounded by the screens.

In this article, we discuss how to measure such various arrangements of multiple screens. Since recent information devices are usually equipped with cameras, we measure the screen arrangement of the devices using images of the user captured by those cameras. In the field of computer vision, the 3D geometric arrangement of multiple cameras is conventionally measured by so-called strong camera calibration, in which the same set of markers, whose 3D positions are already known, is observed by each camera. We employ the feature points on the user’s face as those markers. When a user is operating several information devices, the user keeps gazing at their screens, and thus his/her face can be observed from the camera on each device. The facial feature points of the user appearing in a camera image can be extracted by facial image processing, although the extraction may sometimes fail depending on how the points appear in the image. The positions of the facial feature points on the face are approximately known in advance because they are similar across individuals, although those positions include a certain amount of personal difference.

Considering these properties of facial feature points, we discuss how to measure the arrangement of information devices with their cameras from images of the user’s face. In the discussion, we also cope with failures in extracting facial feature points by introducing the continuity of the change in the arrangement of the devices as a geometric constraint.

2 Measuring Arrangement of Devices

Measuring Geometric Arrangement of Each Device and the User’s Face. As described above, we measure the arrangement of the portable information devices operated by the same user at the same time, at each moment of their operation, by employing the feature points of the user’s face as the markers for strong calibration of the cameras on the devices. The 3D position of the k-th feature point on the face is denoted by \( \varvec{p}_{k} \;\left( {k = 1, \cdots ,K} \right) \), where \( K \) is the number of the feature points that we employ for the calibration. These 3D positions of the facial feature points are represented in the face-centered coordinate system, with its origin at the center between the two eyes, the x axis passing through the eyes from left to right on the face, the y axis directed downward on the face, and the z axis directed forward from the face. The 2D position where the k-th facial feature point appears in the image captured by the camera on the i-th information device is denoted by \( \varvec{q}_{k}^{i} \;\left( {i = 1, \cdots ,N} \right) \), where \( N \) is the number of the devices. This 2D position is represented in the camera-centered coordinate system, with its origin at the optical center of the camera, the x and y axes set rightward and upward parallel to the image plane, and the z axis set forward along the optical axis of the camera. The geometric relation between the face-centered coordinate system and the camera-centered coordinate systems of the i-th and j-th devices, in the situation where the user is gazing at the screens of those devices, is illustrated in Fig. 1.

Fig. 1. Geometric relation among the cameras on different devices and the user’s face

The geometric relation between \( \varvec{p}_{k} \) and \( \varvec{q}_{k}^{i} \) can be described as follows:

$$ \varvec{q}_{k}^{i} = \lambda A_{i} E_{i} \,\varvec{p}_{k} $$
(1)

where \( A_{i} \) is the matrix of the internal parameters of the camera on the i-th device. This matrix represents the optical projection of each point in the 3D space onto the 2D image plane of the camera, with \( \lambda \) being a scale factor. The matrix \( E_{i} \) contains the external camera parameters, which consist of the rotation matrix \( R_{i} \) and the translation vector \( \varvec{t}_{i} \) representing the orientation and the position of the user’s face relative to the camera on the i-th device, as follows:

$$ E_{i} = \left[ {\begin{array}{*{20}c} {R_{i} } & {\varvec{t}_{i} } \\ {\varvec{0}^{ \top } } & 1 \\ \end{array} } \right] $$
(2)
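For concreteness, the following is a minimal sketch of the projection in Eq. (1) written in homogeneous coordinates with NumPy; the internal parameters, the pose, and the feature point position are illustrative placeholder values, not calibration results from our setting.

```python
import numpy as np

# Hypothetical external parameters E_i: orientation R_i and position t_i of the
# user's face relative to the camera on the i-th device (Eq. (2)).
R_i = np.eye(3)                        # face assumed to squarely face the camera
t_i = np.array([0.0, 0.0, 400.0])      # face about 400 mm in front of the camera
E_i = np.eye(4)
E_i[:3, :3] = R_i
E_i[:3, 3] = t_i

# Hypothetical internal parameters A_i (focal lengths and principal point, in pixels).
A_i = np.array([[800.0,   0.0, 320.0],
                [  0.0, 800.0, 240.0],
                [  0.0,   0.0,   1.0]])

# A facial feature point p_k in the face-centered coordinate system (homogeneous, mm).
p_k = np.array([30.0, -5.0, 10.0, 1.0])

cam = (E_i @ p_k)[:3]                  # the point expressed in camera coordinates
q_h = A_i @ cam                        # projection up to the scale factor lambda
q_k_i = q_h[:2] / q_h[2]               # 2D position q_k^i in the image (pixels)
print(q_k_i)
```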

In Eq. (1), \( \varvec{q}_{k}^{i} \) is given as the 2D position of the k-th facial feature point extracted by facial image processing from the image obtained by the camera on the i-th device, whereas \( \varvec{p}_{k} \) can be specified from the 3D position of the k-th facial feature point of a standard human face if we ignore personal differences. The matrix \( A_{i} \) of the camera on the i-th device can be obtained in advance by internal camera calibration using ordinary markers.

Since \( E_{i} \) includes 12 variables, this matrix can be determined by solving the equations above if \( \varvec{p}_{k} \) and \( \varvec{q}_{k}^{i} \) are obtained for at least six of the \( K \) facial feature points. When the user is gazing at the screen of the i-th device, \( \varvec{q}_{k}^{i} \) is expected to be obtained for a sufficient number of facial feature points from his/her facial image captured by the camera on the device. Thus, by solving Eq. (1) for all the variables in \( E_{i} \), we can measure the geometric arrangement of the user’s face and the camera on the i-th device. Since the camera installed on an information device is usually fixed around the edge of the screen with its optical axis perpendicular to the screen, we can further estimate the geometric arrangement of the user’s face and the screen of the device by measuring the position of the camera relative to the screen in advance.
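One practical way to solve Eq. (1) for \( E_{i} \) is a standard perspective-n-point solver such as OpenCV’s solvePnP; this particular solver is our suggestion rather than a detail stated in the paper. In the sketch below, the six 3D points are illustrative placeholders (not the standard-face coordinates used in the experiments), and the observed 2D points are synthesized from a known pose purely so that the example runs stand-alone.

```python
import cv2
import numpy as np

# Six illustrative 3D facial feature points p_k in the face-centered coordinate
# system (mm); placeholders, not the standard-face data cited in the paper.
p_k = np.array([[-45, 0, 0], [-15, 0, 5], [15, 0, 5], [45, 0, 0],
                [-20, 35, 15], [20, 35, 15]], dtype=np.float64)

A_i = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)  # internal parameters
dist = np.zeros(5)                    # lens distortion assumed negligible here

# Synthesize the observed 2D points q_k^i from a known pose; in the actual
# procedure they come from facial image processing on the camera image.
rvec_true = np.array([0.1, -0.2, 0.05])
tvec_true = np.array([10.0, -5.0, 400.0])
q_k_i, _ = cv2.projectPoints(p_k, rvec_true, tvec_true, A_i, dist)

# Solve Eq. (1) for the external parameters of the camera on the i-th device.
ok, rvec, tvec = cv2.solvePnP(p_k, q_k_i, A_i, dist)
R_i, _ = cv2.Rodrigues(rvec)          # rotation matrix R_i from the rotation vector
E_i = np.eye(4)
E_i[:3, :3] = R_i                     # assemble E_i = [R_i t_i; 0 1]
E_i[:3, 3] = tvec.ravel()
```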

After \( E_{1} , \cdots , E_{N} \) for all the cameras on all the devices are obtained, the geometric arrangement of the i-th device and the j-th device \( \left( {i \ne j} \right) \) can be calculated from \( E_{i} \) and \( E_{j} \) as follows:

$$ E_{ij} = E_{j} E_{i}^{ - 1} $$
(3)

where \( E_{ij} \) is the matrix that represents the orientation and the position of the camera on the i-th device relative to that on the j-th device \( \left( {j = 1, \cdots ,N; \;j \ne i} \right) \).
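A minimal sketch of Eq. (3) follows, assuming \( E_{i} \) and \( E_{j} \) are available as the 4 × 4 matrices estimated above; the function name is ours.

```python
import numpy as np

def relative_arrangement(E_i: np.ndarray, E_j: np.ndarray) -> np.ndarray:
    """Eq. (3): pose of the camera on device i relative to the camera on device j."""
    return E_j @ np.linalg.inv(E_i)

# The relative rotation and translation between the two devices can then be read
# off as R_ij = E_ij[:3, :3] and t_ij = E_ij[:3, 3].
```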

Continuity in the Geometric Arrangements. When a user is operating multiple information devices, the user is usually not gazing at the screens of all the devices at the same time, but only at one of those screens, depending on his/her interest. In that situation, facial feature point extraction succeeds for the image obtained by the camera on the device the user is gazing at, because his/her face is captured by that camera from nearly straight ahead, whereas extraction often fails for the images obtained by the cameras on the devices that are not currently being gazed at, depending on the orientation of the face toward those cameras. Failure in extracting facial feature points is also caused by occlusion of those points due to the user’s unconscious behavior, such as putting a hand over the mouth.

Since the geometric arrangements of different devices are obtained indirectly from the arrangement of each device and the user’s face, the arrangement information related to a device is completely lost at the moments when facial feature point extraction fails for the image from its camera during the user’s operation of the devices. Moreover, even when the geometric arrangement of a device and the user’s face is estimated after successful facial feature point extraction from the camera image, the estimated arrangement inevitably includes a certain amount of error.

In order to cope with the problems above, we introduce continuity into the estimated arrangement. The geometric arrangement of the devices and the user’s face does not change drastically as long as the user keeps operating those devices at the same time in a similar manner. Based on this observation, we estimate the geometric arrangement under the constraint that the difference between the estimated arrangements at adjacent moments should be as small as possible. This constraint smooths the geometric arrangements estimated for each device and the user at the moments when those arrangements can be measured after successful feature point extraction, and extrapolates the arrangements at the moments when they cannot be measured due to failure in facial feature point extraction by simply duplicating the arrangements estimated at the previous moments.

Let us denote \( E_{i} \) at a moment \( \tau \) of the user’s operation of the multiple devices by \( E_{i} \left( \tau \right) \). When the facial feature points are successfully extracted from the image obtained by the camera on the i-th device, \( E_{i} \left( \tau \right) \) can be measured directly by the procedure described above, although it includes a certain amount of error. When those facial feature points cannot be extracted from the camera image, \( E_{i} \left( \tau \right) \) cannot be measured at all. We therefore estimate the correct value of \( E_{i} \left( \tau \right) \) regardless of whether it can be measured, and denote the estimate by \( \hat{E}_{i} \left( \tau \right) \). In order to make \( \hat{E}_{i} \left( \tau \right) \) coincide with \( E_{i} \left( \tau \right) \) when it is measured from the camera image, and to keep \( \hat{E}_{i} \left( \tau \right) \) close to the estimate \( \hat{E}_{i} \left( {\tau - 1} \right) \) at the previous moment \( \tau - 1 \), \( \hat{E}_{i} \left( \tau \right) \) is determined so that the following function \( {\mathcal{E}} \) is minimized at each moment \( \tau \):

$$ {\mathcal{E}}\left( {\hat{E}_{i} \left( \tau \right)} \right) = \mathop \sum \limits_{\tau } \left\{ {f_{i} \left( \tau \right)\, \left\| {\hat{E}_{i} \left( \tau \right) - E_{i} \left( \tau \right)} \right\|_{F}^{2} + \left\| {\hat{E}_{i} \left( \tau \right) - \hat{E}_{i} \left( {\tau - 1} \right)} \right\|_{F}^{2} } \right\} $$
(4)

where \( f_{i} \left( \tau \right) \) is a variable that takes the value 1 at a moment \( \tau \) when facial feature point extraction succeeds for the camera image of the i-th device, and 0 otherwise. This function evaluates the differences of \( \hat{E}_{i} \left( \tau \right) \) from \( E_{i} \left( \tau \right) \) and from \( \hat{E}_{i} \left( {\tau - 1} \right) \) using the Frobenius norm, denoted by \( \left\| \cdot \right\|_{F} \).
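Read per moment, Eq. (4) admits a simple closed-form update: when extraction fails (\( f_{i} \left( \tau \right) = 0 \)) the previous estimate is duplicated, and when it succeeds (\( f_{i} \left( \tau \right) = 1 \)) the measurement and the previous estimate are averaged. The sketch below implements this reading; it is our interpretation of the objective, not code from the paper.

```python
from typing import Optional
import numpy as np

def update_arrangement(E_hat_prev: np.ndarray,
                       E_measured: Optional[np.ndarray],
                       extraction_ok: bool) -> np.ndarray:
    """One step of minimizing Eq. (4) at moment tau for a single device i."""
    if not extraction_ok or E_measured is None:
        # f_i(tau) = 0: extrapolate by duplicating the previous estimate.
        return E_hat_prev.copy()
    # f_i(tau) = 1: minimizing ||E_hat - E||_F^2 + ||E_hat - E_hat_prev||_F^2
    # gives the element-wise average of the measurement and the previous estimate.
    return 0.5 * (E_measured + E_hat_prev)
```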

3 Experimental Results

Comparing Results from Facial Feature Points and Corners of a Grid. We implemented the procedure described above, employing OKAO Vision of OMRON Corporation for facial image processing and OpenCV for the other image processing, including camera calibration.

We first compared the results of estimating the geometric arrangement between a camera and the user’s face from facial feature points with those obtained from the markers used in the traditional camera calibration method. We employed six facial feature points, namely the inner and outer corners of the eyes and the wings of the nose, because they are extracted comparatively stably by facial image processing and are distributed widely over the face. We specified the positions of these facial feature points in the face-centered coordinate system based on the data of a standard human face [10], neglecting personal differences. As the markers for the traditional camera calibration method used for comparison, we employed six corners around the center of a grid pattern on a checkerboard.

The results are shown in Fig. 2, where virtual 3D objects are drawn on the face and the checkerboard using the estimated arrangements. Although the actual positions of the facial feature points in the face-centered coordinate system are neither identical to the positions specified from the standard human face nor located on a flat plane such as the checkerboard, the error in the appearance of the virtual object drawn from the arrangement estimated with those facial feature points is comparable to that obtained from the corners of the grid pattern. This result shows that the facial feature points are sufficiently usable for estimating the arrangement of an information device and its user’s face.
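As a rough illustration of how such a virtual object can be overlaid from an estimated arrangement, the sketch below projects a small virtual square onto an image with OpenCV; the pose, the camera parameters, and the blank image are placeholders standing in for the estimated arrangement and the captured frame.

```python
import cv2
import numpy as np

A_i = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)
dist = np.zeros(5)
rvec = np.array([0.1, -0.2, 0.05])     # estimated orientation of the face (placeholder)
tvec = np.array([10.0, -5.0, 400.0])   # estimated position of the face in mm (placeholder)

frame = np.zeros((480, 640, 3), np.uint8)                 # stand-in for the camera image
square = np.float32([[-20, -20, 0], [20, -20, 0],
                     [20, 20, 0], [-20, 20, 0]])          # a virtual 3D square on the face plane
pts, _ = cv2.projectPoints(square, rvec, tvec, A_i, dist)
pts = pts.reshape(-1, 2).astype(np.int32)
cv2.polylines(frame, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
```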

Fig. 2. Comparison between the results of camera calibration using the facial feature points and corners of a grid pattern.

Estimating the Arrangement of Two Devices and the User’s Face. To evaluate the error in estimating the geometric arrangement of the information devices and the user’s face by our method, we compared the estimated arrangement of the devices with their actual arrangement observed by a camera at the viewpoint of the user. In the experiment, the same user gazed at two tablet PCs, and a web camera was installed at the top of each PC. The user wore eyeglasses equipped with a camera at their bridge, as shown in Fig. 3, in order to obtain images of the devices from the viewpoint of the user. The thick black frame of the eyeglasses was covered with a skin-colored fabric tape so that the frame would not interfere with facial feature point extraction.

Fig. 3. Eyeglasses with a camera at their bridge

Figures 4(a) and (b) show the images obtained by the cameras on the two devices, where the points in each image are the facial feature points extracted by facial image processing. Figure 4(c) shows the appearances of the devices drawn from the viewpoint of the user, based on the geometric arrangement estimated by our method, overlaid on the image observed by the camera on the eyeglasses. The difference between the estimated arrangement and the actual arrangement seems acceptable for the purpose of coordinating the contents displayed on the screens of the devices.

Fig. 4. Resultant images without failure of facial feature point extraction due to occlusion

We also estimated the arrangement of the same devices in a situation where the facial feature points failed to be extracted. Figures 5(a) and (b) show the images obtained by the cameras on the two devices in this situation. Since the user placed a hand on the face, no facial feature point was extracted. Figure 5(c) shows the appearances of the devices drawn from the viewpoint of the user, based on the geometric arrangement estimated by our method in this situation. By extrapolating the geometric arrangement at this moment from that estimated at the previous moment, the appearances of the devices could still be obtained. Although the geometric arrangement of the devices and the user’s face changed slightly due to the motion of the user in placing the hand on the face, the amount of error between the estimated appearance and the actual one is similar to that in the result of Fig. 4(c). This result shows that the extrapolation of the arrangement in our method is effective when the motion of the user is small. However, a more sophisticated extrapolation method is needed to cope with a wider variety of situations involving failure of facial feature point extraction.

Fig. 5. Resultant images with occlusion of the face

4 Conclusions

We proposed a method to measure the 3D geometric arrangement of multiple portable information devices using facial images of the user captured by the cameras on the devices. In our method, facial feature points are extracted from the images taken by the cameras on the devices, and the geometric arrangement of each device and the user’s face is estimated using a camera calibration technique. To cope with the errors in estimating the geometric arrangement from facial feature points, as well as with failures in extracting those facial feature points, we introduced a process of smoothing and extrapolating the geometric arrangements estimated at different moments. From the experimental results, we confirmed that our method can estimate the arrangement of the devices and the user’s face with an amount of error acceptable for coordinating the contents displayed on the screens of those devices, even at moments when the facial feature points fail to be extracted due to occlusion, as long as the change in the arrangement is small.

As one of the future steps, we need to introduce more sophisticated methods for extrapolating the geometric arrangement of the devices and the user’s face, to cope with larger and more varied changes in the arrangement, by analyzing the patterns of change during the operation of multiple information devices by users. It would also be useful to employ data such as the orientations and accelerations of the devices, obtained by the various kinds of sensors usually installed in recent information devices, to further reduce the error in estimating the geometric arrangement.