1 Introduction

The wireless capsule endoscope (WCE) is a powerful device for acquiring images of the gastrointestinal (GI) tract for screening, diagnostic, and therapeutic endoscopic procedures [1]. In particular, the WCE can capture images of the small intestine, which current wired endoscopic devices cannot reach. In this paper, we introduce a method to recover the 3D structure of the scene from stereo images captured by a stereo-type WCE, shown in Fig. 1.

To perceive depth from endoscopic images, many researchers have applied various computer vision techniques such as stereo matching [4], shape-from-shading (SfS) [2, 13], shape-from-focus (SfF) [11], and structure-from-motion (SfM) [3]. Ciuti et al. [2] adopted the SfS technique because the positions of the light sources are known and shading is an important cue in endoscopic images. Visentini et al. [13] fused the SfS cue with image feature correspondences to estimate accurate dense disparity maps. Takeshita et al. [11] introduced an endoscopic device that estimates depth by using the SfF technique, which utilizes multiple images captured with different focus settings at the same camera position. Fan et al. [3] established sparse feature correspondences between consecutive images and then calculated the camera poses and the 3D structure of the scene by using the SfM technique; they generated 3D meshes through Delaunay triangulation of the triangulated feature points.

Fig. 1. Stereo-type wireless endoscopic capsule, wireless receiver, and captured images in the stomach and the small bowel (from left to right).

Stereo matching is also a well-known technique for estimating a depth map from images, and it can be divided into active and passive approaches [9]. We refer to structured-light-based stereo matching [10] as an active approach: a visible or IR pattern is projected onto the scene to facilitate correspondence search between the images. However, the active approach is not suitable for wireless endoscopy, mainly because of the limited resources, e.g., the battery capacity and the size of a capsule. Therefore, previous studies [4] focused on minimally invasive surgery rather than WCE-based GI examination. For the same reason, most commercial wireless endoscopic capsules adopt conventional passive image sensors.

To the best of our knowledge, commercially available WCE products are not capable of estimating depth information, and this is the first attempt to estimate the geometric structure of the scene inside the GI tract captured by a WCE. To achieve this goal, we designed a stereo-type WCE, shown in Fig. 1, without enlarging the diameter of the capsule. This sensor captures about 0.12 million images over the entire GI tract, from the stomach to the large bowel, as illustrated in Fig. 1. Given the captured stereo images, we estimate a fully dense depth map by using the direct attenuation model. Since there is no external light source except the ones attached to the capsule, farther objects appear darker than nearer ones in the captured image. Therefore, we exploit the attenuation of light to estimate depth maps, assuming that the medium inside the GI tract is homogeneous. We first employ the direct attenuation model to compute an up-to-scale depth map and then resolve the scale ambiguity by using sparse feature correspondences. Afterward, we utilize the rescaled depth map to guide a widely used stereo matching algorithm, semi-global matching (SGM) [6]. A detailed description of the proposed method is given in the following section.

2 Proposed Method

2.1 Capsule Specification

Our wireless endoscopic capsule consists of two cameras, four LED lights, a wireless transmitter, and a battery. The two cameras are separated by about 4 mm, the viewing angle is 170\(^\circ \), and the resolution of the captured images is 320\(\,\times \,\)320. The capsule captures three pairs of images per second; in total, it captures more than 115,000 images over eight hours in the GI tract. The four LED lights are attached around the cameras as shown in Fig. 1 and are synchronized with the cameras to minimize battery usage. Captured images are transmitted to the receiver because the capsule has no internal storage. The length of the capsule is 24 mm and its diameter is 11 mm.

Fig. 2. Sample input images and the depth maps computed by Eq. (5). Here, bright pixels indicate regions farther away than dark ones.

2.2 Depth Estimation with the Direct Attenuation Model

Since the captured image has poor visibility, the observed intensity at each pixel \(\mathbf {p}\) can be modeled as [5]

$$\begin{aligned} I(\mathbf {p}) = J(\mathbf {p})t(\mathbf {p}) + A(1 - t(\mathbf {p})), \end{aligned}$$
(1)

where J is the scene radiance, I is the observed intensity, t is the transmission map, and A is the atmospheric light. Since there is no source of natural illumination such as sunlight, A can be dropped from Eq. (1). Then, t can be defined as

$$\begin{aligned} t(\mathbf {p}) = I(\mathbf {p}) / J(\mathbf {p}). \end{aligned}$$
(2)

The transmission map can also be defined by Bouguer’s exponential law of attenuation [8],

$$\begin{aligned} t(\mathbf {p}) = \exp {(-\beta (\mathbf {p}) d(\mathbf {p}) )}, \end{aligned}$$
(3)

where the attenuation coefficient \(\beta (\mathbf {p})\) is typically represented as the sum of the absorption and scattering coefficients, \(\beta (\mathbf {p}) = \beta _\text {absorption}(\mathbf {p}) + \beta _\text {scatter}(\mathbf {p})\). By combining Eqs. (2) and (3), the depth of a pixel \(\mathbf {p}\) can be estimated as

$$\begin{aligned} d(\mathbf {p}) = \dfrac{\ln ( J(\mathbf {p}) ) - \ln (I(\mathbf {p}) ) }{ \beta (\mathbf {p}) } \approx \dfrac{\ln ( \bar{I} ) - \ln (I(\mathbf {p}) ) }{ \beta }. \end{aligned}$$
(4)

To simplify Eq. (4), we approximate the two terms \(J(\mathbf {p})\) and \(\beta (\mathbf {p})\) by considering characteristics of the GI tract. First, assuming that the GI tract is filled with a homogeneous medium such as water, the attenuation coefficient \(\beta (\mathbf {p})\) is approximated as a constant value for all pixels, \(\beta \approx \beta (\mathbf {p})\). Second, we approximate the scene radiance as the mean of all pixel values, \(J(\mathbf {p})\approx \bar{I}\), based on the assumption that most pixels have a similar color in a local GI region. With these two approximations, we easily obtain the depth map up to the scale factor \(\beta \),

$$\begin{aligned} d_\beta (\mathbf {p}) = \beta \ d(\mathbf {p}) = \ln ( \bar{I} ) - \ln (I(\mathbf {p})). \end{aligned}$$
(5)

Here, \(d_\beta (\mathbf {p})\) denotes the depth map up to the scale factor \(\beta \), which we estimate in the following section. Beforehand, we smooth \(d_\beta (\mathbf {p})\) with a well-known edge-preserving noise removal filter, the bilateral filter [12].
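For illustration, the following Python sketch computes the up-to-scale depth map of Eq. (5) from a single grayscale image and smooths it with a bilateral filter. The small offset used to avoid \(\ln (0)\) and the filter parameters are illustrative assumptions, not values reported in this paper.

```python
import numpy as np
import cv2

def depth_up_to_scale(gray, eps=1.0, d=9, sigma_color=25, sigma_space=9):
    """Up-to-scale depth from the direct attenuation model, Eq. (5).

    gray: single-channel image as a float-convertible array.
    eps avoids log(0); brighter (nearer) pixels receive smaller d_beta.
    """
    I = gray.astype(np.float32) + eps
    J_bar = I.mean()                        # scene radiance approximated by the mean intensity
    d_beta = np.log(J_bar) - np.log(I)      # Eq. (5): d_beta(p) = ln(J_bar) - ln(I(p))
    # Edge-preserving smoothing, as with the bilateral filter [12]
    d_beta = cv2.bilateralFilter(d_beta, d, sigma_color, sigma_space)
    return d_beta
```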

2.3 Resolving the Scale Ambiguity of \(d_\beta \)

To resolve the scale ambiguity of \(d_\beta (\mathbf {p})\), we compute \(\beta \) by using sparse feature correspondences. First, we detect and match corner points between the left and right images. Then, for each matched point \(\mathbf {p}\), we compute its depth \(d_s(\mathbf {p})\) by triangulation,

$$\begin{aligned} d_s(\mathbf {p}) = \dfrac{f B}{ |p^L_x - p^R_x| }, \end{aligned}$$
(6)

where \(p^L_x\) and \(p^R_x\) are the positions of the matched points in the left and right images along the x-axis, f is the focal length of the left camera, and B is the baseline between the two cameras. Since each corner point also has a corresponding value of \(d_\beta (\mathbf {p})\), \(\beta \) can be computed by

$$\begin{aligned} \beta = d_\beta (\mathbf {p})/d_s(\mathbf {p}). \end{aligned}$$
(7)

Assuming that \(\beta \) is constant for all pixels, we find the optimal \(\beta \), denoted \(\beta ^*\), that maximizes the number of inlier points whose error is smaller than a threshold value \(\tau _c\),

$$\begin{aligned} \begin{array}{l} \beta ^* = \underset{\beta \in \mathcal {B}}{\arg \max } \sum _{\mathbf {p} \in \mathcal {S}}T(\mathbf {p}, \beta , \tau _c), \\ T(\mathbf {p}, \beta , \tau _c) = \left\{ \begin{array}{ll} 1 &{} \mathrm {if}\ |d_s(\mathbf {p}) - d_\beta (\mathbf {p}) / \beta | \le \tau _c \\ 0 &{} \text {otherwise}, \end{array} \right. \end{array} \end{aligned}$$
(8)

where \(\mathcal {B}\) is the set of \(\beta \) values computed from all feature correspondences and \(\mathcal {S}\) is the set of correspondence positions in image coordinates. The function T returns 1 if the discrepancy between \(d_s(\mathbf {p})\) and the rescaled \(d_\beta (\mathbf {p})\) is small, and 0 otherwise. Therefore, the estimated \(\beta ^*\) minimizes the overall gap between \(d_s\) and \(d_\beta / \beta \). We then rescale \(d_\beta (\mathbf {p})\) and compute its corresponding disparity map as

$$\begin{aligned} \begin{array}{ll} \bar{d}_\beta (\mathbf {p}) = \dfrac{d_\beta (\mathbf {p})}{\beta ^*}, \ \ \bar{D}_\beta (\mathbf {p}) = \dfrac{fB}{\bar{d}_\beta (\mathbf {p})}. \end{array} \end{aligned}$$
(9)

We utilize the rescaled disparity map \(\bar{D}_\beta (\mathbf {p})\) to guide the stereo matching described in the next section.
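As an illustration, the sketch below estimates \(\beta ^*\) by the inlier-voting scheme of Eqs. (6)–(8) and builds the guidance disparity map of Eq. (9). The threshold value and the array layout of the matched corners are assumptions for this example; the paper does not prescribe a specific feature detector.

```python
import numpy as np

def resolve_scale(d_beta, pts_left, pts_right, f, B, tau_c=2.0):
    """Estimate beta* by inlier voting (Eqs. 6-8) and build the guidance disparity (Eq. 9).

    pts_left, pts_right: (N, 2) arrays of matched corner positions (x, y)
    on rectified images; d_beta: up-to-scale depth map from Eq. (5).
    """
    disp = np.abs(pts_left[:, 0] - pts_right[:, 0])          # |p_x^L - p_x^R|
    valid = disp > 1e-6
    d_s = f * B / disp[valid]                                 # Eq. (6): metric depth at corners
    xs = pts_left[valid, 0].astype(int)
    ys = pts_left[valid, 1].astype(int)
    d_b = d_beta[ys, xs]                                      # up-to-scale depth at the same corners

    candidates = d_b / d_s                                    # Eq. (7): one beta candidate per correspondence
    inlier_counts = [np.sum(np.abs(d_s - d_b / b) <= tau_c) for b in candidates]
    beta_star = candidates[int(np.argmax(inlier_counts))]     # Eq. (8): maximize the inlier count

    d_rescaled = d_beta / beta_star                           # Eq. (9): rescaled (metric) depth
    D_guidance = f * B / np.maximum(d_rescaled, 1e-6)         # guidance disparity; clamp avoids division by zero
    return beta_star, d_rescaled, D_guidance
```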

2.4 Robust Stereo Matching Using a Guidance Depth Map

We slightly modify the SGM algorithm [6] to compute the disparity map \(D(\mathbf {p})\) that minimizes the following energy function,

$$\begin{aligned} \begin{array}{ll} E(D) &{}= \underset{\mathbf {p}}{\sum }\Big ( \phi (\mathbf {p}, D(\mathbf {p})) + \psi (\mathbf {p}, D(\mathbf {p})) \\ &{} \quad + \underset{\mathbf {q}{\in N_\mathbf {p}}}{\sum } P_1 T[|D(\mathbf {p})-D(\mathbf {q})| = 1] + \underset{\mathbf {q}{\in N_\mathbf {p}}}{\sum } P_2 T[|D(\mathbf {p})-D(\mathbf {q})| > 1] \Big ). \end{array} \end{aligned}$$
(10)

In the first term, the function \(\phi (\cdot , \cdot )\) is the pixel-wise matching cost, computed by using the census-based Hamming distance and the absolute difference of intensities (AD-CENSUS). The function \(\psi (\cdot , \cdot )\) is an additional pixel-wise cost computed by using the guidance disparity map \(\bar{D}_\beta (\mathbf {p})\),

$$\begin{aligned} \begin{array}{lc} \psi (\mathbf {p}, D(\mathbf {p})) = \left\{ \begin{array}{cc} |\bar{D}_\beta (\mathbf {p}) - D(\mathbf {p})| &{} \ \ \mathrm {if}\ |\bar{D}_\beta (\mathbf {p}) - D(\mathbf {p})| \le \tau _\text {err} \\ c &{} \text {otherwise} \\ \end{array} \right. . \\ \end{array} \end{aligned}$$
(11)

The second term adds the penalty \(P_1\) for pixels whose disparity differs by exactly 1 from that of a neighboring pixel \(\mathbf {q}{\in N_\mathbf {p}}\), i.e., \(T[|D(\mathbf {p})-D(\mathbf {q})| = 1]\) gives 1 when the difference of disparity values is 1. Similarly, the third term adds the larger penalty \(P_2\), with \(P_2 > P_1\), for pixels whose disparity differs from that of a neighboring pixel by more than 1. We minimize Eq. (10) by using the SGM method [6]. As a post-processing step, we apply the weighted median filter [7]. Finally, we obtain the depth map from the disparity map by \(d(\mathbf {p}) = {fB}/D(\mathbf {p})\).
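To make the modification concrete, the following sketch builds the guidance cost of Eq. (11) as a cost volume; the values of \(\tau _\text {err}\) and c here are illustrative assumptions, not the parameters used in our experiments.

```python
import numpy as np

def guidance_cost_volume(D_guidance, max_disp, tau_err=8.0, c=16.0):
    """Pixel-wise guidance cost psi of Eq. (11) as an (H, W, max_disp) volume.

    For each disparity hypothesis d, the cost is |D_guidance - d| when the
    hypothesis agrees with the guidance map within tau_err, and the constant c otherwise.
    """
    H, W = D_guidance.shape
    psi = np.empty((H, W, max_disp), dtype=np.float32)
    for d in range(max_disp):
        diff = np.abs(D_guidance - d)
        psi[:, :, d] = np.where(diff <= tau_err, diff, c)
    return psi
```

Such a volume can simply be summed element-wise with the AD-CENSUS data cost before the usual SGM path aggregation with penalties \(P_1\) and \(P_2\).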

Fig. 3. Comparison of disparity maps and reconstructed 3D structures.

Fig. 4. Comparison of reconstructed 3D structures.

3 Experimental Results

Computing depth maps from endoscopic images is challenging, mainly because of the characteristics of the GI tract and the limited resources of the capsule. The difficulties are summarized as (1) inaccurate matching due to lens distortion, (2) low image resolution, and (3) image noise caused by the lack of illumination.

Note that the proposed method without the direct attenuation model and the corresponding cost terms in Eqs. (10) and (11) is identical to the widely used stereo matching algorithm SGM [6]. Therefore, we qualitatively compare the results of the proposed method with those of the conventional SGM to demonstrate the advantages of the proposed method under the aforementioned difficulties. For a fair comparison and to explicitly demonstrate these advantages, we used the same similarity measure and parameters for both the SGM and the proposed method.

In the first experiment, we used images of a stomach and a small bowel captured by our stereo-type WCE. In the second experiment, we used a large-bowel phantom model, not only to capture endoscopic images but also to compare the actual size of an object with the estimated size.

Qualitative Evaluation: Most importantly, the proposed method acquires fully dense depth maps, whereas the conventional approach fails in non-overlapping regions because pixels of the left image in those regions have no corresponding pixels in the right image. Moreover, since we use cameras with a large field of view, the proportion of non-overlapping regions is about 30\(\sim \)50% depending on the distance of the captured scene from the camera. The disparity values computed in non-overlapping regions are noisy, as shown in Fig. 3(b), where they are exceptionally brighter than the others although they should be similar to those of the surrounding scene structure in Fig. 3(a). The noisy disparity values become even more conspicuous when rendered in 3D space, as shown in Fig. 3(d); they appear to float in space and obstruct the view of the underlying 3D structure. In contrast, the proposed method accurately recovers the 3D structure of the scene, as shown in Fig. 3(e), because the proposed cost function with the direct attenuation model effectively suppresses the uncertainties caused by radial distortion and low-light noise. In addition, the depth map based on the direct attenuation model effectively guides the proposed cost function to reconstruct the depth in non-overlapping regions, as shown in Fig. 3(c).

As discussed in the introduction, the main advantage of the WCE is that it can capture not only stomach images but also images of the small bowel, where typical wired endoscopic devices cannot reach. Similar to the results in Fig. 3, the proposed method reconstructs the 3D structures of local stomach and small bowel regions more robustly than the SGM, as shown in Figs. 4(c) and (d), and effectively estimates dense depth maps in non-overlapping regions, as shown in Figs. 4(a) and (b).

Application for Diagnosis: We also show an application of the proposed method for diagnosis. Using the estimated depth information, we estimate the size of an object of interest by clicking two points in the image, as shown in Figs. 5(a) and (c). In this experiment, we used the large-bowel phantom model and two different types of polyps whose sizes are known. As shown in Fig. 5, the estimated size is quite similar to the actual size, with an error of at most 0.5 mm. This measurement has traditionally relied on the experience of the doctor or endoscopist.
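As a rough illustration of this measurement, the sketch below back-projects the two clicked pixels with a simple pinhole model and returns their Euclidean distance. It assumes the depth map is already metric and the images are undistorted, which is an idealization given the 170\(^\circ \) lens; the intrinsic parameters f, cx, and cy are placeholders.

```python
import numpy as np

def measure_size(p1, p2, depth, f, cx, cy):
    """Metric distance between two clicked pixels, using a pinhole back-projection.

    p1, p2: (x, y) pixel coordinates; depth: metric depth map in mm;
    f, cx, cy: focal length and principal point of the (undistorted) left camera.
    """
    def backproject(p):
        x, y = p
        z = depth[int(y), int(x)]
        return np.array([(x - cx) * z / f, (y - cy) * z / f, z])
    # Euclidean distance between the two back-projected 3D points
    return float(np.linalg.norm(backproject(p1) - backproject(p2)))
```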

Fig. 5. Size estimation results with the large-bowel phantom model.

Parameter Settings and Running Time: We used a 7 \(\times \) 9 window for the census-based matching cost and set \(P_1\) and \(P_2\) to 11 and 19, respectively. The average running time of the proposed method was about 10 ms, implemented on a modern GPU (GTX Titan Xp).

4 Conclusion

We have proposed a stereo matching algorithm designed for a stereo-type wireless capsule endoscope. We obtained an up-to-scale depth map by using the direct attenuation model, exploiting the fact that the only light sources in the completely dark GI tract are those mounted on the capsule. After resolving the scale ambiguity, we employed the rescaled depth map to guide a conventional stereo matching algorithm. Through the experiments, we observed that the proposed method estimates depth maps accurately and robustly in the GI tract.