1 Introduction

Depth is an important characteristic of a scene, and much of this 3D information is lost when standard image acquisition systems convert scene information into 2D data. The human visual system, however, is very good at estimating the depth of various elements in a scene, and it does so by taking into account different depth cues. There have been many attempts to replicate the performance of these biological systems, as depth information of scenes is vital to many applications. This is why depth estimation remains one of the most intriguing open problems in computer vision.

Range acquisition methods can be broadly classified into optical and non-optical. Non-optical methods provide accurate single-point depth measurements but require expensive scanners and computation to produce a dense depth map. Optical techniques, in contrast, recover dense depth information directly from 2D images. Optical methods are divided into active and passive, and passive techniques are further divided into binocular and monocular techniques. Common binocular methods include Depth from Stereo and Structure from Motion Parallax, while the monocular techniques include Shape from Shading, Shape from Silhouettes, Depth from Focus and Depth from Defocus. There have been many previous attempts at Depth from Defocus. Pentland [1] was the first to use multiple images. His frequency-domain technique used two images of the same scene captured with different aperture settings (a smaller aperture to capture a focused image and a larger aperture to capture a defocused image) to estimate the amount of defocus. In 1993, Ens and Lawrence [2] proposed a spatial-domain technique based on a matrix regularization approach to recover depth information from two defocused images. Their method was presented as an alternative to the inverse filtering approach advocated by Pentland. In 2007, a neural-network-based technique was suggested by Jong [3], which estimated the spread parameter σ of the Gaussian point spread function in the spatial domain. The model was based on a supervised learning network that employed the Radial Basis Function (RBF). Further explanation of depth from defocus can be found in [4,5,6] and in a textbook [7].

1.1 Objective

In this study we make use of the depth from defocus (DfD) technique. DfD differs from depth from focus in that it requires only two differently focused images of the scene to be taken. Each image is then segmented into subsections, and for each subsection we estimate the degree to which it has been blurred. In a given scene, different regions lie at different distances from the lens aperture, so the lens cannot bring the light emanating from every point in the scene into focus simultaneously. Once blur has been measured for corresponding subsections of the two images, relative depth is calculated from these blur parameters. The primary obstacle is to develop a system that can predict the degree to which a segment is blurred irrespective of the texture within that segment. A feedforward multilayer neural network was used to predict the blur. The advantage of using such a system is its ability to generalize and respond to unexpected inputs/textures.

2 Image Acquisition System

2.1 Lens Setup and Tele-centric Optics

DfD utilizes two images of the same scene. For a real-time implementation of this system, the images have to be captured simultaneously; otherwise any change in the scene between the two exposures will lead to erroneous results. In our implementation, however, the two images were taken one after the other by changing the distance between the image sensor and the lens. The lens used to obtain the real images was a Nikon 50 mm with the aperture set to 6.5 mm. For the objects imaged, a near focus of 744 mm and a far focus of 800 mm were set.

One of the major constraints for any imaging setup for DfD is that the magnification must be constant between the two defocused images. If this criterion is not satisfied, a correspondence problem arises, similar to that in the depth from stereo case. To solve this issue, an aperture is placed at the object-side focal plane, as shown in Fig. 1. This reduces the intensity of light reaching the lens, but the rays that do reach the lens have passed through the focal point and therefore emerge parallel to the optical axis, so all images formed are of constant magnification. The properties of tele-centric optics are studied in detail by Watanabe and Nayar [8].

Fig. 1.
figure 1

Lens setup for DfD based on tele-centric optics

2.2 Image Formation

From Fig. 1 we see that the two image sensors (i₁ and i₂) are placed beyond the focal plane, with a separation of 2e between them. The first image, i₁, is formed at a distance γ from the lens and is called the far focused image. The second image, i₂, is formed at a distance γ + 2e from the lens and is called the near focused image. From Fig. 2 it is seen that when an object is placed at a distance u₁, a focused image is formed at i₁. The light from the object is, however, distributed over a circular region on the second sensor i₂; this circle is called the circle of confusion, and the object appears blurred on the second sensor. Similarly, from Fig. 2, when the object is at distance u₂ from the lens, the image is sharp on i₂ and blurred on i₁. Thus there is a relation between the difference in blurring between the two images and the object distance.

Fig. 2.
figure 2

Ray diagram depicting the two configurations in which the object is at distance u₁ and at distance u₂. Note that this diagram is merely representative; in reality there is no magnification variation between the two images because of the tele-centric setup.

3 Mathematical Modelling of Blur and Its Correlation to Relative Depth in DfD

3.1 Mathematical Model of Defocused Images

Defocused or blurred images can be mathematically modelled as the spatial convolution (denoted by the operator \( { \circledast } \)) of a kernel/mask with the focused image. These kernels are usually known as 2D point spread functions (PSF). The following equation shows the relation between the focused image region and its defocused counterpart in the spatial domain:

$$ I_{\text{defocused}} = I_{\text{focused}} \circledast \mathit{PSF}. $$
(1)

There are different types of point spread functions considered by different researchers based on the type of application and lens parameters. Some of the common ones are Pillbox, Gaussian and Generalized Gaussian. In our study we have assumed that blur is due to a Gaussian kernel. A two dimensional Gaussian point spread function with zero mean [9] is given by:

$$ f\left( x,y \right) = \frac{1}{2\pi \sigma^{2}}\, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}} $$
(2)

The degree to which a particular region of an image is blurred depends on the lens and scene settings. This can be modelled by varying the σ parameter (the spread parameter of the PSF) of the kernel convolved with the focused region to imitate the blur: larger values of σ correspond to more blurring, and small values of σ correspond to sharp regions.
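As an illustration, the following minimal sketch (Python with NumPy/SciPy; the function names are ours) builds the Gaussian PSF of (2) and applies the forward model of (1) to a focused image region for a chosen σ.

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_psf(sigma, radius=None):
    """2D zero-mean Gaussian PSF of Eq. (2), sampled on a square grid."""
    if radius is None:
        radius = max(1, int(np.ceil(3 * sigma)))   # truncate at ~3 sigma
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    psf = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return psf / psf.sum()                          # normalize to preserve brightness

def defocus(focused_region, sigma):
    """Eq. (1): defocused = focused (*) PSF; sigma close to 0 leaves the region sharp."""
    if sigma < 1e-3:
        return focused_region.astype(float)
    return convolve2d(focused_region, gaussian_psf(sigma),
                      mode='same', boundary='symm')
```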

3.2 Relating σ to the Relative Depth

If we observe Fig. 2 we see that at i₁ the rays from a point object at distance u₁ converge to a point. The image formed there is therefore sharp, and when modelled it is formed by convolving the focused image with a PSF whose σ is close to 0. On i₂ the rays are spread over the "circle of confusion" and the image appears more blurred; this can be modelled by a PSF with a larger σ than the one used for i₁. As we move from object distance u₁ to the configuration with object distance u₂ shown in Fig. 2, the σ for i₁ increases while the σ for i₂ decreases. The σ corresponding to i₁ is called σ(far focused) and the one for i₂ is called σ(near focused). The relation between these σ values and the normalized object distance is given by the following expression, similar to the relation given in Watanabe and Nayar [10]:

$$ \text{Normalized distance}\;(d) = \frac{\sigma(\text{near focused}) - \sigma(\text{far focused})}{\sigma(\text{near focused}) + \sigma(\text{far focused})} $$
(3)

In our system, the two images formed at i₁ and i₂ are first segmented into smaller subsections, each of dimension 10 × 10 pixels. Each pixel-intensity matrix is vectorized and fed to a neural network, which predicts its σ (spread parameter). We thus form a σ-map for each of the two images; the value at a particular location on the σ-map indicates the extent to which that region has been blurred. Using the σ(far focused) and σ(near focused) maps and (3), we compute a depth map for the scene. The computed depth maps are usually very noisy, so the output is passed through a median filter to remove noise without disturbing the edge features of the scene.
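A short sketch of this step (Python/SciPy; names are ours, and `eps` is added only to guard against division by zero in textureless regions) combines the two σ-maps according to (3) and applies the median filter described above.

```python
import numpy as np
from scipy.ndimage import median_filter

def depth_map(sigma_near, sigma_far, eps=1e-6, filter_size=3):
    """Combine the per-patch sigma-maps of the near- and far-focused images
    with Eq. (3), then median-filter the result to suppress noise while
    keeping the edge structure of the scene."""
    d = (sigma_near - sigma_far) / (sigma_near + sigma_far + eps)
    return median_filter(d, size=filter_size)
```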

4 Training the Multi-layered Neural Network to Predict σ

4.1 Network Architecture

The primary issue in using a multi-layered neural network is choosing an architecture with sufficient layers to approximate the function being learnt. If we choose an excessive number of hidden layers, there are more weights to learn, which requires more training data; excessive weights also lead to overfitting and poor performance on unexpected data. In our study we use a network with 3 hidden layers: the first has 100 neurons, the second 50 and the last 2. The first hidden layer has 100 neurons because it accepts the pixel-intensity values of the vectorized 10 × 10 image subsection; the subsequent 50-neuron and 2-neuron layers process these values to predict the spread parameter.
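A hedged sketch of such a 100-50-2 regressor using scikit-learn is given below. scikit-learn does not provide the scaled conjugate gradient trainer described in Sect. 4.3, so the 'adam' solver is substituted here, and the activation function is our assumption since it is not specified in the text.

```python
from sklearn.neural_network import MLPRegressor

# Sketch of the 100-50-2 architecture described above (solver and activation are
# substitutions/assumptions, not the paper's exact training setup).
net = MLPRegressor(hidden_layer_sizes=(100, 50, 2),
                   activation='tanh',   # assumed; not stated in the paper
                   solver='adam',       # stand-in for scaled conjugate gradient
                   max_iter=500,
                   random_state=0)

# X has shape (n_samples, 100): vectorized 10 x 10 subsections.
# y has shape (n_samples,): the target spread parameter sigma of each subsection.
# net.fit(X_train, y_train)
# sigma_hat = net.predict(patch.reshape(1, -1))
```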

4.2 Choice of Network Architecture

We chose the network architecture depicted in Fig. 3 from a sample space of infinite possibilities, but, as in all optimization problems, we were constrained by limited computing power and a finite training data-set. We do not claim that the chosen architecture is the best; rather, it is the best of those we tested. The first hidden layer of 100 units would be sufficient if there were only one texture, but to make the blur prediction texture invariant, additional layers are introduced. This makes the function to be learnt non-linear; Fig. 4 shows the performance of a few of the sample architectures tested.

Fig. 3.
figure 3

Network architecture

Fig. 4.
figure 4

Network performance for 5 sample architectures trained on 5 fractal textures. (Class 1: 1 hidden layer with 100 neurons; Class 2: 2 hidden layers with 100 neurons each; Class 3: 2 hidden layers with 100 and 50 neurons; Class 4: 3 hidden layers with 100, 50 and 2 neurons; Class 5: 3 hidden layers with 100, 10 and 10 neurons)

4.3 Learning Algorithm to Learn the Weights

The learning algorithm utilized in this system is the Scaled Conjugate Gradient algorithm [11], a variant of the standard backpropagation algorithm [12]. We use this particular algorithm because it converges significantly faster than the standard algorithm without much compromise in performance.

4.4 Training the Neural Network

The neural network has to be adaptive and respond to different types of textures. To achieve this, we experimented with different types of textures, both natural and artificial, in the training set. Initially we attempted to train the network with sets of natural textures. The dataset contained 148 images, each of dimension 1000 × 1000 pixels. These were extremely sharp images of naturally occurring textures, assumed to lie entirely on the focal plane so that the spread parameter σ = 0. Each image is segmented into 50 rows: the first row is left untouched (σ = 0), the second row is artificially blurred by convolving that region with a Gaussian PSF with σ = 0.1, the third row with σ = 0.2, and so on until the last row, which is blurred with σ = 4.9. We thus obtain an image that is blurred in increasing steps. The image is then segmented into 10000 subsections of size 10 × 10; these subsections are vectorized and paired with their target σ to form our training set. This procedure is carried out for all 148 images, giving training data of 1,480,000 samples. After training the neural network on this data, the best mean squared error achieved was 2.08 on the validation set, and the system was unable to discern depth variations. The problem with natural textures is that the imaging systems used to acquire them are not perfect and the different regions of the image do not lie on the same focal plane, so our assumption that σ = 0 initially is false. With the added noise and other defects, natural textures are not ideal candidates for training the neural network.
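The row-wise blurring and patch extraction can be sketched as follows (Python/NumPy; `defocus` is the helper from the sketch in Sect. 3.1, and blurring each band independently ignores the small leakage across band boundaries, which is acceptable for a sketch). For a 1000 × 1000 texture this yields the 10000 labelled patches per image mentioned above.

```python
import numpy as np

def make_training_samples(texture, rows=50, patch=10, sigma_step=0.1):
    """Blur a sharp texture band by band with sigma = 0.0, 0.1, ..., 4.9,
    then cut it into vectorized 10 x 10 patches labelled with the sigma
    of the band they came from."""
    h, w = texture.shape
    band_h = h // rows
    X, y = [], []
    for r in range(rows):
        sigma = r * sigma_step
        band = defocus(texture[r * band_h:(r + 1) * band_h, :], sigma)
        for i in range(0, band_h - patch + 1, patch):
            for j in range(0, w - patch + 1, patch):
                X.append(band[i:i + patch, j:j + patch].ravel())
                y.append(sigma)
    return np.array(X), np.array(y)
```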

4.5 Training Data

The vectorized image samples, along with their target σ, form our dataset. This dataset is split such that 70% of the data is used for training, 15% for testing and the final 15% for validation. The near focused/far focused images used to extract depth for the scenes were not used in training; these contained completely new textures for which σ was to be predicted by the neural network.
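One simple way to realize this 70/15/15 split is sketched below; the paper does not specify how the split was drawn, so a random shuffle is assumed here.

```python
import numpy as np

def split_dataset(X, y, train=0.70, val=0.15, seed=0):
    """Shuffle the samples and split them into training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    tr, va, te = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```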

4.6 Training Using Fractals and Artificially Generated Images

In order to improve the performance of the system, we generated patterns and textures programmatically and used these in our training data. The important characteristic common to all these images was their high spatial frequency: the variation of pixel-intensity values within a short spatial span is very large. Fractals [13] are mathematical sets that exhibit the property of "self-similarity", and when these sets are visualized they can form very high spatial frequency images. The textures in these images are similar to natural ones, but they are free from optical distortions and are therefore ideal candidates for training our neural network to achieve textural independence.
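As one illustration of such a programmatically generated texture, the sketch below renders a Julia-set escape-time image; the paper does not state which fractal family was used, so this particular choice and its parameters are ours.

```python
import numpy as np

def julia_texture(size=1000, c=-0.8 + 0.156j, max_iter=100):
    """Render a grey-level Julia-set escape-time image as a high spatial
    frequency training texture (an illustrative choice of fractal)."""
    y, x = np.mgrid[-1.5:1.5:size * 1j, -1.5:1.5:size * 1j]
    z = x + 1j * y
    escape = np.zeros(z.shape, dtype=float)
    mask = np.ones(z.shape, dtype=bool)        # points that have not escaped yet
    for k in range(max_iter):
        z[mask] = z[mask] ** 2 + c
        escaped = mask & (np.abs(z) > 2.0)
        escape[escaped] = k
        mask &= ~escaped
    return escape / max_iter                   # values in [0, 1]
```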

We used just 19 of these fractal images (190000 samples after segmentation), as opposed to the previous 148-image training set, to train our network; they underwent the same pre-processing as the natural images before being used as training data. The trained network now gave a mean squared error of 0.974 on the validation set when predicting σ. With this improved performance, the network became significantly better at visualizing the 3D variation in depth of scenes.

To compare performance with the previous network, we artificially blurred a sample texture in linearly increasing steps to simulate depth variation, as shown in Fig. 6. This image was taken to be the near focused image and was rotated by 180° to obtain the far focused image; the two were then fed into the system to predict the simulated depth. As Fig. 6 shows, the network trained on fractals predicts the linear change in depth much better than the one trained on natural textures.

Fig. 5.
figure 5

Some of the images used for training the neural network. The 2 textures on the left are natural and the 2 on the right are fractal patterns.

Finally, we implemented the system on real scenes (Fig. 6) and the results are shown in Figs. 7 and 8. From Figs. 7 and 8 we can see that the system is able to discern the variation in depth across the scenes and to form a 3D representation from the two images.

Fig. 6.
figure 6

The blue line in the above figure shows the actual depth variation induced on the texture by artificial blurring. The red line shows the depth predicted by the network trained on fractal images, and the orange line shows the depth predicted by the network trained on natural images. On the right are the near focused and far focused images of two sample scenes. (For each pair of scene images, the one on the left is the near focused image and the one on the right is the far focused image)

Fig. 7.
figure 7

3D depth maps formed for the two pairs of images shown in Fig. 6

Fig. 8.
figure 8

A colored representation of normalized depth. Blue indicates that the region is closer to the lens aperture and yellow indicates that the region is further away.

5 Conclusion

In this paper we have demonstrated that training on fractal images significantly enhances the depth maps of scenes compared with depth maps produced by networks trained on naturally occurring textures. DfD gives erroneous results when the scene is devoid of texture; this can, however, be rectified by projecting artificial illumination patterns onto the scene. There is much to be explored with regard to the neural network architecture; in this study we chose one based on limited testing, and there is significant scope in this domain to further improve system performance. The current system produces a relative depth map, but by incorporating the lens and sensor parameters we could predict the absolute depth of objects within a particular range.