
1 Introduction

The study of cells, notably in shoots, based on scanning electron microscopy (SEM) images is important for understanding plant functions [18]. This has become a priority for the plant biology community, in order to increase plants' yield and tolerance to various environmental stresses (drought, salinity, etc.). In particular, the analysis of stomatal morphology and density (termed stomatal phenotyping) is critical for breeding high-yield cultivars [18], since stomata play a key role in regulating plant temperature through gas exchange between the atmosphere and the leaf surface [30]. However, despite advances in microscopy, the statistical data required for breeding higher-yielding plant cultivars is currently a bottleneck for biologists [15]. This is because they mainly rely on manual or semi-automatic approaches to compute stomatal phenotypic traits. For example, the authors of [18, 30] manually counted the stomata and visually scored the degree of stomatal opening. Alternatively, ImageJ [22], an open-source software package with interactive functionalities, has been employed in [4, 5]. However, these methods are susceptible to intra-rater and inter-rater variability, resulting in low accuracy and reproducibility [16]. Thus, an automated tool is required for efficient and high-throughput analysis of stomatal phenotypic traits.

Very few methods have been published to automate this process over the past decades. The authors of [11] proposed a framework using template matching to detect stomata, followed by binarization to extract the stomatal aperture. In [21], the authors employed maximally stable extremal regions (MSER) for simultaneous detection and quantification of grapevine stomata. In contrast to these image processing methods for estimating stomatal density, [26] adopted a machine learning approach utilizing a cascade object detector (COD) based on Haar-like features for oak species. Similarly, Jayakody et al. [9] employed histogram of oriented gradients (HOG) features instead of Haar-like features to reduce the training time of the COD for detecting grapevine stomata, and used binarization for segmenting the stomatal aperture. However, these approaches were designed for detecting stomata in SEM images with a homogeneous background. Thus, they are not suitable for the investigated images with a feature-rich background. Also, stomatal segmentation by binarization [9, 11] is limited by (a) the low contrast between the background and the stomatal aperture and (b) the varying degree of aperture occlusion in the presence of papillae (for a sample image, see Fig. 1).

Fig. 1.

A high-resolution patch from a sample SEM image.

In the past few years, deep convolutional neural networks (CNNs) have shown promising results on various computer vision tasks [7, 10, 13, 19]. Thus, to overcome the aforementioned challenges, we propose a novel deep learning framework for automatic stomata detection and quantification. In the first step, a single shot multibox detector (SSD) is employed to deal with (a) different stomata configurations and (b) a feature-rich background with an abundance of similar non-stomata objects in the SEM image. The second step involves the extraction (segmentation) of the stomatal aperture from the detected stomata. However, this segmentation task is hampered by the small size of stomata in the investigated SEM data. Thus, the super-resolution convolutional neural network (SRCNN) [3] scheme is applied to the detected stomata prior to the segmentation step, in order to enhance the visualization of the small stomatal structures. Although the authors of [11, 20] proposed watershed- and threshold-based segmentation for extracting the stomatal aperture, these algorithms are not suitable for the investigated images due to (a) the low contrast between background and stomatal aperture and (b) papillae occlusion. These problems are solved by utilizing a deep learning framework [2] based on a fully convolutional neural network (FCNN) [14]. The authors of [2] presented a framework for high-resolution SEM images containing only a single stoma. In contrast, our automated pipeline handles a more challenging image dataset containing multiple stomata, with no prior information about the stomatal parameters (count, shape and size), thus facilitating high-throughput stomata analysis.

The rest of the paper is organized as follows: Sect. 2 elucidates the experimental protocol for data acquisition, Sect. 3 explains the proposed methodology, Sect. 4 discusses the results, and Sect. 5 concludes the paper.

2 Dataset

A drought experiment is conducted on different rice cultivars during the kharif season at the phenomics facility of the Indian Agricultural Research Institute. The investigated cultivars are divided into three groups of differing irrigation intensity, i.e., well-watered, reduced-watered and unwatered. Leaf samples of these cultivars are collected at different drought stages and immediately fixed in liquid nitrogen. Images (\(1024 \,{\times }\, 768\)) of the leaf surface are taken with a scanning electron microscope (model EVO50, Zeiss, UK) at 9.00 KX magnification (termed Data-1, containing a single stoma per image) and at 1.00 KX magnification (termed Data-2, containing multiple stomata) to develop a high-throughput phenotyping pipeline for stomata detection and quantification (for sample images from Data-1 and Data-2, see supplementary Fig. 2).

3 Methodology

In this section, we introduce the proposed framework (workflow shown in Fig. 2). The method consists of a detection stage for computing stomatal density and a segmentation stage for computing stomatal morphological traits.

Fig. 2.

Workflow of the proposed framework.

3.1 Stomata Counting

Stomatal density is estimated by counting the stomata in an SEM image of a given leaf surface [26]. A number of methods have been proposed in the literature [29] to predict the count of objects in an image using regressors trained on global image features. A global regressor ignores spatial information, so pixel-level object density map regression was proposed in [12]. However, these methods cannot easily provide the locations of the objects, which are essential for the subsequent stomata quantification. Thus, stomata counting is implemented here via detection and localization of individual stomata.

In recent years, counting objects by detection has benefited from the superior performance of CNNs. The region-based convolutional neural network (R-CNN) [7] was the first deep learning approach to improve detection accuracy by combining an object proposal algorithm [7] with CNN features learned for the proposal regions. Later, spatial pyramid pooling in deep convolutional networks (SPPnet) [8] was introduced to speed up R-CNN detection via a spatial pyramid pooling (SPP) layer that shares features between proposals. This was further improved by Fast R-CNN [6], which proposed end-to-end training using the SPP layer. Furthermore, Faster R-CNN [19] presented CNN-based proposal generation and its integration with Fast R-CNN by sharing convolutional layers.

Table 1. SSD architecture. The names conv and pool represent the convolutional and pooling layers, respectively.

In contrast to the previously mentioned proposal-based methods, the single shot multibox detector (SSD) [13] is a state-of-the-art detector that looks for objects in a set of default boxes over different feature levels and scales. It is free from proposal generation and the subsequent feature resampling and classification. Thus, SSD is a suitable choice for the current objective of high-throughput stomata detection. However, unlike the complex images with multiple instances of different objects considered in the original work [13], the leaf microscopy images in this work are utilized only for stomata detection, with the remaining leaf region considered background; thus, a shallow version of SSD (architecture shown in Table 1) is employed. Each convolutional (conv) layer is followed by batch normalization and exponential linear units. Boxes with aspect ratios 0.5, 1.0 and 2.0 are generated from the outputs of the conv4 to conv7 layers at scales varying from 0.08 to 0.96. The model is trained to detect stomata in these boxes using the Adam optimizer with an initial learning rate of 0.0001 and a decay factor of \(10^{-4}\). SSD is trained for 100 epochs with a batch size of 16 and \(L_2\) regularization of the weights with a 0.0005 decay factor. Bounding box annotations of the stomata in the SEM images (Data-2) are manually generated with the help of two expert biologists to obtain ground truth labels. Since the investigated SEM images are high resolution (\(1024 \times 768\)) and comprise small-scale stomata, downsampling would lead to a loss of discriminative details. Thus, to limit the computational cost associated with CNNs, high-resolution patches of size \(256 \times 256\) with \(75\%\) overlap are sampled from these images. 2000 patches with the corresponding bounding box annotations are utilized for training and 440 for validation. The weights corresponding to the minimum validation loss are used for testing. Each detected stoma is then individually utilized for the computation of its morphological features, as explained in the next subsection.
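
The overlapping patch sampling described above can be sketched as follows (a minimal illustration, not the original implementation; the function name and the return of top-left coordinates are our own conventions, the latter being useful for mapping patch-level detections back to image space):

```python
import numpy as np

def sample_patches(image, patch_size=256, overlap=0.75):
    """Extract overlapping square patches from a 2-D image.

    With patch_size=256 and 75% overlap the stride is 64 pixels,
    matching the sampling described in the text.
    """
    stride = int(patch_size * (1 - overlap))  # 256 * 0.25 = 64
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
            coords.append((y, x))  # top-left corner of each patch
    return np.stack(patches), coords
```

On a \(1024 \times 768\) image this yields a regular \(13 \times 9\) grid of 117 patches, each of which is fed to the detector independently.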

3.2 Stomata Segmentation

Stomata are pores on the leaf surface that control plant transpiration through the opening and closing of their apertures [18]. Thus, it is important to accurately compute the stomatal opening. However, the accuracy of this quantification depends on the aperture segmentation step. The threshold-based segmentation employed in [9, 11, 17] and the watershed segmentation in [20] were proposed for images containing stomata with a homogeneous, high-contrast background and no papillae occlusion. Thus, these algorithms are not suitable for the investigated images. The authors of [2] presented a solution to these challenges inspired by FCNN [14]. However, the direct application of this framework is not suitable, since the investigated images comprise small-size stomata. Thus, prior to the segmentation step, the stomata detected by the SSD model are super-resolved.

Although nearest-neighbor, bilinear and bicubic interpolation are the conventional methods for generating a high-resolution image, they produce over-smoothed images with artifacts such as aliasing and blur around the edges [24]. Example-based super-resolution methods [25, 28], which learn the relationship between low-resolution and high-resolution image pairs, yield higher image quality than these interpolation methods, but their computation time is not suitable for this high-throughput stomata phenotyping task. In contrast, the super-resolution convolutional neural network (SRCNN) [3] directly learns an end-to-end mapping between the low-resolution and the high-resolution image; thus, SRCNN is utilized.

SRCNN [3] consists of three convolutional layers for patch extraction and representation, non-linear mapping, and reconstruction. The first layer uses a \(9 \,{\times }\, 9\) kernel with 64 feature maps to extract a compact representation of the low-resolution (LR) image. This is followed by a second layer of 32 feature maps with a \(5 \,{\times }\, 5\) kernel, which maps the LR feature maps into high-resolution (HR) feature maps. The last layer, with a \(5 \,{\times }\, 5\) kernel, is used for HR image reconstruction. Since HR images corresponding to Data-2 are unavailable (it was acquired at 1.00 KX magnification), Data-1, employed by the authors of [2] in the segmentation network (termed here the HR images), is utilized for training the SRCNN. The HR images are downscaled to \(64 \times 102\) and then upscaled to \(160 \times 256\) using bicubic interpolation (termed LR) to serve as input, with the corresponding HR images as output. The downscaled size of \(64 \times 102\) is chosen since, in the investigated images (Data-2), this patch size is found to be sufficient to contain an individual stoma. A total of 90 and 39 images are used as training and validation datasets, respectively, and the Adam optimizer is used to minimise the mean squared error loss with a \(10^{-4}\) learning rate.
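
The three-layer stack described above can be sketched as a plain forward pass (an illustrative NumPy version with random, untrained weights, not the trained model; layer widths follow the 9/64, 5/32, 5/1 description, 'same' zero padding keeps the spatial size fixed, and the ReLU activations are an assumption):

```python
import numpy as np

def conv2d_same(x, kernels):
    """'Same' 2-D cross-correlation: x is (C_in, H, W), kernels is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += kernels[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + w]
    return out

def srcnn_forward(lr_image, seed=0):
    """Forward pass through the 9x9/64 -> 5x5/32 -> 5x5/1 SRCNN layer stack."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0, 0.01, (64, 1, 9, 9))   # patch extraction and representation
    w2 = rng.normal(0, 0.01, (32, 64, 5, 5))  # non-linear mapping
    w3 = rng.normal(0, 0.01, (1, 32, 5, 5))   # HR reconstruction
    f1 = np.maximum(conv2d_same(lr_image[None], w1), 0)  # ReLU
    f2 = np.maximum(conv2d_same(f1, w2), 0)
    return conv2d_same(f2, w3)[0]
```

Because of the 'same' padding, a \(160 \times 256\) bicubically upscaled input yields a \(160 \times 256\) reconstruction, i.e., the network refines rather than resizes the image.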

The next step is the extraction of the stomatal aperture using the framework presented in [2]. The authors trained their network on Data-1 (the HR images); thus, for its application to the investigated images, it is fine-tuned using the SRCNN output. In total, 117 HR images and the corresponding masks (available from [2]) are used for this purpose. The images are first downsampled (\(64 \,{\times }\, 102\)) and then super-resolved to \(160 \times 256\) (using the trained SRCNN) to obtain the desired data. This dataset is divided into training and validation sets containing 78 and 39 images, respectively. The training images are augmented using random translation, rotation, flipping and zooming, which results in 960 samples. The network is initialized with the weights from [2], which are fine-tuned using the Adam optimizer. An initial learning rate of \(10^{-6}\) with a decay factor of \(10^{-8}\) and a weighted binary cross-entropy loss are used during training. The network is trained for 500 epochs with a batch size of 20.
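
The weighted binary cross-entropy used for fine-tuning can be written out as follows (a minimal NumPy sketch; the positive-class weight `w_pos` is a hypothetical parameter illustrating how scarce foreground (aperture) pixels can be up-weighted against the dominant background, and its value is not taken from the original work):

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos=5.0, eps=1e-7):
    """Pixel-wise weighted binary cross-entropy.

    y_true: binary mask (1 = aperture), y_pred: predicted probabilities.
    Foreground terms are scaled by w_pos to counter class imbalance.
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)      # avoid log(0)
    loss = -(w_pos * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()
```

With `w_pos > 1`, missing an aperture pixel costs more than mislabelling a background pixel, which pushes the network away from predicting all-background masks.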

At test time, patches of size \(64 \times 102\), extracted at each stomata location (Data-2) detected by the SSD network, are super-resolved using the aforementioned trained SRCNN model. The binary mask corresponding to the stomatal aperture is then extracted from the resolved image using the fine-tuned segmentation network. Since stomata exhibit an approximately elliptical shape [20], ellipse fitting [27] is applied to the segmented mask to compute its primary morphological features: (1) the major axis (l), (2) the minor axis (w), (3) the area and (4) the width-to-length ratio (w/l).
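
The feature computation can be illustrated with a moment-based ellipse fit (a sketch under our own assumptions, not the fitting method of [27]: here the axis lengths are recovered from the eigenvalues of the second central moments of the foreground pixels, using the fact that a uniform ellipse with semi-axis a has variance a²/4 along that axis):

```python
import numpy as np

def ellipse_features(mask):
    """Fit an ellipse to a binary mask via second central moments.

    Returns (major axis l, minor axis w, area, w/l). Full axis lengths
    equal 4 * sqrt(eigenvalue) of the covariance of the foreground
    pixel coordinates.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    cov = np.cov(pts, rowvar=False)
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending
    l = 4.0 * np.sqrt(evals[0])   # full major-axis length
    w = 4.0 * np.sqrt(evals[1])   # full minor-axis length
    area = float(mask.sum())
    return l, w, area, w / l
```

Because the fit uses eigenvalues of the covariance, it is invariant to the orientation of the stoma in the patch.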

4 Results

In this section, we evaluate the performance of the proposed framework in terms of stomatal density and its quantification (Table 2).

Table 2. Stomata detection results.

The detection step of the proposed framework is compared with Liu's MSER, Laga's template matching approach and Jayakody's HOG-based COD (shown in Table 2) in terms of precision (P) and recall (R), given by \(P=TP/(TP+FP)\) and \(R=TP/(TP+FN)\) (where TP, FP and FN denote true positives, false positives and false negatives, respectively, determined using a 0.80 Intersection over Union (IoU) threshold between the predicted and ground truth bounding boxes) on test data containing 93 stomata. Figure 3 shows the results of the existing stomata detection approaches on a cropped section of an investigated image (due to limited space, the results on the original image are shown in supplementary Fig. 4). The MSER algorithm assumes that the objects of interest are brighter than their boundaries and that there exists some optimal threshold at which each individual object can be segmented. However, due to the presence of wart-like protuberances and papillae with higher grayscale intensity than the stomatal aperture, it detects background objects (high FP and FN, shown in Fig. 3(a)). For the extraction of the stomatal aperture, the regions detected by the MSER algorithm are further filtered by eccentricity, since apertures exhibit an elliptical shape (shown in supplementary Fig. 5(b)). The template matching approach requires manual selection of both the threshold and the templates. We utilized 50 stomata templates, but due to the feature-rich background, a large number of false positives are generated (Fig. 3(b)). The more recent HOG-based COD [9] performs better than the previously mentioned image processing algorithms. Although HOG represents the shape of an object well, it is limited by the occlusion of the stomata by papillae and by the varying level of deformation (closing) of the stomata under the drought experiment (Fig. 3(c)). Figure 3(d) shows the stomata detection based on the SSD model. Since an overlapping patch-based detection approach is employed, multiple bounding boxes are obtained for each detected stoma. To eliminate this redundant counting, non-maximum suppression (NMS) is utilised for an accurate computation of stomatal density (Fig. 3(e)).
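
The IoU matching and the NMS step can be sketched as follows (a minimal illustration; boxes are `(x1, y1, x2, y2)` tuples with confidence scores, and the NMS overlap threshold of 0.5 is an assumed value, distinct from the 0.80 IoU used for evaluation above):

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of surviving boxes."""
    order = np.argsort(scores)[::-1]  # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # drop every remaining box that overlaps the kept one too much
        mask = np.array([iou(boxes[i], boxes[j]) <= thresh for j in rest],
                        dtype=bool)
        order = rest[mask]
    return keep
```

The stomatal density of an image is then simply the number of surviving boxes divided by the imaged leaf area.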

We also compare the detection results with the pixel-level object density regression method presented in [1]. Arteta et al. [1] proposed an interactive counting algorithm that predicts a spatial density map using pixel-level ridge regression with dot annotations as user input, and also provides candidate regions containing the objects of interest for iterative user annotation. Figure 4(a) shows the predicted density map and Fig. 4(b) shows the candidate regions containing stomata after the third iteration (the numbers alongside the detected regions denote the total stomata count). Due to the presence of wart-like protuberances in the SEM images that resemble the papillae arrangement around the stomatal aperture, the total stomata count is overestimated in a few detected regions. In contrast to the existing methods, the proposed framework performs considerably better and achieves high recall and precision without any manual intervention (supplementary Fig. 4(d–e)).

Fig. 3.

Stomata detection based on (a) MSER, (b) template matching, (c) HOG-based COD, (d) the SSD model and (e) the SSD model with NMS.

Fig. 4.

Interactive stomata counting.

The windows detected as stomata by the SSD model after NMS are segmented to extract the stomatal aperture for computing its morphological traits. The threshold-based segmentation proposed in [9, 11, 17] for this purpose leads to the segmentation of papillae and wart-like protuberances, as they exhibit higher grayscale intensity relative to the background than the stomata do. In addition to this low contrast, the accuracy of watershed segmentation is also limited by the aperture occlusion (results shown in supplementary Fig. 6). Thus, the deep learning framework presented in [2] is employed. Since that framework was trained on the HR dataset collected at 9.00 KX magnification, the detected stomata are super-resolved using the SRCNN model (details in Sect. 3.2). The model is trained using Data-1 and an MSE of 0.0072 is obtained after 2000 epochs (sample output shown in Fig. 5).

Fig. 5.

An example of the reconstructed high-resolution image (magnification of \(\times 2.5\)): (a) low-resolution input, (b) reconstructed output of the SRCNN scheme and (c) original test image.

We hypothesize that this model is qualitatively sufficient as a pre-processing step for the subsequent stomatal aperture segmentation, and we quantitatively evaluate the accuracy of the segmented mask (results shown in supplementary Figs. 7 and 8) in terms of IoU. The mean IoU with the ground truth is computed to be 0.8602 on 50 test images. The primary morphological features (length and width) are computed by ellipse fitting on the generated segmentation mask (supplementary Figs. 7(d) and 8(d)). These computed features are compared with the corresponding ground truth features, collected by an expert biologist using ImageJ. The major axis (corresponding to the stomatal length) and the minor axis (corresponding to the stomatal width) of the ellipse show correlations of 0.95 (shown in supplementary Fig. 9) and 0.91, respectively, over 50 detected stomata.

The proposed approach is fully automatic and computationally efficient, enabling high-throughput phenotyping (Fig. 6) in a drought stress environment. This shows the potential of our framework to be employed in other biotic and abiotic stress experiments (salt, \(SO_2\) etc.) based on stomatal responses.

Fig. 6.

High-throughput stomata phenotyping: (a) SSD with NMS, (b) LR stoma (input to SRCNN), (c) HR stoma (SRCNN output) and (d) ellipse fitting on the segmented output.

5 Conclusion

In this paper, we presented a novel deep learning framework for automated stomata phenotyping. The framework comprises SSD-based detection and SRCNN super-resolution of the small detected stomata, followed by segmentation based on a fully convolutional neural network. We have shown that, compared with existing approaches, our method performs better on a challenging dataset that includes varying degrees of stomatal occlusion, a dynamic feature-rich background and small stomata. This shows its potential application in various biotic and abiotic stress studies. We plan to apply this framework to SEM images from other cultivars.