1 Introduction

Plant phenotyping is essential to the study of plant biology, plant functional genomics and plant breeding (Dhondt et al. 2013; Yang et al. 2013; Bolger et al. 2014). Yet phenotyping has become a new bottleneck in plant biology. Over the past 5 years, considerable effort has been devoted to automatic phenotyping (Duan et al. 2011a, 2011b; Jiang et al. 2012; Huang et al. 2013). However, much work is still needed to close the genotype-phenotype gap.

Biomass is an important phenotypic trait in functional plant biology and plant growth analysis (Honsdorf et al. 2014). Shoot dry weight (DW) is a widely used measure of the biomass of individual plants (Golzarian et al. 2011). In the traditional measurement of DW, the shoot is cut off, oven-dried to constant weight and weighed on a balance. The low efficiency of this method makes it almost impossible to investigate a large population of plants. In addition, because the measurement is destructive, the DW of an individual plant cannot be monitored continuously over time.

Inferring biomass by machine vision and image analysis allows non-destructive, high-throughput and continuous measurement of a large number of samples. Several studies have addressed the automatic measurement of plant biomass (Rajendran et al. 2009; Munns et al. 2010; Hairmansis et al. 2014). However, these studies gave satisfactory results only for young plants (several weeks after sowing) of a few varieties.

Based on statistical learning theory, the Support Vector Machine (SVM) is robust to high-dimensional input spaces and generalizes well (Vapnik 1995). Support Vector Regression (SVR) extends SVM to regression and is especially useful in the presence of outliers and non-linearities (Brereton and Lloyd 2010).

This study aims to establish an SVR-based model for measuring the aboveground biomass of different rice varieties. To the best of our knowledge, no published work has used SVR for biomass measurement.

2 Materials and Methods

2.1 Plant Materials and Image Acquisition

A total of 402 rice plants (402 accessions, 1 replicate each) were grown in a greenhouse. At the late booting stage, all plants were imaged with a rice automatic phenotyping platform (RAP) (Yang et al. 2014). A turntable rotated the plant while a charge-coupled device (CCD) camera (Stingray F-504C, Applied Vision Technologies, Germany) acquired images at 30° intervals, giving 12 color images per plant at different angles. Simultaneously, a linear X-ray CT captured a sinogram of the plant, from which section images were reconstructed and used to extract the tiller number (Yang et al. 2011). The images were saved on the computer for further processing. The plants were then harvested and the shoot dry weight (DW) was measured manually.

2.2 Feature Extraction and Feature Selection

After image acquisition, the images were analyzed and 39 features were extracted for each plant: the tiller number, 8 texture features and 30 morphological features. The 39 features comprised the 33 features introduced by Yang et al. (2014) plus the differential box-counting dimension (DBC), the ratio of plant area to the area of the bounding rectangle (ABR), greenness area (A_G), yellowness area (A_Y), information fractal dimension (IFD) and the ratio of perimeter to area (PAR). These features were used as potential predictors of DW.
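To make two of the morphological ratios concrete, the following minimal sketch computes ABR and PAR from a binary plant mask. It is an illustration only: the mask representation, the function names and in particular the perimeter definition (foreground pixels with at least one background 4-neighbour) are assumptions, not the definitions used on the RAP platform.

```python
from typing import List

Mask = List[List[int]]  # 1 = plant pixel, 0 = background (assumed encoding)


def plant_area(mask: Mask) -> int:
    """Number of foreground (plant) pixels."""
    return sum(sum(row) for row in mask)


def bounding_rectangle_area(mask: Mask) -> int:
    """Area of the axis-aligned bounding rectangle of the foreground."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)


def perimeter(mask: Mask) -> int:
    """Count foreground pixels that touch background or the image edge
    through a 4-neighbour. One simple definition among several (assumption)."""
    h, w = len(mask), len(mask[0])
    count = 0
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if rr < 0 or rr >= h or cc < 0 or cc >= w or not mask[rr][cc]:
                    count += 1
                    break
    return count


def abr(mask: Mask) -> float:
    """Ratio of plant area to the area of the bounding rectangle."""
    return plant_area(mask) / bounding_rectangle_area(mask)


def par(mask: Mask) -> float:
    """Ratio of perimeter to plant area."""
    return perimeter(mask) / plant_area(mask)


# Toy 3 x 4 mask, not taken from the study.
toy_mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
toy_abr = abr(toy_mask)  # 8 plant pixels in a 3 x 4 rectangle
toy_par = par(toy_mask)  # 6 boundary pixels over 8 plant pixels
```

On a real image the mask would come from a segmentation step, which is outside the scope of this sketch.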

To determine the effective predictors, partial least squares (PLS) regression (Cho et al. 2007) and all subsets regression (ASR) (Montgomery et al. 2012) were carried out. PLS regression was performed in Matlab 2012b. Before PLS regression, the data were normalized to zero mean and unit standard deviation. Leave-one-out cross-validation was used to determine the optimal number of PLS factors. ASR was performed in SAS 9.3, and the Cp criterion was used to select the best subset. The effective predictors were then used as model inputs.
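The normalization and the Cp-based subset search can be sketched in a few lines of plain Python. This is not the Matlab or SAS code used in the study. It assumes Mallows' Cp in the form Cp = SSE_p / s^2 - n + 2p, with s^2 the residual mean square of the full model and p the number of parameters including the intercept, and it assumes that the "best" subset is the one with the smallest Cp. The toy data and all function names are hypothetical. The z-score step is shown only to mirror the normalization applied before PLS; Cp itself does not depend on the scaling of the predictors.

```python
from itertools import combinations
from statistics import mean, stdev


def standardize(column):
    """Scale a column to zero mean and unit (sample) standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def sse(columns, y):
    """Residual sum of squares of an OLS fit of y on columns plus intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(design, y))


def best_subset_by_cp(features, y):
    """Return (Cp, subset) with the smallest Mallows' Cp over all subsets."""
    names = list(features)
    n, k = len(y), len(names)
    sigma2 = sse([features[f] for f in names], y) / (n - k - 1)  # full-model MSE
    best = None
    for size in range(1, k + 1):
        for subset in combinations(names, size):
            p = size + 1  # parameters, including the intercept
            cp = sse([features[f] for f in subset], y) / sigma2 - n + 2 * p
            if best is None or cp < best[0]:
                best = (cp, subset)
    return best


# Toy data (not from the study): y depends on x1 and x2 only.
raw = {
    "x1": [1, 1, 1, 1, -1, -1, -1, -1],
    "x2": [1, 1, -1, -1, 1, 1, -1, -1],
    "x3": [1, -1, 1, -1, 1, -1, 1, -1],
}
y = [1.1, 0.9, 2.9, 3.1, -3.1, -2.9, -0.9, -1.1]
features = {name: standardize(col) for name, col in raw.items()}
best_cp, best_subset = best_subset_by_cp(features, y)
```

An exhaustive search like this visits 2^k - 1 subsets, so with 39 candidate features a statistical package would rely on a branch-and-bound or similar strategy rather than plain enumeration.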

2.3 Model Construction and Comparison

The 402 samples were randomly divided into two subsets at a 2:1 ratio: 268 samples for the training set and 134 samples for the testing set. The training set was used to construct the models and the testing set to evaluate their performance.
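A random 2:1 split of this kind can be sketched as follows. The seed and the function name are assumptions made for reproducibility of the illustration; the source does not state how the random division was generated.

```python
import random


def split_indices(n_samples, n_train, seed=0):
    """Randomly partition sample indices into training and testing sets."""
    rng = random.Random(seed)  # fixed seed: an assumption, for repeatability
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# 402 plants, 268 for training and the remaining 134 for testing.
train_idx, test_idx = split_indices(402, 268)
```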

Six models were developed based on support vector regression (SVR). The radial basis function (RBF) kernel was adopted in this study because it has only one kernel parameter to optimize (\( \gamma \)). The penalty parameter C and the RBF \( \gamma \) are key to the performance of SVR (Brereton and Lloyd 2010). A larger C penalizes training errors more heavily but leads to a more complex model, and an inappropriate \( \gamma \) may lead to overfitting. Three optimization methods, K-fold cross-validation (K-CV; here 5-CV), the genetic algorithm (GA) (Storn and Price 1997) and particle swarm optimization (PSO) (Clerc and Kennedy 2002), were applied and compared for optimizing C and \( \gamma \). The fitness function for GA and PSO was the mean squared error under 5-CV. SVR was implemented with LIBSVM, a popular SVM software package with a Matlab interface developed by Professor Chih-Jen Lin.
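The PSO search over C and \( \gamma \) can be illustrated with a minimal, self-contained sketch. Several things here are assumptions rather than the settings of the study: the inertia weight and acceleration coefficients (0.729 and 1.494, values commonly used with PSO), the swarm size, the search on a log2 scale, and above all the fitness function. `surrogate_cv_mse` is a hypothetical smooth stand-in so that the example runs without an SVM library; in practice it would be replaced by a function that trains an SVR with the given C and \( \gamma \) and returns the mean squared error under 5-CV.

```python
import random


def pso_minimize(fitness, bounds, n_particles=20, n_iter=100, seed=0,
                 w=0.729, c1=1.494, c2=1.494):
    """Minimise `fitness` over the box `bounds` = [(lo, hi), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # best position seen by each particle
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Keep the particle inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def surrogate_cv_mse(params):
    """Stand-in for the 5-CV mean squared error (NOT real SVR output).
    A smooth bowl with its minimum at log2(C) = 3 and log2(gamma) = -5."""
    log2_c, log2_gamma = params
    return 0.5 + 0.01 * (log2_c - 3.0) ** 2 + 0.02 * (log2_gamma + 5.0) ** 2


# Search log2(C) in [-5, 15] and log2(gamma) in [-15, 5] (assumed ranges).
best, best_mse = pso_minimize(surrogate_cv_mse, [(-5.0, 15.0), (-15.0, 5.0)])
best_c, best_gamma = 2.0 ** best[0], 2.0 ** best[1]
```

Searching on a log2 scale is a common convention for SVM hyperparameters because useful values of C and \( \gamma \) span several orders of magnitude. A GA could be plugged into the same fitness function in place of `pso_minimize`.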

For comparison with the SVR models, models based on PLS regression and multiple linear regression (MLR) were also built. In total, 8 models were developed and compared (Table 1). Figure 1 shows the flowchart of the model construction.

Table 1. 8 models developed in this study

Fig. 1. Flowchart of the model construction

For model comparison, the coefficient of determination (\( R^{2} \)), the mean absolute percentage error (MAPE, Eqs. 1-2) and the standard deviation of the absolute percentage error (SAPE, Eq. 3) were computed on the training set and the testing set for each model.

$$ APE_{i} = \frac{|DW_{i.manual} - DW_{i.automatic}|}{DW_{i.manual}} \tag{1} $$

$$ MAPE = \frac{1}{n}\sum\limits_{i = 1}^{n} APE_{i} \tag{2} $$

$$ SAPE = \sqrt{\frac{1}{n - 1}\sum\limits_{i = 1}^{n} (APE_{i} - MAPE)^{2}} \tag{3} $$

where \( DW_{i.automatic} \) is the dry weight measured automatically using the method described, \( DW_{i.manual} \) is the dry weight measured manually, and n is the number of samples.
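Equations 1-3 translate directly into code. The sketch below follows the definitions above, including the n - 1 denominator in SAPE; the function names and the three sample values are illustrative and are not data from the study. The functions return fractions, which are multiplied by 100 to report percentages.

```python
from math import sqrt


def ape(manual, automatic):
    """Absolute percentage error of each sample (Eq. 1), as a fraction."""
    return [abs(m - a) / m for m, a in zip(manual, automatic)]


def mape(manual, automatic):
    """Mean absolute percentage error (Eq. 2)."""
    errors = ape(manual, automatic)
    return sum(errors) / len(errors)


def sape(manual, automatic):
    """Standard deviation of the absolute percentage error (Eq. 3)."""
    errors = ape(manual, automatic)
    centre = sum(errors) / len(errors)
    return sqrt(sum((e - centre) ** 2 for e in errors) / (len(errors) - 1))


# Illustrative dry weights in grams (made-up values).
dw_manual = [10.0, 20.0, 40.0]
dw_automatic = [9.0, 22.0, 40.0]
example_mape = mape(dw_manual, dw_automatic)  # APEs are 0.1, 0.1 and 0.0
example_sape = sape(dw_manual, dw_automatic)
```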

3 Results and Discussion

PLS regression selected 4 PLS factors, and ASR selected a subset of 18 features as the best subset. The selected predictors were used as inputs for the models.

Table 2 compares the 8 models. Note that when the best subset under the Cp criterion was used as the independent variables for MLR modelling, the model suffered from multi-collinearity. The following strategy was therefore used to select the feature subset for MLR: (1) for each \( i = 1,\,2,\, \cdots ,\,39 \), the subset with the maximum \( R^{2} \) among all subsets with i features was deemed the best subset with i features; (2) the 39 best subsets were used as independent variables to build MLR models, and the model with the largest number of independent variables that did not present multi-collinearity was chosen as the optimal MLR model. Finally, a model with 3 independent variables (excluding the constant) was chosen as the optimal MLR model.

Table 2. Comparison of performance of the 8 models

As seen from Table 2, the ASR-GA-SVR model outperformed the other models, with \( R^{2} \) of 0.85, MAPE of 10.20 % and SAPE of 9.20 % for the training set and \( R^{2} \) of 0.79, MAPE of 12.44 % and SAPE of 9.79 % for the testing set. Consequently, the ASR-GA-SVR model was chosen as the optimal DW model. On the training set, the SVR models were generally noticeably better than the PLS and ASR models. On the testing set, however, the performance of the PLS and ASR models was comparable to that of the SVR models. This was because the optimal C and γ were chosen to give the best performance (minimum mean squared error) on the training set, which does not guarantee the best performance on the testing set.

Figures 2, 3 and 4 show the performance of the final DW model (ASR-GA-SVR model), the PLS model and ASR model, respectively.

Fig. 2. Performance of the final DW model (ASR-GA-SVR model)

Fig. 3. Performance of the PLS model

Fig. 4. Performance of the ASR model

4 Conclusions

This paper presents 8 models based on SVR, PLS and ASR for measuring the aboveground biomass of different rice varieties. The results showed that the ASR-GA-SVR model outperformed the other models, with \( R^{2} \) of 0.85, MAPE of 10.20 % and SAPE of 9.20 % for the training set and \( R^{2} \) of 0.79, MAPE of 12.44 % and SAPE of 9.79 % for the testing set. The study extends the application of SVR and intelligent optimization algorithms to the measurement of plant biomass. The method has the potential to improve the accuracy of biomass measurement across different varieties and thus contributes to automatic plant phenotyping.