
1 Introduction

Rice is one of the most important cereals in the world, especially for China (Zhu et al. 2011). Modern breeding techniques can produce thousands of rice varieties daily (Bagge and Lubberstedt, 2008), and a large number of rice germplasm resources remain to be exploited by breeders for rice improvement (Xing and Zhang, 2010). However, characterizing the various rice varieties is technically challenging because the differences among them are slight (Tanabata et al. 2012). Rice variety is also regarded as one of the most important factors related to cooking and processing quality, owing to variations in grain size, shape, and constitution (Zhang, 2007). Therefore, rice variety identification is of great significance.

Because the identification of rice varieties is so important for rice-related research, a considerable amount of work has been reported on it. Namaporn Attaviroj tried to identify rough and pure rice varieties using Fourier-transform NIR (Attaviroj et al., 2011). Liu Hongyun tried to identify rice varieties by their tolerance and sensitivity to copper (Liu et al. 2007). Liu Feng tried to identify rice vinegar varieties using visible and near-infrared spectroscopy (Liu et al. 2011). However, these studies focused on only a few special rice varieties, and the identification of the large number of ordinary rice varieties remains an urgent problem.

Machine vision is a practical technology that has recently been widely applied in agriculture. Huang et al. proposed a dual-camera system for measuring rice panicle length (Huang et al. 2013). A machine-vision facility was developed for rice trait evaluation (Duan et al. 2011). A hyperspectral imaging system was designed for biomass prediction (Feng et al. 2013). Yang et al. applied X-ray computed tomography to rice tiller measurement (Yang et al. 2011). Duan et al. counted filled and unfilled spikelets using bi-modal imaging (Duan et al. 2011). The support vector machine (SVM), first proposed in 1995 by Cortes and Vapnik, has many advantages in nonlinear, small-sample, and high-dimensional pattern recognition, and can be easily extended to other machine learning problems. However, since it was originally designed for binary classification (Cortes and Vapnik 1995; Vapnik, 1999), it requires additional algorithmic support to handle multi-class problems in practice.

This research aimed to propose a feasible method for the rapid identification of rice varieties. In this study, grain shape features and yield-related traits were extracted by image analysis, and a dedicated Multi-SVM classifier was developed to discriminate the rice varieties.

2 Materials and Methods

The rice varieties used in this study were selected from the Chinese core germplasm resources. In total, 79 rice varieties were tested, with four samples per variety. Three quarters of the rice samples were used as the training set and the remainder as the testing set. The rice grains were threshed from the panicles manually, and the filled spikelets were selected with a wind separator.
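The sample counts implied by this design can be checked with a short sketch. It assumes, which the text does not state explicitly, that the split is made within each variety, so that one of the four samples of every variety is held out at random; the function name `split_per_variety` and the fixed seed are illustrative only.

```python
import random

def split_per_variety(n_varieties=79, samples_per_variety=4, seed=0):
    """Hold out one randomly chosen sample per variety (assumed scheme)."""
    rng = random.Random(seed)
    train, test = [], []
    for variety in range(n_varieties):
        held_out = rng.randrange(samples_per_variety)
        for sample in range(samples_per_variety):
            # Each sample is identified by (variety index, sample index).
            target = test if sample == held_out else train
            target.append((variety, sample))
    return train, test

train_set, test_set = split_per_variety()
print(len(train_set), len(test_set))
```

Under this assumption, the 316 samples are divided into 237 training samples and 79 testing samples, with every variety represented in both sets.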

The technical workflow for rice variety identification is shown in Fig. 1. First, the rice grains were imaged and analyzed for shape and weight parameters. Then the features of the training set were used to build the SVM model. Finally, the testing set was applied to the SVM model to evaluate the identification accuracy.

Fig. 1. The technical method for rice variety identification

2.1 Rice Feature Extraction

The features of each rice sample were obtained as shown in Fig. 2. The rice grains of each sample were spread on the scanner manually (Fig. 2a), and the image was acquired and transferred to the computer. The grain image was then analyzed for grain number (GN), grain width (GW), grain length (GL), and grain area (GA) (Figs. 2b and c). The grain weight was obtained with an electronic weighing device (Fig. 2d).

Fig. 2. Rice grain features acquisition

2.2 Multi-SVM Classifier

Since we focus on the classification of rice varieties, the biological classification criteria of rice are relevant. Generally, different types of rice have different genotypes, which usually lead to different phenotypes. An important problem in binary-tree-based SVM (SVM-BTA) algorithms is therefore how to divide the training rice samples into different subsets. For plant classification, we would like the two subsets to differ completely in at least one genotype. Given that distinct varieties all have different genotypes, one possible approach is K-Means clustering, which repeatedly reassigns each data point to the nearest cluster, much like grouping analogous genes. The remaining problem is to determine the evaluation function.

To reduce the running time of the partitioning process, it is necessary to improve the K-Means clustering. Since the input set is always divided into two clusters, pretreating the data with an average threshold algorithm provides a workable initial partition. In the next section, we introduce a partition function for evaluation and then propose MBT-SVM, based on K-Means clustering with the optimal partition function.
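The binary-tree organization described above can be sketched as follows. This is only an illustration of the tree structure: the `halve` function is a placeholder assumption that cuts the label list in the middle, where the paper instead uses clustering guided by a partition function, and the names `Node`, `build_tree`, and `count_internal` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    classes: Tuple[str, ...]       # class labels handled at this node
    left: Optional["Node"] = None  # subtree for the first subset
    right: Optional["Node"] = None # subtree for the second subset

def halve(classes):
    """Placeholder split (assumption): cut the label list in the middle."""
    mid = len(classes) // 2
    return classes[:mid], classes[mid:]

def build_tree(classes, split=halve):
    """Recursively split the class set until every leaf holds one class."""
    node = Node(tuple(classes))
    if len(classes) > 1:
        first, second = split(classes)
        node.left = build_tree(first, split)
        node.right = build_tree(second, split)
    return node

def count_internal(node):
    """Each internal node corresponds to one binary SVM."""
    if node.left is None:
        return 0
    return 1 + count_internal(node.left) + count_internal(node.right)

tree = build_tree([f"variety_{k}" for k in range(79)])
n_binary_svms = count_internal(tree)
print(n_binary_svms)
```

Whatever split rule is used, a binary tree with 79 leaves has 78 internal nodes, so 78 binary classifiers have to be trained for the 79 varieties.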

2.2.1 Partition Function

Define the problem's center \( c_{problem} \) as the mean value of all data in the i-th column of the input of a non-leaf node. The following partition function can be adopted to split the node (Huang et al. 2013):

$$ PF(I_{1} \cup I_{2} ) = \sum\limits_{j = 1}^{2} {\sum\limits_{i = 1}^{l} {\frac{{d(c_{i} ,c_{problem} )}}{{\sigma_{i} }}} } $$
(1)

where \( d(c_{i} ,c_{problem} ) \) is the Euclidean distance and \( \sigma_{i} \) is the variance of column i.
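One possible reading of Eq. 1 is sketched below. It assumes that, for each of the two subsets, the subset centroid is compared column by column with the problem's center, and that each distance is divided by the variance of that column over the whole node input. The function names and the toy data are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pvariance

def column_means(rows):
    """Mean of every feature column."""
    return [fmean(col) for col in zip(*rows)]

def partition_score(subset_a, subset_b):
    """Sum, over both subsets and all columns, of |centroid - center| / variance."""
    all_rows = subset_a + subset_b
    center = column_means(all_rows)                         # c_problem, per column
    variances = [pvariance(col) for col in zip(*all_rows)]  # sigma_i, per column
    score = 0.0
    for subset in (subset_a, subset_b):
        for c_i, p_i, var_i in zip(column_means(subset), center, variances):
            if var_i > 0:  # skip constant columns to avoid division by zero
                score += abs(c_i - p_i) / var_i
    return score

# Toy data (made up): two well-separated groups of two samples each.
group_a = [[1.0, 10.0], [2.0, 11.0]]
group_b = [[8.0, 20.0], [9.0, 21.0]]
separated = partition_score(group_a, group_b)
mixed = partition_score([group_a[0], group_b[0]], [group_a[1], group_b[1]])
print(separated > mixed)
```

Under this reading, a split that keeps similar samples together moves both centroids away from the overall center and therefore scores higher than a split that mixes the groups.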

The larger the PF, the better the partition. This also gives the termination criterion: either traverse all possible combinations to find the largest PF, or reassign the objects one by one until PF no longer increases over a whole round. An initial partition is needed to reduce the running time of the algorithm. The judging function, which uses the average threshold algorithm, is defined as follows:

$$ J(x) = \left\{ {\begin{array}{*{20}c} {1,} & {\sum\limits_{i = 1}^{v} {x_{i} \ge \frac{1}{l}\sum\limits_{j = 1}^{l} {\sum\limits_{i = 1}^{v} {x_{ij} } } } } \\ { - 1,} & {else} \\ \end{array} } \right. $$
(2)

where v is the number of features, l is the number of samples in the training set, and \( x_{ij} \) is the i-th feature of the j-th sample in the training set.

The judging function compares the average value of all the features of a sample with the average value obtained over the whole training set. Since different rice features have different magnitudes, the input data must be standardized. The standardization includes a data integrity check, linear unification, and repacking.
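The judging function of Eq. 2 can be sketched as follows. The sketch assumes that "linear unification" means min-max scaling of each feature to the range [0, 1], which the text does not specify; the function names and the sample values are made up for illustration.

```python
def min_max_scale(rows):
    """Scale every feature column linearly to [0, 1] (assumed standardization)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    # A constant column gets span 1.0 so that it maps to 0 without dividing by zero.
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(x - lo) / sp for x, lo, sp in zip(row, lows, spans)] for row in rows]

def initial_partition(rows):
    """Eq. 2: label a sample 1 if its feature sum reaches the average sum, else -1."""
    sums = [sum(row) for row in rows]
    threshold = sum(sums) / len(sums)
    return [1 if s >= threshold else -1 for s in sums]

# Made-up samples with three features of different magnitude.
samples = [
    [5.0, 2.0, 20.0],
    [5.5, 2.1, 21.0],
    [8.0, 3.0, 30.0],
    [8.2, 3.1, 31.0],
]
labels = initial_partition(min_max_scale(samples))
print(labels)
```

Scaling first keeps a feature with large raw values, such as grain area, from dominating the sum that is compared with the threshold.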

2.3 Kernel Function

The efficiency and accuracy of an SVM are determined by the kernel type and its parameters, as well as the parameter c. To determine the best type of kernel function, the three basic kernel functions can be tried and the one with the best cross-validation accuracy selected. In general, the Gaussian kernel, with a single parameter γ, is a good choice. The values of c and γ are usually determined by a grid search with cross-validation, in which the combination with the highest accuracy is picked from candidate sets such as \( c = \{ 2^{ - 5} ,\,2^{ - 4} ,\,2^{ - 3} , \ldots 2^{8} \} \); \( \gamma = \{ 2^{ - 10} ,\,2^{ - 9} ,\,2^{ - 8} , \ldots 2^{3} \} \). The final model is then trained on the training set with the chosen kernel type and the optimized parameters (Duan et al. 2011). As shown in Fig. 3, an inappropriate combination of kernel type and parameters causes under-fitting or over-fitting.

Fig. 3. Classification models generated by different kernel type and parameters
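The grid search over c and γ can be sketched as below. The two candidate sets follow the ranges given in the text; the scoring callback is an assumption, and `toy_cv_accuracy` is a made-up stand-in for a real cross-validation run of an SVM, used here only so that the sketch is self-contained.

```python
from itertools import product
from math import log2

C_GRID = [2.0 ** k for k in range(-5, 9)]       # 2^-5, 2^-4, ..., 2^8
GAMMA_GRID = [2.0 ** k for k in range(-10, 4)]  # 2^-10, 2^-9, ..., 2^3

def grid_search(cv_accuracy, c_grid=C_GRID, gamma_grid=GAMMA_GRID):
    """Return the (c, gamma) pair with the highest cross-validation score."""
    return max(product(c_grid, gamma_grid), key=lambda pair: cv_accuracy(*pair))

def toy_cv_accuracy(c, gamma):
    """Stand-in score (assumption) that peaks at c = 2^2 and gamma = 2^-3."""
    return -abs(log2(c) - 2) - abs(log2(gamma) + 3)

n_combinations = len(C_GRID) * len(GAMMA_GRID)
best_c, best_gamma = grid_search(toy_cv_accuracy)
print(n_combinations, best_c, best_gamma)
```

With 14 candidate values for each parameter, every kernel evaluated this way requires 196 cross-validation runs per tree node, which is one reason the training time grows quickly with the number of classes.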

3 Results and Discussion

3.1 Grain Traits Extraction Accuracy

In total, 79 copies of rice grains were measured both automatically and manually, and GN, GL, GW, GA, and grain weight were obtained for each copy. The measurement accuracy for each trait is shown in Fig. 4; the mean absolute percentage error (MAPE) was calculated according to Eq. 3.

Fig. 4. System measurement accuracy evaluation. (a) Grain number, (b) Grain length, (c) Grain width

$$ MAPE = \frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{\left| {x_{ai} - x_{mi} } \right|}}{{x_{mi} }}} $$
(3)

The measurement results showed that the average MAPE was 1.33 % for grain number, 1.25 % for grain length, and 2.20 % for grain width. These results demonstrate that the automatic measurements agreed well with the manual measurements.
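Eq. 3 can be computed with a few lines of code. The sketch assumes that \( x_{ai} \) denotes the automatic measurement and \( x_{mi} \) the manual measurement of the i-th sample, which the subscripts suggest but the text does not define; the values below are made up and are not data from this study.

```python
def mape(automatic, manual):
    """Mean absolute percentage error of automatic vs. manual values (Eq. 3)."""
    if len(automatic) != len(manual):
        raise ValueError("both sequences must have the same length")
    return sum(abs(a - m) / m for a, m in zip(automatic, manual)) / len(manual)

# Made-up grain counts for two samples.
example = mape([101, 198], [100, 200])
print(f"{100 * example:.2f} %")
```

Because the manual value appears in the denominator, the measure is relative, so errors for traits with different units and magnitudes can be compared directly.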

3.2 Rice Varieties Identification

The rice feature measurements are shown above. The results indicated that the between-variety and within-variety differences were both slight. In the SVM identification, the input data of each tree node were first partitioned by a clustering method; an RBF-kernel SVM was then tried first, and the linear kernel was the last one to be tried. A grid search based on cross-validation was employed for parameter optimization. The results for the whole experiment are shown in Table 1. The training time includes partitioning, tree construction, and SVM training. It is reported as CPU running time (in milliseconds); on a quad-core computer, the real time is approximately a quarter of the CPU time.

Table 1. The class identification results by Multi-SVM

The standard data sets from UCI and the rice sample data were all tested, and the classification results are shown in Table 1. The algorithm achieved high classification accuracy on the standard data sets. Since the testing samples were randomly selected, every data set was tested only three times and the worst result was recorded, in order to avoid anthropogenic interference (such as repeatedly running the algorithm until it produces a good output). The rice-related results in Table 1 show that building a classification binary tree for a data set with many classes takes a relatively long time.

Rice-s79-1 was a basic pre-experiment focusing on verifying the compatibility of the algorithm. No linear kernel function was used, and the number of attributes was fixed at 13. Rice-s79-2 was a grid-search experiment aimed at finding a suitable number of attributes for the next experiment. Since the algorithm needed to traverse all possible combinations of the attributes, and four kernel functions had to be examined for each combination, the training time was naturally very long (nearly ten hours of real time). After tracking the misclassified samples, we found that about half of the errors occurred in tree nodes with a relatively high cross-validation accuracy rather than in those with a low one. This indicates that the model was over-fitted and that the optimization function needs to be improved: the SVM classifier should be picked with an appropriate cross-validation accuracy rather than the highest one. Rice-s79-3 was the result of the formal experiment. As mentioned above, the experiment was repeated three times and the result with the worst accuracy was recorded. The average accuracy was about 79.74 %.

4 Conclusions

In this paper, a support vector machine working in the multi-space-mapped mode (MBT) was proposed for a rice multi-class classification task. The results showed that the method achieved high accuracy in grain trait extraction and good performance in rice variety classification. In future work, we will further analyze the data of the tree nodes from the experiments to develop more effective algorithms. The range of the parameters will be adapted to the size of the input training set, which should greatly reduce the computation time of the algorithm. The method could then be applied to larger rice sample sets for variety recognition. The results also suggest that this method can contribute to automated vision-based rice evaluation systems.