Efficient genomic selection using ensemble learning and ensemble feature reduction

Abstract

Genomic selection (GS) is a widely used breeding method that predicts plant phenotypes from genome-wide markers. Empirical studies and simulations have shown that GS can greatly accelerate the breeding cycle, beyond what is possible with traditional quantitative trait locus (QTL) approaches. GS is framed as a regression problem in which single-nucleotide polymorphisms (SNPs) are typically used to predict phenotypes. Because SNP data are extremely high-dimensional, on the order of 100,000 dimensions, accurate phenotypic prediction is difficult, and finding the optimal prediction model is computationally very costly. Out of thousands of SNPs, usually only a few influence a particular phenotypic trait. We first show that ensemble-based regression techniques give better prediction accuracy than the traditional regression methods used in earlier studies. We then improve prediction accuracy further by using an ensemble of feature selection and feature extraction techniques, which also reduces the time needed to compute the regression model parameters. We predict three traits (grain yield, time to 50% flowering and plant height), for which existing methods give accuracies of 0.304, 0.627 and 0.341, respectively. Our proposed regression model gives accuracies of 0.330, 0.674 and 0.458 for these traits. We also propose a computationally efficient regression model that reduces computation time by as much as 90% and gives accuracies of 0.342, 0.580 and 0.411, respectively.
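The general idea in the abstract, reducing the marker set and then averaging many regressors, can be illustrated with a small sketch. This is not the authors' pipeline: the abstract does not name the specific learners or reduction techniques, so the sketch assumes a simple correlation filter for feature selection and bagged per-marker least-squares fits as the ensemble. The data are simulated, every function name is hypothetical, and accuracy is assumed here to mean the Pearson correlation between predicted and observed phenotypes, which the abstract does not state.

```python
import random
from statistics import fmean

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def top_k_markers(X, y, k):
    """Filter-style selection: keep the k markers most correlated with y."""
    n_markers = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_markers)]
    return sorted(range(n_markers), key=lambda j: -scores[j])[:k]

def fit_univariate(x, y):
    """Least-squares slope and intercept for a single marker."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return slope, my - slope * mx

def fit_bagged_ensemble(X, y, k=5, n_models=20, seed=0):
    """Each member: bootstrap resample, select k markers, fit one regression per marker."""
    rng = random.Random(seed)
    n = len(X)
    members = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        cols = top_k_markers(Xb, yb, k)
        fits = [(j, *fit_univariate([row[j] for row in Xb], yb)) for j in cols]
        members.append(fits)
    return members

def predict(members, row):
    """Average, over members, of the mean per-marker prediction."""
    return fmean(
        fmean(slope * row[j] + intercept for j, slope, intercept in fits)
        for fits in members
    )

def simulate(n_lines=120, n_markers=200, causal=(3, 50, 120), seed=1):
    """Simulated genotypes coded 0/1/2; only the 'causal' markers affect the trait."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(n_markers)] for _ in range(n_lines)]
    y = [sum(row[j] for j in causal) + rng.gauss(0, 0.5) for row in X]
    return X, y

X, y = simulate()
X_train, y_train, X_test, y_test = X[:90], y[:90], X[90:], y[90:]
members = fit_bagged_ensemble(X_train, y_train)
y_pred = [predict(members, row) for row in X_test]
accuracy = pearson(y_pred, y_test)  # assumed accuracy measure
print(f"held-out accuracy (Pearson r): {accuracy:.3f}")
```

Each regression in the sketch sees only k markers instead of all 200, which is the same mechanism by which feature reduction would cut model-fitting time on real SNP panels. The size of any saving on real data depends on the learners and reduction methods actually used.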

Funding

This research did not receive any specific funding.

Author information

Corresponding author

Correspondence to Rohan Banerjee.

Ethics declarations

Conflict of interest

The authors declare no conflicts of interest.

About this article

Cite this article

Banerjee, R., Marathi, B. & Singh, M. Efficient genomic selection using ensemble learning and ensemble feature reduction. J. Crop Sci. Biotechnol. 23, 311–323 (2020). https://doi.org/10.1007/s12892-020-00039-4

Keywords

  • Genomic selection
  • Machine learning
  • Rice
  • Dimensionality reduction