Journal of Analysis and Testing, Volume 2, Issue 3, pp 274–289

Analysis of NIR spectroscopic data using decision trees and their ensembles

  • Sergey Kucheryavskiy
Original Paper


Decision trees and their ensembles have become quite popular for data analysis over the past decade. One of the main reasons is the current boom in big data, where traditional statistical methods (such as multiple linear regression) are not very efficient. In chemometrics, however, these methods are still not widespread, primarily because of several limitations related to the ratio between the number of variables and the number of observations. This paper presents several examples of how decision trees and their ensembles can be used in the analysis of NIR spectroscopic data for both regression and classification. We aim to cover all important aspects, including optimization and validation of models, evaluation of results, treatment of missing data, and selection of the most important variables. The performance and outcome of the decision tree-based methods are compared with a more traditional approach based on partial least squares.


Keywords: NIR spectroscopy; Decision trees; Classification and regression trees; Random forests


Copyright information

© The Nonferrous Metals Society of China 2018

Authors and Affiliations

  1. Department of Chemistry and Bioscience, Aalborg University, Esbjerg, Denmark