Supervised learning via smoothed Polya trees

  • William Cipolli III
  • Timothy Hanson
Regular Article


We propose a generative classification model that extends quadratic discriminant analysis (QDA) (Cox in J R Stat Soc Ser B (Methodol) 20:215–242, 1958) and linear discriminant analysis (LDA) (Fisher in Ann Eugen 7:179–188, 1936; Rao in J R Stat Soc Ser B 10:159–203, 1948) to the Bayesian nonparametric setting, providing a competitor to MclustDA (Fraley and Raftery in J Am Stat Assoc 97:611–631, 2002). The approach models the data distribution of each class with a multivariate Polya tree and achieves impressive results in simulations and real data analyses. The flexibility gained by further relaxing the distributional assumptions of QDA can greatly improve classification of new observations when the data deviate severely from parametric distributional assumptions, while performance remains strong when those assumptions hold. The proposed method is fast compared to other supervised classifiers and simple to implement, requiring no kernel tricks or initialization steps, which makes it one of the more user-friendly approaches to supervised learning. This is a significant feature of the proposed methodology, since suboptimal tuning can greatly hamper classification performance; for example, SVMs fit with non-optimal kernels perform significantly worse.
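The generative-classification principle underlying the method can be sketched as follows: fit a class-conditional density for each class and assign a new observation to the class maximizing the posterior p(c | x) ∝ p(x | c) p(c). This is a minimal illustration only; the paper's smoothed Polya tree density is replaced here by a Gaussian kernel density estimate as a stand-in, and the toy data and class priors are invented for the example.

```python
# Generative classification via Bayes' rule: p(c | x) is proportional to
# p(x | c) * p(c). The paper models p(x | c) with a multivariate smoothed
# Polya tree; a Gaussian KDE stands in here purely for illustration.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Two toy classes in two dimensions with well-separated means.
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))

# One class-conditional density estimate per class (gaussian_kde expects
# data with shape (dimension, n_samples), hence the transpose).
densities = [gaussian_kde(X0.T), gaussian_kde(X1.T)]
priors = np.array([0.5, 0.5])  # assumed equal class priors p(c)

def classify(x):
    """Assign x to the class maximizing p(x | c) * p(c)."""
    likelihoods = np.array([d(x.reshape(2, 1))[0] for d in densities])
    return int(np.argmax(likelihoods * priors))

print(classify(np.array([-2.0, 0.0])))  # a point near class 0's mean
print(classify(np.array([2.5, 0.5])))   # a point near class 1's mean
```

Any flexible density estimator can be dropped into this scheme; the paper's contribution is that the multivariate Polya tree makes the per-class density fits both flexible and computationally cheap.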


Keywords: Bayesian nonparametric · Density estimation · Classification

Mathematics Subject Classification

  • 62H30 – Classification and discrimination; cluster analysis
  • 62G99 – Nonparametric inference
  • 62C10 – Bayesian problems; characterization of Bayes procedures

Supplementary material

Supplementary material 1: 11634_2018_344_MOESM1_ESM.pdf (131 KB)


  1. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M, Kudlur M, Levenberg J, Monga R, Moore S, Murray DG, Steiner B, Tucker PA, Vasudevan V, Warden P, Wicke M, Yu Y, Zhang X (2016) TensorFlow: a system for large-scale machine learning. In: OSDI, vol 16, pp 265–283
  2. Alpaydin E (2014) Introduction to machine learning (adaptive computation and machine learning). The MIT Press, Cambridge
  3. Anderson JA, Rosenfeld E (eds) (1988) Neurocomputing: foundations of research. MIT Press, Cambridge
  4. Bensmail H, Celeux G (1996) Regularized Gaussian discriminant analysis through eigenvalue decomposition. J Am Stat Assoc 91:1743–1748
  5. Bergé L, Bouveyron C, Girard S (2012) HDclassif: an R package for model-based clustering and discriminant analysis of high-dimensional data. J Stat Softw 46(6):1–29
  6. Beygelzimer A, Kakadet S, Langford J, Arya S, Mount D, Li S (2013) FNN: fast nearest neighbor search algorithms and applications. R package version 1.1
  7. Blackwell D, MacQueen JB (1973) Ferguson distributions via Polya urn schemes. Ann Stat 1:353–355
  8. Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on computational learning theory. ACM, pp 144–152
  9. Bouveyron C, Girard S, Schmid C (2007) High-dimensional discriminant analysis. Commun Stat Theory Methods 36:2607–2623
  10. Breiman L (1996) Bagging predictors. Mach Learn 24:123–140
  11. Breiman L (2001) Random forests. Mach Learn 45:5–32
  12. Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2:121–167
  13. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28
  14. Cipolli W, Hanson T (2017) Computationally tractable approximate and smoothed Polya trees. Stat Comput 27(1):39–51
  15. Cipolli W, Hanson T, McLain A (2016) Bayesian nonparametric multiple testing. Comput Stat Data Anal 101:64–79
  16. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297
  17. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13:21–27
  18. Cox DR (1958) The regression analysis of binary sequences. J R Stat Soc Ser B (Methodol) 20:215–242
  19. Cox DR (1966) Some procedures associated with the logistic qualitative response curve. Wiley, New York
  20. Deng H (2014) Interpreting tree ensembles with inTrees. arXiv preprint arXiv:1408.5456
  21. Duan K, Keerthi SS (2005) Which is the best multiclass SVM method? An empirical study. In: Proceedings of the sixth international workshop on multiple classifier systems, pp 278–285
  22. Duda RO, Hart PE, Stork DG (2000) Pattern classification, 2nd edn. Wiley, New York
  23. Dudani SA (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6:325–327
  24. Ferguson TS (1974) Prior distributions on spaces of probability measures. Ann Stat 2:615–629
  25. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7:179–188
  26. Florida R (2011) America's great passport divide. Accessed 15 Mar 2011
  27. Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611–631
  28. Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc 84:165–175
  29. Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore
  30. Hannah LA, Blei DM, Powell WB (2011) Dirichlet process mixtures of generalized linear models. J Mach Learn Res 12:1923–1953
  31. Hanson T (2006) Inference for mixtures of finite Polya tree models. J Am Stat Assoc 101:1548–1565
  32. Hanson T, Branscum A, Gardner I (2008) Multivariate mixtures of Polya trees for modelling ROC data. Stat Model 8:81–96
  33. Hanson T, Chen Y (2014) Bayesian nonparametric k-sample tests for censored and uncensored data. Comput Stat Data Anal 71:335–346
  34. Hanson T, Monteiro J, Jara A (2011) The Polya tree sampler: towards efficient and automatic independent Metropolis–Hastings proposals. J Comput Graph Stat 20:41–62
  35. Hastie T, Tibshirani R (1996) Discriminant analysis by Gaussian mixtures. J R Stat Soc Ser B (Methodol) 58:155–176
  36. Hastie T, Tibshirani R (1998) Classification by pairwise coupling. Ann Stat 26:451–471
  37. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, Berlin
  38. Ho TK (1995) Random decision forests. In: Third international conference on document analysis and recognition (ICDAR 1995), August 14–15, 1995, Montreal, Canada, vol I, pp 278–282
  39. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24(6):417–441
  40. Izenman AJ (1991) Recent developments in nonparametric density estimation. J Am Stat Assoc 86:205–224
  41. Jara A, Hanson T, Lesaffre E (2009) Robustifying generalized linear mixed models using a new class of mixtures of multivariate Polya trees. J Comput Graph Stat 18:838–860
  42. Jiang L, Wang D, Cai Z, Yan X (2007) Survey of improving naive Bayes for classification. In: Proceedings of the 3rd international conference on advanced data mining and applications. Springer, pp 134–145
  43. Karsoliya S (2012) Approximating number of hidden layer neurons in multiple hidden layer BPNN architecture. Int J Eng Trends Technol 12:714–717
  44. Kotsiantis SB (2007) Supervised machine learning: a review of classification. Informatica 31:249–268
  45. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A (2006) Machine learning in bioinformatics. Brief Bioinform 17:86–112
  46. Lavine M (1992) Some aspects of Polya tree distributions for statistical modelling. Ann Stat 20:1222–1235
  47. Lavine M (1994) More aspects of Polya tree distributions for statistical modelling. Ann Stat 22:1161–1176
  48. Ledl T (2004) Kernel density estimation: theory and application in discriminant analysis. Austrian J Stat 33:267–279
  49. Leisch F, Dimitriadou E (2015) mlbench: machine learning benchmark problems. R package version 2.1-1
  50. Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18–22
  51. Ma J, Yu MK, Fong S, Ono K, Sage E, Demchak B, Sharan R, Ideker T (2018) Using deep learning to model the hierarchical structure and function of a cell. Nat Methods 15:290–298
  52. Ma Y, Guo G (2014) Support vector machines applications. Springer, Berlin
  53. Mantel N (1966) Models for complex contingency tables and polychotomous dosage response curves. Biometrics 22:83–95
  54. Marzio M, Taylor CC (2005) On boosting kernel density methods for multivariate data: density estimation and classification. Stat Methods Appl 14:163–178
  55. Mauldin RD, Sudderth WD, Williams SC (1992) Polya trees and random distributions. Ann Stat 20:1203–1221
  56. Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2015) e1071: misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien. R package version 1.6-7
  57. Migration Policy Institute (2014) State immigration data profiles. Accessed 13 Mar 2016
  58. Mohri M, Rostamizadeh A, Talwalkar A (2012) Foundations of machine learning. The MIT Press, Cambridge
  59. Montavon G, Lapuschkin S, Binder A, Samek W, Müller K-R (2017) Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit 65:211–222
  60. Montavon G, Samek W, Müller K-R (2018) Methods for interpreting and understanding deep neural networks. Digit Sig Process 73:1–15
  61. Mukhopadhyay S, Ghosh A (2011) Bayesian multiscale smoothing in supervised and semi-supervised kernel discriminant analysis. Comput Stat Data Anal 55:2344–2353
  62. Müller P, Rodriguez A (2013) Chapter 4: Polya trees. In: NSF-CBMS regional conference series in probability and statistics, vol 9. Institute of Mathematical Statistics and American Statistical Association, pp 43–51
  63. National Archives and Records Administration (2012) Historical election results. Accessed 13 Mar 2016
  64. Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Advances in neural information processing systems, pp 841–848
  65. Paddock S, Ruggeri F, Lavine M, West M (2003) Randomised Polya tree models for nonparametric Bayesian inference. Statistica Sinica 13:443–460
  66. Pati D, Bhattacharya A, Pillai NS, Dunson D (2014) Posterior contraction in sparse Bayesian factor models for massive covariance matrices. Ann Stat 42(3):1102–1130
  67. Plastria F, De Bruyne S, Carrizosa E (2008) Dimensionality reduction for classification. In: International conference on advanced data mining and applications. Springer, pp 411–418
  68. R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria
  69. Rao CR (1948) The utilization of multiple measurements in problems of biological classification. J R Stat Soc Ser B 10:159–203
  70. Ripley BD (2007) Pattern recognition and neural networks. Cambridge University Press, Cambridge
  71. Rish I (2001) An empirical study of the naive Bayes classifier. Technical report, IBM
  72. Rojas R (1996) Neural networks: a systematic introduction. Springer, New York
  73. Runcie DE, Mukherjee S (2013) Dissecting high-dimensional phenotypes with Bayesian sparse factor analysis of genetic covariance matrices. Genetics 194(3):753–767
  74. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
  75. Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
  76. Scrucca L, Fop M, Murphy TB, Raftery AE (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J 8(1):205–233
  77. Shahbaba B, Neal R (2009) Nonlinear models using Dirichlet process mixtures. J Mach Learn Res 10:1829–1850
  78. Steinwart I, Christmann A (2008) Support vector machines. Springer, Berlin
  79. Tax Foundation (2007) Federal taxes paid vs. federal spending received by state, 1981–2005. Accessed 13 Mar 2016
  80. Tsang IW, Kwok JT, Cheung P-M (2005) Core vector machines: fast SVM training on very large data sets. J Mach Learn Res 6:363–392
  81. United States Census Bureau (2010) American community survey, educational attainment for states, percent with high school diploma and with bachelor's degree: 2010. Accessed 13 Mar 2016
  82. United States Census Bureau (2014) State median income. Accessed 13 Mar 2016
  83. United States Department of State Bureau of Consular Affairs (2015) U.S. passports and international travel: passport statistics. Accessed 13 Mar 2016
  84. Vapnik VN (1979) Estimation of dependences based on empirical data. Nauka, USSR (in Russian)
  85. Vapnik VN, Chervonenkis A (1963) A note on one class of perceptrons. Autom Remote Control 25:774–780
  86. Vapnik VN, Lerner A (1962) Pattern recognition using generalized portrait method. Autom Remote Control 24:709–715
  87. Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York. ISBN 0-387-95457-0
  88. Wong WH, Ma L (2010) Optional Polya tree and Bayesian inference. Ann Stat 38:1433–1459
  89. Yegnanarayana B (2004) Artificial neural networks. Prentice-Hall, New Jersey
  90. Zambom AZ, Dias R (2013) A review of kernel density estimation with applications to econometrics. Int Econ Rev (IER) 5:20–42

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Mathematics, Colgate University, Hamilton, USA
  2. Department of Statistics, University of South Carolina, Columbia, USA
