Applied Intelligence, Volume 49, Issue 11, pp 3783–3800

Hybrid generative discriminative approaches based on Multinomial Scaled Dirichlet mixture models

  • Nuha Zamzami
  • Nizar Bouguila


Both generative and discriminative techniques for classification have progressed significantly in the last few years. Given the capabilities and limitations of each, hybrid generative discriminative approaches have received increasing attention. Our goal is to combine the advantages and desirable properties of generative models, i.e. finite mixtures, with those of Support Vector Machines (SVMs), which are powerful discriminative techniques, for modeling count data that appear in many machine learning and computer vision applications. In particular, we select accurate kernels generated from mixtures of the Multinomial Scaled Dirichlet distribution and its exponential approximation (EMSD) for support vector machines. We demonstrate the effectiveness and merits of the proposed framework through challenging real-world applications, namely object recognition and visual scene classification. Large-scale datasets have been considered in the empirical study, such as Microsoft MOCR, Fruits-360 and MIT Places.
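To make the idea of a kernel generated from a generative model concrete, the following is a minimal sketch and not the method of the paper. It assumes a much simpler generative model than a Multinomial Scaled Dirichlet mixture: each count vector is summarized by a single smoothed multinomial, and two vectors are compared with the Bhattacharyya kernel (a probability product kernel with exponent 1/2). The function names, the smoothing constant `alpha`, and the toy data are all hypothetical.

```python
import math


def smoothed_multinomial(counts, alpha=1.0):
    """Estimate multinomial parameters from one count vector.

    Additive smoothing with `alpha` (an assumed value) keeps every
    probability strictly positive.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def bhattacharyya_kernel(p, q):
    """Probability product kernel with exponent 1/2 for discrete distributions."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def gram_matrix(count_vectors, alpha=1.0):
    """Fit one generative model per sample, then compare models pairwise."""
    models = [smoothed_multinomial(x, alpha) for x in count_vectors]
    return [[bhattacharyya_kernel(p, q) for q in models] for p in models]


# Toy count data (hypothetical): the first two vectors share a dominant
# first feature, the third is dominated by the second feature.
docs = [[5, 0, 1], [4, 1, 0], [0, 6, 2]]
K = gram_matrix(docs)
```

The resulting Gram matrix `K` is symmetric with a unit diagonal, and it could be passed to any SVM implementation that accepts a precomputed kernel. Replacing the single multinomial with a fitted mixture model would change only the model-fitting and kernel functions, not the overall generative-then-discriminative pipeline.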


Generative/discriminative learning, Count data, Exponential family, Finite mixtures, Multinomial Scaled Dirichlet, SVMs, Kernels




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, Canada
  2. Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
