Deriving Probabilistic SVM Kernels from Exponential Family Approximations to Multivariate Distributions for Count Data

  • Nuha Zamzami
  • Nizar Bouguila
Part of the Unsupervised and Semi-Supervised Learning book series (UNSESUL)


This work proposes a robust hybrid probabilistic learning approach that combines the advantages of generative and discriminative models for modeling count data. We build new probabilistic kernels for support vector machines (SVMs), based on information divergences and the Fisher score, from efficient approximations to multivariate distributions. More precisely, we derive probabilistic kernels from mixtures of exponential family approximations to two powerful generative models for count data, namely the Dirichlet compound multinomial (DCM) and the generalized Dirichlet multinomial (GDM). The resulting hybrid models serve as effective SVM kernels that incorporate prior knowledge about the nature of the data involved in the problem at hand and therefore permit good data discrimination. We demonstrate the flexibility and merits of the proposed frameworks on the problem of analyzing activities in surveillance scenes.
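To make the idea of a divergence-based probabilistic kernel concrete, the following is a minimal sketch and not the chapter's method: it fits a smoothed multinomial to each count vector, where the chapter uses mixtures of exponential family approximations to the DCM and GDM, and exponentiates the negative symmetric Kullback-Leibler divergence. The function names and the parameters a (kernel scale) and alpha (smoothing constant) are illustrative assumptions.

```python
import math


def smoothed_multinomial(counts, alpha=1.0):
    """Estimate multinomial parameters from a count vector.

    Additive smoothing (alpha, an assumed constant) keeps every
    probability strictly positive so the divergence stays finite.
    """
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def symmetric_kl(p, q):
    """Symmetric KL divergence J(p, q) = KL(p||q) + KL(q||p).

    The two sums combine into sum_i (p_i - q_i) * log(p_i / q_i).
    """
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_kernel(x, y, a=1.0, alpha=1.0):
    """Kernel value exp(-a * J(p_x, p_y)) for two count vectors.

    Stand-in generative model: one smoothed multinomial per vector.
    """
    p = smoothed_multinomial(x, alpha)
    q = smoothed_multinomial(y, alpha)
    return math.exp(-a * symmetric_kl(p, q))


if __name__ == "__main__":
    # Hypothetical bag-of-words count vectors over a 4-word vocabulary.
    doc_a = [5, 0, 2, 1]
    doc_b = [4, 1, 2, 0]
    doc_c = [0, 6, 0, 3]
    print(kl_kernel(doc_a, doc_b))  # similar count profiles
    print(kl_kernel(doc_a, doc_c))  # dissimilar count profiles
```

A Gram matrix filled with such values could be passed to an SVM implementation that accepts precomputed kernels. Kernels of this exponentiated-divergence form are not guaranteed to be positive semidefinite in general, so the scale a typically needs tuning.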


Generative/discriminative learning; Count data; Exponential family approximation; Finite mixtures; Dirichlet compound multinomial (DCM); Generalized Dirichlet multinomial (GDM); Support vector machines (SVMs); Probabilistic kernels; Activity analysis



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Canada
  2. Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
