Text Modeling Using Multinomial Scaled Dirichlet Distributions

Zamzami, Nuha; Bouguila, Nizar

doi:10.1007/978-3-319-92058-0_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10868)

Included in the following conference series:

International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems

Abstract

The Dirichlet Compound Multinomial (DCM), the composition of the Dirichlet and the multinomial, is a widely accepted generative model for text documents that takes burstiness into account. However, recent research has shown that the Dirichlet is not the best choice of prior for the multinomial. In this paper, we propose a novel model called the Multinomial Scaled Dirichlet (MSD) distribution, which is the composition of the scaled Dirichlet distribution and the multinomial. Moreover, we investigate Expectation Maximization (EM) with the MSD mixture model as a new clustering algorithm for documents. Experiments show that the new model is competitive with the best state-of-the-art methods on different text data sets.
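
As a rough illustration of what "composition of the scaled Dirichlet and the multinomial" means generatively, the sketch below draws one bag-of-words count vector in two steps: first a word-probability vector from a scaled Dirichlet, then word counts from a multinomial with those probabilities. It is not the authors' implementation. It assumes the common construction in which a scaled Dirichlet draw is obtained by normalizing independent Gamma variates with shapes `alpha` and rates `beta`; the function name `sample_msd_document` and the parameter values are hypothetical. Under this assumption, setting all `beta` entries equal recovers the ordinary Dirichlet, and hence the DCM.

```python
import random

def sample_msd_document(alpha, beta, length, rng):
    """Draw one count vector: scaled Dirichlet prior, then multinomial.

    Assumption (not taken from the abstract): the scaled Dirichlet is
    sampled by normalizing independent Gamma(shape=alpha_w, rate=beta_w)
    variates. Equal beta values reduce this to the plain Dirichlet.
    """
    # gammavariate takes (shape, scale), so rate beta_w becomes scale 1/beta_w.
    gammas = [rng.gammavariate(a, 1.0 / b) for a, b in zip(alpha, beta)]
    total = sum(gammas)
    theta = [g / total for g in gammas]  # word probabilities for this document

    # Multinomial step: draw `length` word tokens and count them per word.
    counts = [0] * len(alpha)
    for w in rng.choices(range(len(alpha)), weights=theta, k=length):
        counts[w] += 1
    return counts

rng = random.Random(0)
alpha = [0.5, 0.5, 0.5, 0.5]   # hypothetical shape parameters
beta = [1.0, 1.0, 4.0, 4.0]    # hypothetical scale parameters
doc = sample_msd_document(alpha, beta, 50, rng)
print(doc)
```

Small `alpha` values tend to concentrate each document's probability mass on a few words, which is the burstiness behaviour the DCM is meant to capture; the extra `beta` vector gives the prior additional freedom beyond that.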

Author information

Authors and Affiliations

Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Nuha Zamzami & Nizar Bouguila
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Nuha Zamzami

Corresponding author

Correspondence to Nuha Zamzami.

Editor information

Editors and Affiliations

University of Regina, Regina, SK, Canada
Malek Mouhoub
University of Regina, Regina, SK, Canada
Samira Sadaoui
Concordia University, Montreal, QC, Canada
Otmane Ait Mohamed
Texas State University, San Marcos, TX, USA
Moonis Ali

Cite this paper

Zamzami, N., Bouguila, N. (2018). Text Modeling Using Multinomial Scaled Dirichlet Distributions. In: Mouhoub, M., Sadaoui, S., Ait Mohamed, O., Ali, M. (eds) Recent Trends and Future Technology in Applied Intelligence. IEA/AIE 2018. Lecture Notes in Computer Science, vol 10868. Springer, Cham. https://doi.org/10.1007/978-3-319-92058-0_7

DOI: https://doi.org/10.1007/978-3-319-92058-0_7
Published: 30 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92057-3
Online ISBN: 978-3-319-92058-0
eBook Packages: Computer Science, Computer Science (R0)