Topic-Based Hard Clustering of Documents Using Generative Models

Ponti, Giovanni; Tagarelli, Andrea

doi:10.1007/978-3-642-04125-9_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5722)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

In this paper, we describe a framework for clustering documents according to their mixtures of topics. The proposed framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram reflects an organization of the documents into sets of topics, yet it is produced without the effort required by a soft/fuzzy clustering method. Experimental results on large, real-world document collections demonstrate the effectiveness of our approach in detecting non-overlapping clusters that contain documents sharing similar mixtures of topics.
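
The pipeline outlined in the abstract (topic mixtures per document, a distance between mixtures, agglomerative merging) can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract names neither the distance measure nor the linkage criterion, so the Hellinger distance and average linkage used here are assumptions, as are the function names and the toy topic mixtures, which stand in for the output of a generative model.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Assumed stand-in for the unspecified information-theoretic measure;
    it is 0 for identical distributions and 1 for disjoint supports.
    """
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


def agglomerative(mixtures, k):
    """Average-linkage agglomerative clustering over topic mixtures.

    Starts from singleton clusters and merges the closest pair until
    k clusters remain (one cut of the dendrogram). Each document ends
    up in exactly one cluster, i.e. a hard clustering.
    """
    clusters = [[i] for i in range(len(mixtures))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(hellinger(mixtures[i], mixtures[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical per-document topic mixtures over three topics.
docs = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
result = agglomerative(docs, 2)
print(sorted(result))
```

On these toy mixtures, documents 0 and 1 share a dominant first topic and documents 2 and 3 a dominant third topic, so a two-cluster cut groups them accordingly. The quadratic pair search is kept for readability and would not scale to the large collections the abstract mentions.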

Author information

Authors and Affiliations

Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy
Giovanni Ponti & Andrea Tagarelli

Editor information

Editors and Affiliations

Faculty of Informatics and Statistics, University of Economics, W. Churchill Sq. 4, 130 67, Prague 3, Czech Republic
Jan Rauch
Department of Computer Science, University of North Carolina, NC 27599-3175, Charlotte, USA
Zbigniew W. Raś
Faculty of Informatics and Statistics, University of Economics, W. Churchill Sq. 4, 130 67, Prague, Czech Republic
Petr Berka
Institute of Software Systems, Tampere University of Technology, P. O. Box 553, 33101, Tampere, Finland
Tapio Elomaa

About this paper

Cite this paper

Ponti, G., Tagarelli, A. (2009). Topic-Based Hard Clustering of Documents Using Generative Models. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds) Foundations of Intelligent Systems. ISMIS 2009. Lecture Notes in Computer Science, vol 5722. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04125-9_26

DOI: https://doi.org/10.1007/978-3-642-04125-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04124-2
Online ISBN: 978-3-642-04125-9
eBook Packages: Computer Science, Computer Science (R0)
