Topic-Based Hard Clustering of Documents Using Generative Models

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2009)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5722)

Abstract

In this paper, we describe a framework for clustering documents according to their mixtures of topics. The proposed framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram reflects an organization of the documents into sets of topics, while being produced without the effort needed for a soft/fuzzy clustering method. Experimental results obtained on large, real-world collections of documents evidence the effectiveness of our approach in detecting non-overlapping clusters that contain documents sharing similar mixtures of topics.
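The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not name the distance measure or the linkage criterion, so the Hellinger distance (derived from the Bhattacharyya coefficient) and average linkage are assumptions made here for illustration only; the topic mixtures in `docs` are hypothetical values standing in for the output of a generative topic model, and the function names are not taken from the paper.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    # Bhattacharyya coefficient: overlap between the two distributions.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def agglomerative(mixtures, n_clusters):
    """Average-linkage agglomerative clustering over topic mixtures.

    Returns a sorted list of clusters, each a sorted list of document indices.
    Assumption: average linkage; the paper may use a different criterion.
    """
    n = len(mixtures)
    clusters = [[i] for i in range(n)]
    # Precompute all pairwise document distances once.
    dist = {
        (i, j): hellinger(mixtures[i], mixtures[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical topic mixtures (each row sums to 1): five documents, three topics.
docs = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.85, 0.10],
    [0.10, 0.10, 0.80],
]
result = agglomerative(docs, 3)
print(result)
```

Each document ends up in exactly one cluster (a hard assignment), yet the grouping is driven by the similarity of whole topic mixtures rather than by a single dominant topic. Stopping the merge loop at a different `n_clusters` corresponds to cutting the dendrogram at a different level.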

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ponti, G., Tagarelli, A. (2009). Topic-Based Hard Clustering of Documents Using Generative Models. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds) Foundations of Intelligent Systems. ISMIS 2009. Lecture Notes in Computer Science, vol 5722. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04125-9_26

  • DOI: https://doi.org/10.1007/978-3-642-04125-9_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04124-2

  • Online ISBN: 978-3-642-04125-9

  • eBook Packages: Computer Science, Computer Science (R0)
