Abstract
This paper presents a statistical model for discovering topical clusters of words in unstructured text. The model has a hierarchical Bayesian structure and also identifies topically coherent segments of text. It assigns each segment to a particular topic and thereby categorizes the corresponding document under potentially multiple topics. We present initial results indicating that the word topics discovered by the proposed model are more consistent than those of other models. Our early experiments on a real text corpus show that the model's clustering performance compares well with that of other clustering models, which do not provide topic segmentation. Its segmentation performance is also comparable to that of a recently proposed segmentation model, which does not provide document clustering.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shafiei, M.M., Milios, E.E. (2008). A Statistical Model for Topic Segmentation and Clustering. In: Bergler, S. (ed.) Advances in Artificial Intelligence. Canadian AI 2008. Lecture Notes in Computer Science, vol 5032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68825-9_27
DOI: https://doi.org/10.1007/978-3-540-68825-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68821-1
Online ISBN: 978-3-540-68825-9
eBook Packages: Computer Science (R0)