A Statistical Model for Topic Segmentation and Clustering

Shafiei, M. Mahdi; Milios, Evangelos E.

doi:10.1007/978-3-540-68825-9_27


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5032)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence


Abstract

This paper presents a statistical model for discovering topical clusters of words in unstructured text. The model has a hierarchical Bayesian structure and also identifies topically coherent segments of text. It assigns each segment to a particular topic and thereby categorizes the corresponding document under potentially multiple topics. We present initial results indicating that the word topics discovered by the proposed model are more consistent than those of other models. Our early experiments on a real text corpus show that the model's clustering performance compares well with that of other clustering models, which do not provide topic segmentation. Its segmentation performance is also comparable to that of a recently proposed segmentation model, which does not provide document clustering.
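
The abstract's central idea, that each segment carries one topic and a document inherits the topics of its segments, can be illustrated with a toy sketch. This is not the authors' model: the abstract gives no generative details, so the sampling scheme, the function names `sample_document` and `document_topics`, the vocabulary, and the fixed topic weights are all assumptions made for illustration only.

```python
import random

def sample_document(topic_words, doc_topic_weights, n_segments, words_per_segment, rng):
    """Toy generative sketch (an assumption, not the paper's model):
    draw one topic per segment, then draw that segment's words from the topic."""
    topics = list(topic_words)
    weights = [doc_topic_weights[t] for t in topics]
    segments = []
    for _ in range(n_segments):
        # Each segment is assigned a single topic.
        topic = rng.choices(topics, weights=weights, k=1)[0]
        # Words in the segment are sampled uniformly from that topic's word list.
        words = rng.choices(topic_words[topic], k=words_per_segment)
        segments.append((topic, words))
    return segments

def document_topics(segments):
    """A document is categorized under every topic assigned to one of its segments."""
    return sorted({topic for topic, _ in segments})

# Hypothetical two-topic vocabulary used only for this illustration.
TOPIC_WORDS = {
    "sports": ["game", "team", "score", "coach"],
    "finance": ["market", "stock", "bank", "price"],
}

rng = random.Random(0)
segments = sample_document(
    TOPIC_WORDS, {"sports": 0.5, "finance": 0.5},
    n_segments=4, words_per_segment=5, rng=rng,
)
for topic, words in segments:
    print(topic, "|", " ".join(words))
print("document topics:", document_topics(segments))
```

A full hierarchical Bayesian treatment would place priors on the topic weights and word distributions and infer segment boundaries from text; this sketch fixes all of those by hand to show only the segment-to-document relationship.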



Author information

Authors and Affiliations

Faculty of Computer Science, Dalhousie University,
M. Mahdi Shafiei & Evangelos E. Milios


Editor information

Sabine Bergler


About this paper

Cite this paper

Shafiei, M.M., Milios, E.E. (2008). A Statistical Model for Topic Segmentation and Clustering. In: Bergler, S. (ed.) Advances in Artificial Intelligence. Canadian AI 2008. Lecture Notes in Computer Science, vol 5032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68825-9_27


DOI: https://doi.org/10.1007/978-3-540-68825-9_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68821-1
Online ISBN: 978-3-540-68825-9
eBook Packages: Computer Science, Computer Science (R0)
