Abstract
Automatic linear text segmentation, which aims to detect the best topic boundaries, is a difficult and very useful task in many text processing systems. Several methods have addressed this problem with reasonable results, but they also have drawbacks. In this work, we propose a new method, called ClustSeg, which uses a predefined window and a clustering algorithm to decide topic cohesion. We compare our proposal against the best-known methods, and it performs better than these algorithms.
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pérez, R.A., Medina Pagola, J.E. (2010). Text Segmentation by Clustering Cohesion. In: Bloch, I., Cesar, R.M. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2010. Lecture Notes in Computer Science, vol 6419. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16687-7_37
DOI: https://doi.org/10.1007/978-3-642-16687-7_37
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16686-0
Online ISBN: 978-3-642-16687-7