Text Segmentation by Clustering Cohesion

  • Raúl Abella Pérez
  • José Eladio Medina Pagola
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6419)


Automatic linear text segmentation, which aims to detect the best topic boundaries, is a difficult but very useful task in many text processing systems. Several methods have addressed this problem with reasonable results, but they also have drawbacks. In this work, we propose a new method, called ClustSeg, that uses a predefined window and a clustering algorithm to decide topic cohesion. We compare our proposal against the best-known methods, and it achieves better performance than these algorithms.



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Raúl Abella Pérez (1)
  • José Eladio Medina Pagola (1)

  1. Advanced Technologies Application Centre (CENATAV), Playa, Cuba
