Abstract
The problem of automatic text segmentation comprises two distinct problems: thematic segmentation into rather large, topically self-contained sections, and splitting into paragraphs, i.e., lower-level lexico-grammatical segmentation. In this paper we consider the latter problem. We propose a method for splitting text into paragraphs in a reasonable way, based on a text cohesion measure. Specifically, we propose a method for quantitatively evaluating text cohesion using a large linguistic resource, a collocation network. At each step, our algorithm compares word occurrences in the text against a large database of collocations and semantic links between words of the given natural language. The procedure consists of evaluating the cohesion function, smoothing and normalizing it, and comparing the result with a specially constructed threshold.
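The pipeline named in the abstract (cohesion function, smoothing, normalization, threshold comparison) can be illustrated with a minimal Python sketch. The abstract does not specify any of these components, so everything below is an assumption made for illustration: the toy `links` dictionary stands in for the collocation network, cohesion is taken as the sum of link weights across a sentence boundary, smoothing is a plain moving average, normalization divides by the maximum, and the threshold is a fixed constant rather than the specially constructed one the paper refers to. All function and parameter names are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical stand-in for a collocation network:
# unordered word pairs mapped to a link weight.
Links = Dict[FrozenSet[str], float]


def cohesion(sentences: List[List[str]], links: Links) -> List[float]:
    """Raw cohesion at each boundary between sentence i and i+1.

    Assumption: cohesion is the sum of link weights over all word
    pairs that span the boundary.
    """
    values = []
    for left, right in zip(sentences, sentences[1:]):
        total = sum(
            links.get(frozenset((a, b)), 0.0)
            for a in set(left)
            for b in set(right)
        )
        values.append(total)
    return values


def smooth(values: List[float], radius: int = 1) -> List[float]:
    """Moving average; the window is clipped at both ends of the list."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def normalize(values: List[float]) -> List[float]:
    """Scale so the largest value becomes 1.0 (all zeros stay zeros)."""
    peak = max(values, default=0.0)
    return [v / peak if peak > 0 else 0.0 for v in values]


def paragraph_breaks(
    sentences: List[List[str]],
    links: Links,
    radius: int = 1,
    threshold: float = 0.5,
) -> List[int]:
    """Indices of sentences that start a new paragraph.

    Assumption: a break is placed wherever the smoothed, normalized
    cohesion falls below a fixed threshold.
    """
    scores = normalize(smooth(cohesion(sentences, links), radius))
    return [i + 1 for i, s in enumerate(scores) if s < threshold]


# Toy data, invented for this example only.
toy_links: Links = {
    frozenset(("cat", "milk")): 1.0,
    frozenset(("cat", "purr")): 1.0,
    frozenset(("stock", "market")): 1.0,
    frozenset(("market", "crash")): 1.0,
}
toy_text = [
    ["cat", "purr"],
    ["cat", "milk"],
    ["stock", "market"],
    ["market", "crash"],
]
breaks = paragraph_breaks(toy_text, toy_links, radius=0, threshold=0.5)
```

With smoothing disabled (`radius=0`), the only low-cohesion boundary in the toy text lies between the second and third sentences. A wider window can fill in such a short dip, which is one reason the choice of smoothing and threshold matters in a procedure of this kind.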
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bolshakov, I.A., Gelbukh, A. (2001). Text Segmentation into Paragraphs Based on Local Text Cohesion. In: Matoušek, V., Mautner, P., Mouček, R., Taušer, K. (eds) Text, Speech and Dialogue. TSD 2001. Lecture Notes in Computer Science, vol 2166. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44805-5_20
DOI: https://doi.org/10.1007/3-540-44805-5_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42557-1
Online ISBN: 978-3-540-44805-1
eBook Packages: Springer Book Archive