Encyclopedia of Database Systems

2018 Edition
Editors: Ling Liu, M. Tamer Özsu

Text Segmentation

  • Haoda Huang
  • Benyu Zhang
Reference work entry
DOI: https://doi.org/10.1007/978-1-4614-8265-9_421


Document segmentation


Text segmentation is a precursor to text retrieval, automatic summarization, information retrieval (IR), language modeling (LM), and natural language processing (NLP). In written texts, text segmentation is the process of identifying the boundaries between words, phrases, or other linguistically meaningful units, such as sentences or topics. The units obtained from this processing can help humans read texts, but they are mainly used by computers as the fundamental units of automated processing, such as NLP and IR.
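As a rough illustration of the definition above, the following Python sketch segments English text first into sentences and then into word-level tokens. The function names `split_sentences` and `split_words` and the regular expressions are assumptions made for this example, not a method taken from the entry. The rules are deliberately naive: a sentence ends at `.`, `!`, or `?` followed by whitespace, and a word is a run of letters or digits, optionally with internal apostrophes.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Naive word segmentation: word-character runs (allowing internal
    apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


text = "Text segmentation finds boundaries. It isn't always easy!"
for sentence in split_sentences(text):
    print(split_words(sentence))
```

A sketch like this fails on abbreviations such as "Dr." and does not apply to languages written without spaces between words, which is one reason segmentation is language dependent.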

Historical Background

Natural language processing (NLP) is an important research field, and one of its primary problems is how to segment text correctly. Various segmentation methods have emerged over the past decades for different languages and applications. Text segmentation is language dependent (each language has its own special problems, which are introduced later), corpus dependent, character-set...


Recommended Reading

  1. Beeferman D, Berger A, Lafferty J. Statistical models for text segmentation. Mach Learn. 1999;34(1–3):177–210.
  2. Grefenstette G, Tapanainen P. What is a word, what is a sentence? Problems of tokenization. In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research; 1994. p. 7–10.
  3. Mikheev A. Tagging sentence boundaries. In: Proceedings of the 1st Conference on North American Chapter of the Association for Computational Linguistics; 2000. p. 264–71.
  4. Reynar JC, Marcus MP. Topic segmentation: algorithms and applications. Philadelphia: University of Pennsylvania, Ph.D. Thesis. 1998.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Microsoft Research Asia, Beijing, China

Section editors and affiliations

  • Zheng Chen (1)
  1. Microsoft Research Asia, Microsoft Corporation, Beijing, China