Text segmentation is a precursor to text retrieval, automatic summarization, information retrieval (IR), language modeling (LM), and natural language processing (NLP). In written texts, text segmentation is the process of identifying the boundaries between words, phrases, or other linguistically meaningful units, such as sentences or topics. The units separated by such processing help humans read texts, but they are mainly used by computers as the fundamental units of automated processes such as NLP and IR.
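As an illustration of the definition above, the following is a minimal sketch of rule-based sentence and word segmentation for English text. It is not a method taken from this entry: the function names `split_sentences` and `split_words` and the regular expressions are assumptions chosen for illustration, and the rules only cover simple, space-delimited text.

```python
import re

# Hypothetical helper names; the rules below are illustrative assumptions,
# not a method described in this entry.

def split_sentences(text):
    """Place a sentence boundary after '.', '!' or '?' when the mark is
    followed by whitespace and an uppercase ASCII letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Return runs of word characters, keeping internal apostrophes and
    hyphens (as in "It's" or "well-known") inside a single word."""
    return re.findall(r"\w+(?:['-]\w+)*", sentence)

text = "Text segmentation finds boundaries. It is language dependent! Is it easy?"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, tokens)
```

Rules this simple break down quickly. An abbreviation such as "Dr. Smith" would be split after "Dr.", and the word rule assumes that words are delimited by spaces or punctuation, which does not hold for every writing system.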
Natural language processing (NLP) is an important research field, and one of its fundamental problems is how to segment text correctly. Various segmentation methods have emerged over the past decades for different kinds of languages and applications. Text segmentation is language dependent (each language has its own special problems, which are introduced later), corpus dependent, character-set...
- 2. Grefenstette G, Tapanainen P. What is a word, what is a sentence? Problems of tokenization. In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research; 1994. p. 7–10.
- 3. Mikheev A. Tagging sentence boundaries. In: Proceedings of the 1st Conference on North American Chapter of the Association for Computational Linguistics; 2000. p. 264–71.
- 4. Reynar JC, Marcus MP. Topic segmentation: algorithms and applications. Philadelphia: University of Pennsylvania, Ph.D. Thesis. 1998.