Text Segmentation Techniques: A Critical Review

  • Irina Pak
  • Phoey Lee Teh
Part of the Studies in Computational Intelligence book series (SCI, volume 741)


Text segmentation is a method of splitting a document into smaller parts, which are usually called segments. It is widely used in text processing. Each segment carries its own relevant meaning. Segments are categorized as words, sentences, topics, phrases or any other information unit, depending on the task of the text analysis. This study presents the various reasons for using text segmentation in different analysis approaches. We categorized the types of documents and the languages used. The main contribution of this study is a summary of 50 research papers and an overview of a decade of research (January 2007 to January 2017) that applied text segmentation as its main approach for analysing text. The results show that text segmentation is popular for analysing text in different languages. In addition, the word appears to be the most practical and usable segment, as it is a smaller unit than the phrase, sentence or line.
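As a rough illustration of the sentence and word segments described above, the sketch below splits a short English text with two naive regular-expression rules. The function names, the rules and the sample text are assumptions made for this example only; they are not a method taken from the reviewed papers, and the rules would not work for languages written without spaces, such as Chinese.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence: str) -> List[str]:
    # Assumed rule: a word is a maximal run of letters, digits or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)


# Hypothetical sample document.
document = "Text segmentation splits a document. Each segment has its own meaning!"

sentences = split_sentences(document)
words = [split_words(s) for s in sentences]

for sentence, tokens in zip(sentences, words):
    print(sentence, "->", tokens)
```

The word-level output is the finer of the two: each sentence segment is broken down further into several word segments.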



We would like to thank the First EAI International Conference on Computer Science and Engineering for the opportunity to present our paper and to extend it further. This research was partially supported by Sunway University Internal Research Grant No. INT-FST-IS-0114-07 and Sunway-Lancaster Grant SGSSL-FST-DCIS-0115-11.



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Department of Computing and Information Systems, Sunway University, Bandar Sunway, Malaysia
