Abstract
Sentence tokenization is the foundational step in natural language processing for analyzing sentences. Among other causes, the main ones that make sentence tokenization difficult are quotation marks and the multipurpose use of punctuation marks, especially the dot “.”. In this paper, a framework is proposed for sentence tokenization of running text in the Gujarati language using a statistical unsupervised machine learning approach and a rule-based approach.
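To make the dot ambiguity concrete, the following is a minimal rule-based sketch in Python. It is not the framework proposed in the paper: the function name `split_sentences`, the abbreviation list, and the English sample text are all assumptions chosen for illustration. It shows only how a dot can be rejected as a sentence boundary when it follows a known abbreviation or sits inside a number. Quotation marks are not handled.

```python
import re

# Hypothetical abbreviation list (assumption); a real system would
# curate such a list or learn it from a corpus.
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "e.g", "i.e"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at '.', '?' or '!' unless a simple rule rejects the boundary."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.?!]+", text):
        end = match.end()
        # Rule 1: a boundary must be followed by whitespace or end of text,
        # which rejects dots inside numbers such as "3.5".
        if end < len(text) and not text[end].isspace():
            continue
        # Rule 2: a single dot after a known abbreviation is not a boundary.
        preceding = text[start:match.start()].split()
        last_token = preceding[-1].lower() if preceding else ""
        if match.group() == "." and last_token in abbreviations:
            continue
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    # Keep any trailing text that has no terminator.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Patel paid 3.5 rupees. Did it work? Yes."
    for sentence in split_sentences(sample):
        print(sentence)
```

A fixed abbreviation list is the weak point of a purely rule-based splitter, which is one motivation for pairing rules with statistics gathered from unlabeled text.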
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tailor, C., Patel, B. (2019). Sentence Tokenization Using Statistical Unsupervised Machine Learning and Rule-Based Approach for Running Text in Gujarati Language. In: Rathore, V., Worring, M., Mishra, D., Joshi, A., Maheshwari, S. (eds) Emerging Trends in Expert Applications and Security. Advances in Intelligent Systems and Computing, vol 841. Springer, Singapore. https://doi.org/10.1007/978-981-13-2285-3_38
DOI: https://doi.org/10.1007/978-981-13-2285-3_38
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2284-6
Online ISBN: 978-981-13-2285-3
eBook Packages: Intelligent Technologies and Robotics (R0)