Sentence Tokenization Using Statistical Unsupervised Machine Learning and Rule-Based Approach for Running Text in Gujarati Language

Tailor, Chetana; Patel, Bankim

doi:10.1007/978-981-13-2285-3_38

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 841)

Abstract

Sentence tokenization is a foundational step in natural language processing for analyzing sentences. Among other factors, the main causes that make sentence tokenization difficult are quotation marks and the multipurpose use of punctuation marks, especially the dot “.”. In this paper, a framework is proposed for sentence tokenization of running text in the Gujarati language using a statistical unsupervised machine learning approach and a rule-based approach.
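
To make the ambiguity of the dot concrete, the following is a minimal Python sketch of a purely rule-based splitter. It is not the authors' framework: the function name `split_sentences`, the abbreviation set, and the English sample text are all assumptions chosen for illustration, and quotation-mark handling is left out entirely. A dot inside a number such as 3.50 is skipped simply because it is not followed by whitespace.

```python
import re

# Hypothetical abbreviation list (an assumption for this sketch); a real
# system would curate such a list or estimate it from a corpus.
ABBREVIATIONS = {"dr.", "rs."}

# Candidate boundary: a run of '.', '?' or '!' followed by whitespace or end of text.
BOUNDARY = re.compile(r"[.?!]+(?=\s|$)")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at candidate boundaries unless the token is a known abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # Last whitespace-delimited token ending at this punctuation mark.
        token = text[start:end].split()[-1].lower()
        if token in abbreviations:
            continue  # the dot belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    # Keep any trailing text that has no closing punctuation.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Patel paid Rs. 3.50 for tea. She left early."
    for sentence in split_sentences(sample):
        print(sentence)
```

The hand-written abbreviation set is the weak point of this sketch. The statistical unsupervised approach named in the abstract would presumably take over that role, but the abstract does not say how.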

Author information

Authors and Affiliations

Shrimad Rajachandra Institute of Management and Computer Application, Uka Tarsadia University, Bardoli, Surat, India
Chetana Tailor & Bankim Patel

Corresponding author

Correspondence to Chetana Tailor.

Editor information

Editors and Affiliations

Jaipur Engineering College and Research Centre, Jaipur, Rajasthan, India
Vijay Singh Rathore
Intelligent Systems Lab, University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Sri Aurobindo Institute of Technology, Indore, Madhya Pradesh, India
Durgesh Kumar Mishra
Sabar Institute of Technology for Girls, Ahmedabad, Gujarat, India
Amit Joshi
Department of Computer Science and Engineering, Jaipur Engineering College and Research Centre, Jaipur, Rajasthan, India
Shikha Maheshwari

About this paper

Cite this paper

Tailor, C., Patel, B. (2019). Sentence Tokenization Using Statistical Unsupervised Machine Learning and Rule-Based Approach for Running Text in Gujarati Language. In: Rathore, V., Worring, M., Mishra, D., Joshi, A., Maheshwari, S. (eds) Emerging Trends in Expert Applications and Security. Advances in Intelligent Systems and Computing, vol 841. Springer, Singapore. https://doi.org/10.1007/978-981-13-2285-3_38

DOI: https://doi.org/10.1007/978-981-13-2285-3_38
Published: 20 November 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-2284-6
Online ISBN: 978-981-13-2285-3
eBook Packages: Intelligent Technologies and Robotics (R0)
