A Crowdsource-Based Approach for Preparing Bangla POS Tagged Corpus

Ehsan, Shamim; Swarna, Sadia Tasnim; Ismail, Sabir

doi:10.1007/978-981-13-1501-5_40

Shamim Ehsan, Sadia Tasnim Swarna & Sabir Ismail

Conference paper
First Online: 02 September 2018

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 814)

Abstract

Automated part-of-speech (POS) tagging plays a vital role in natural language processing. For computational Bangla language processing, no large-scale POS-tagged corpus is available. There are two basic approaches to building such a corpus: hand-written rules or automated tagging. A rule-based corpus requires experts in Bangla linguistics and is time-consuming to build. An automated corpus requires a training corpus, which is currently not available. Crowdsourcing can play a vital role in meeting both requirements. In this paper, we propose a crowdsource-based approach to building a Bangla POS-tagged corpus. We use a standard tag set for Bangla. Raw documents are collected from various newspapers, books, and online sites. We first give contributors some examples of parts of speech and then provide them with data to tag. Finally, we analyze the resulting data; its accuracy is 95%.
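The abstract does not say how crowd answers are combined or how accuracy is measured. As a minimal sketch only, assuming majority voting over several contributors' tags and comparison against a small expert-checked reference, the analysis step could look like the following. The function names, the tag labels (`NOUN`, `VERB`, and so on), and the toy data are all hypothetical; they are not the paper's tag set or results.

```python
from collections import Counter


def majority_tag(votes):
    """Return the most frequent tag among the votes and its share of the votes."""
    counts = Counter(votes)
    tag, n = counts.most_common(1)[0]
    return tag, n / len(votes)


def aggregate(crowd_votes):
    """Map each token to its majority tag (assumed aggregation rule)."""
    return {token: majority_tag(votes)[0] for token, votes in crowd_votes.items()}


def accuracy(predicted, gold):
    """Fraction of reference tokens whose predicted tag matches the reference."""
    matches = sum(1 for token, tag in gold.items() if predicted.get(token) == tag)
    return matches / len(gold)


# Hypothetical toy data: three contributors tag each of four tokens.
crowd_votes = {
    "আমি": ["PRON", "PRON", "NOUN"],
    "ভাত": ["NOUN", "NOUN", "NOUN"],
    "খাই": ["VERB", "NOUN", "VERB"],
    "ভালো": ["NOUN", "NOUN", "ADJ"],
}
# Hypothetical expert-checked reference tags for the same tokens.
gold = {"আমি": "PRON", "ভাত": "NOUN", "খাই": "VERB", "ভালো": "ADJ"}

predicted = aggregate(crowd_votes)
print(accuracy(predicted, gold))  # 3 of 4 tokens match the reference: 0.75
```

Keying votes by token string keeps the sketch short; a real corpus would more likely key them by sentence and position, since the same word can take different tags in different contexts.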

Notes

1. https://translate.google.com/
2. Pipilika is the first Bangla search engine developed by the students of Shahjalal University of Science and Technology.

Author information

Authors and Affiliations

Shahjalal University of Science and Technology, Sylhet, 3114, Bangladesh
Shamim Ehsan, Sadia Tasnim Swarna & Sabir Ismail

Corresponding authors

Correspondence to Shamim Ehsan or Sadia Tasnim Swarna.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs, Auburn, WA, USA
Ajith Abraham
Department of Computer and Systems Sciences, Visva-Bharati University, Santiniketan, West Bengal, India
Paramartha Dutta
Department of Computer Science and Engineering, University of Kalyani, Kalyani, India
Jyotsna Kumar Mandal
Institute of Engineering and Management, Kolkata, West Bengal, India
Abhishek Bhattacharya
Institute of Engineering and Management, Kolkata, West Bengal, India
Soumi Dutta

About this paper

Cite this paper

Ehsan, S., Swarna, S.T., Ismail, S. (2019). A Crowdsource-Based Approach for Preparing Bangla POS Tagged Corpus. In: Abraham, A., Dutta, P., Mandal, J., Bhattacharya, A., Dutta, S. (eds) Emerging Technologies in Data Mining and Information Security. Advances in Intelligent Systems and Computing, vol 814. Springer, Singapore. https://doi.org/10.1007/978-981-13-1501-5_40

DOI: https://doi.org/10.1007/978-981-13-1501-5_40
Published: 02 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1500-8
Online ISBN: 978-981-13-1501-5
eBook Packages: Intelligent Technologies and Robotics (R0)