Building Indonesian Local Language Detection Tools Using Wikipedia Data

Martadinata, Puji; Trisedya, Bayu Distiawan; Manurung, Hisar Maruli; Adriani, Mirna

doi:10.1007/978-3-319-31468-6_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9442)

Included in the following conference series:

International Workshop on Worldwide Language Service Infrastructure

Abstract

The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments to compare three popular methods for building language identification tools: N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and social media data that we constructed for testing, show that the statistical method obtains the best result among the methods compared.
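To make the first of the three compared methods concrete, the following is a minimal sketch of character N-gram language identification using rank-ordered profiles and an out-of-place distance, one common formulation of the N-gram approach (Cavnar and Trenkle, 1994). It is not the authors' implementation. The function names, the parameters `n_max` and `top_k`, and the two toy training sentences are assumptions introduced only for illustration; a real system would build each profile from a much larger corpus.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max),
    ordered from most to least frequent."""
    # Normalise case and whitespace, then pad so word boundaries count.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles; n-grams missing from
    the language profile receive a fixed maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the label whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training text (illustrative only, not the data used in the paper).
samples = {
    "indonesian": "saya sedang makan nasi di rumah bersama keluarga",
    "javanese": "aku lagi mangan sega ing omah karo kulawarga",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

if __name__ == "__main__":
    print(identify("saya makan nasi di rumah", profiles))
```

With profiles this small the ranking is dominated by noise, so the sketch only shows the mechanics: build one ranked profile per language, build a profile for the input, and pick the language with the smallest distance.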

Notes

1. www.wikipedia.com.
2. twitter.com.
3. http://odur.let.rug.nl/~vannoord/TextCat/.
4. http://search.cpan.org/~mpiotr/Lingua-Ident-1.7/Ident.pm.
5. http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm.

Author information

Authors and Affiliations

Faculty of Computer Science, Universitas Indonesia, Depok, Indonesia
Puji Martadinata, Bayu Distiawan Trisedya, Hisar Maruli Manurung & Mirna Adriani

Corresponding author

Correspondence to Puji Martadinata.

Editor information

Editors and Affiliations

Unit of Design, Kyoto University, Kyoto, Japan
Yohei Murakami
Kyoto University, Kyoto, Japan
Donghui Lin

About this paper

Cite this paper

Martadinata, P., Trisedya, B.D., Manurung, H.M., Adriani, M. (2016). Building Indonesian Local Language Detection Tools Using Wikipedia Data. In: Murakami, Y., Lin, D. (eds) Worldwide Language Service Infrastructure. WLSI 2015. Lecture Notes in Computer Science, vol 9442. Springer, Cham. https://doi.org/10.1007/978-3-319-31468-6_8

DOI: https://doi.org/10.1007/978-3-319-31468-6_8
Published: 13 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31467-9
Online ISBN: 978-3-319-31468-6
eBook Packages: Computer Science, Computer Science (R0)
