Building Indonesian Local Language Detection Tools Using Wikipedia Data

  • Conference paper
Worldwide Language Service Infrastructure (WLSI 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9442)

Abstract

The widespread use of social media has generated considerable research interest in information retrieval, natural language processing, and machine learning. The diversity of languages used on social media creates a need for accurate automated language identification tools. In this research, we develop a language identification tool that automatically identifies social media posts in Indonesian, Javanese, Sundanese, and Minangkabau. The latter three are among the most widely spoken regional languages in Indonesia. We conducted experiments to compare three popular methods for building language identification tools: N-grams, statistical models, and the Small Words technique. Our experiments, which used articles from the internet for training and social media data that we constructed for testing, show that the statistical method obtains the best results among the three methods.
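
To make the N-gram approach named above concrete, the following is a minimal Python sketch of one common formulation: character N-gram profiles compared by a rank-order ("out-of-place") distance. The abstract does not state which variant the authors implemented, so the function names (`ngram_profile`, `out_of_place`, `identify`), the parameters `n_max` and `top_k`, and the two toy training strings are all assumptions made for illustration. They stand in for real training corpora and are not data from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    # Normalise case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:top_k]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles):
    """Return the label whose profile is closest to the profile of `text`."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )


# Hypothetical toy strings for illustration only; a real system would build
# each profile from a large training corpus per language.
TOY_TRAINING = {
    "id": "saya sedang makan nasi di rumah",
    "jv": "aku lagi mangan sega ing omah",
}
profiles = {lang: ngram_profile(text) for lang, text in TOY_TRAINING.items()}
```

Calling `identify(post, profiles)` on a new post returns the label of the nearest profile. With profiles this small the result is not meaningful for real text; the sketch only shows the mechanics of building and comparing rank-ordered profiles.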

Notes

  1. www.wikipedia.com.

  2. twitter.com.

  3. http://odur.let.rug.nl/~vannoord/TextCat/.

  4. http://search.cpan.org/~mpiotr/Lingua-Ident-1.7/Ident.pm.

  5. http://search.cpan.org/~ambs/Lingua-Identify-0.56/lib/Lingua/Identify.pm.

Author information

Correspondence to Puji Martadinata.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Martadinata, P., Trisedya, B.D., Manurung, H.M., Adriani, M. (2016). Building Indonesian Local Language Detection Tools Using Wikipedia Data. In: Murakami, Y., Lin, D. (eds) Worldwide Language Service Infrastructure. WLSI 2015. Lecture Notes in Computer Science, vol 9442. Springer, Cham. https://doi.org/10.1007/978-3-319-31468-6_8

  • DOI: https://doi.org/10.1007/978-3-319-31468-6_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31467-9

  • Online ISBN: 978-3-319-31468-6

  • eBook Packages: Computer Science; Computer Science (R0)