Improved N-grams Approach for Web Page Language Identification

  • Ali Selamat
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6910)


Language identification is widely used in machine translation and information retrieval. In this paper, an improved N-grams (ING) approach is proposed for web page language identification. The ING approach combines the original N-grams (ONG) approach and a modified N-grams (MNG) approach, both of which have been used for language identification of web documents. The features selected by the ING approach are based on N-gram frequency and N-gram position. The features selected by the ONG approach are based on a distance measurement, and those selected by the MNG approach are based on a Boolean matching rate, for language identification of web pages in Roman and Arabic scripts. A large real-world document collection from the British Broadcasting Corporation (BBC) website, composed of 1000 documents in each of the languages (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, Pashto), has been used for evaluation. Precision, recall and F1 measures have been used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of Roman and Arabic script web page documents in the available datasets.
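
The two baseline scores named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define either measure, so the sketch assumes that the ONG distance resembles the rank-based "out-of-place" distance of Cavnar and Trenkle, and that the MNG Boolean matching rate is the fraction of a document's top N-grams that also occur in a language profile. The function names and the parameters `n_max` and `top_k` are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank.

    Assumption: character n-grams of length 1..n_max on lowercased,
    whitespace-normalised text; the paper's preprocessing may differ.
    """
    text = " ".join(text.lower().split())
    counts = Counter(
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    # Most frequent first; ties broken alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place_distance(doc_profile, lang_profile):
    """Rank-based distance (assumed stand-in for the ONG measure).

    N-grams missing from the language profile receive a fixed maximum
    penalty equal to the size of that profile. Lower means closer.
    """
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Share of document n-grams present in the language profile
    (assumed stand-in for the MNG measure). Higher means closer.
    """
    if not doc_profile:
        return 0.0
    hits = sum(1 for gram in doc_profile if gram in lang_profile)
    return hits / len(doc_profile)


def identify(text, lang_profiles, use_distance=True):
    """Return the best-matching language label under one of the two scores."""
    doc = ngram_profile(text)
    if use_distance:
        return min(lang_profiles,
                   key=lambda lang: out_of_place_distance(doc, lang_profiles[lang]))
    return max(lang_profiles,
               key=lambda lang: boolean_matching_rate(doc, lang_profiles[lang]))
```

Here `lang_profiles` would be a dictionary from language label to a profile built with `ngram_profile` over training text for that language. The first score uses the position (rank) of each N-gram, while the second only checks presence, which mirrors the contrast the abstract draws between the two baseline approaches.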


Keywords: monolingual, multilingual, web page language identification, N-grams approach





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Ali Selamat (1)
  1. Software Engineering Research Group, Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Johor, Malaysia
