Automatic Machine Translation for Arabic Tweets

Intelligent Natural Language Processing: Trends and Applications

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

Twitter is a continuous and unlimited source of natural language data that is particularly unstructured, highly noisy and short, which makes it difficult to handle with traditional approaches to automatic Natural Language Processing (NLP). The current research focuses on the implementation of a phrase-based statistical machine translation system for tweets, from a complex and morphologically rich language, Arabic, into English. The first challenge is preprocessing the highly noisy data collected from Twitter, for both the source and target languages. Special attention is given to the pre-processing of Arabic tweets. The second challenge is the lack of parallel corpora for Arabic-English tweets. Thus, an out-of-domain corpus was incorporated for training the translation model, and a larger language model adapted to English tweets was used in the training step. Our evaluations confirm that pre-processing tweets in the source and target languages improves the performance of the statistical machine translation system. In addition, using in-domain data for the language model and the tuning set improved the performance of the statistical machine translation system from Arabic to English tweets. An improvement of 4 BLEU points was achieved.
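To make the pre-processing step more concrete, the following is a minimal sketch of the kind of tweet normalization the abstract alludes to. It is not the authors' pipeline: the function name `normalize_tweet`, the placeholder tokens `<url>` and `<user>`, and the specific rules (hashtag splitting on underscores, stripping Arabic diacritics and tatweel, collapsing elongated characters) are all assumptions chosen for illustration.

```python
import re

# Hypothetical rules for illustration; the chapter does not specify these.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")
# Arabic short-vowel diacritics (U+064B..U+0652) and the tatweel (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")
# Any character repeated three or more times in a row.
ELONGATION_RE = re.compile(r"(.)\1{2,}")


def normalize_tweet(text: str) -> str:
    """Apply simple, language-agnostic cleanup rules to one tweet."""
    # Replace URLs and user mentions with placeholder tokens (assumed names).
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    # Keep the hashtag words, dropping '#' and splitting on underscores.
    text = HASHTAG_RE.sub(lambda m: m.group(1).replace("_", " "), text)
    # Remove Arabic diacritics and tatweel so surface forms are unified.
    text = DIACRITICS_RE.sub("", text)
    # Collapse elongations such as "sooooo" to two repeated characters.
    text = ELONGATION_RE.sub(r"\1\1", text)
    # Normalize whitespace.
    return " ".join(text.split())
```

In a setup like the one described, a function of this kind would be applied to both sides of the data before training and tuning, so that the translation and language models see the same normalized forms at test time.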

Notes

  1. https://about.twitter.com/fr/company
  2. https://support.twitter.com/articles/20172133
  3. http://www.isi.edu/licensed-sw/pharaoh/
  4. http://www.statmt.org/moses
  5. https://dev.twitter.com/streaming/public
  6. http://twitter4j.org/en/index.html
  7. The dictionary contains 44983 pairs of words, available on-line at: http://noisy-text.github.io/2015/norm-shared-task.html
  8. The UN parallel corpus is available at: http://www.un.org/en/documents/ods/
  9. http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining

Corresponding author

Correspondence to Fatma Mallek.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Mallek, F., Tan Le, N., Sadat, F. (2018). Automatic Machine Translation for Arabic Tweets. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_6

  • DOI: https://doi.org/10.1007/978-3-319-67056-0_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67055-3

  • Online ISBN: 978-3-319-67056-0

  • eBook Packages: Engineering, Engineering (R0)
