Automatic Machine Translation for Arabic Tweets

Mallek, Fatma; Tan Le, Ngoc; Sadat, Fatiha

doi:10.1007/978-3-319-67056-0_6

Part of the book series: Studies in Computational Intelligence (SCI, volume 740)

Abstract

Twitter is a continuous and unlimited source of natural-language data. This data is unstructured, highly noisy, and short, which makes it difficult to handle with traditional approaches to automatic Natural Language Processing (NLP). The current research focuses on the implementation of a phrase-based statistical machine translation system for tweets, from a complex and morphologically rich language, Arabic, into English. The first challenge is pre-processing the highly noisy data collected from Twitter, for both the source and target languages. Special attention is given to the pre-processing of Arabic tweets. The second challenge is the lack of parallel corpora for Arabic-English tweets. Thus, an out-of-domain corpus was incorporated for training the translation model, and an adaptation strategy using a bigger language model for English tweets was applied in the training step. Our evaluations confirm that pre-processing tweets in the source and target languages improves the performance of the statistical machine translation system. In addition, using in-domain data for the language model and the tuning set yielded better performance when translating Arabic tweets into English. An improvement of 4 BLEU points was achieved.
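
The abstract does not list the individual pre-processing steps. Purely as an illustration, the kind of tweet normalization commonly applied before training a statistical machine translation system could be sketched as below. The function name, the placeholder tokens, and the particular set of rules are assumptions made for this sketch and should not be read as the authors' pipeline.

```python
import re

# Hypothetical placeholder tokens (assumed for this sketch, not taken from the chapter).
URL_TOKEN = "<url>"
USER_TOKEN = "<user>"

_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_MENTION_RE = re.compile(r"@\w+")
_HASHTAG_RE = re.compile(r"#(\w+)")
# Arabic short-vowel marks (U+064B..U+0652) and superscript alef (U+0670).
_DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
# Tatweel (kashida), a purely decorative elongation character.
_TATWEEL_RE = re.compile(r"\u0640")
# Any character repeated three or more times in a row.
_REPEAT_RE = re.compile(r"(.)\1{2,}")
_WS_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Apply simple, language-agnostic and Arabic-specific cleanup to one tweet."""
    text = _URL_RE.sub(URL_TOKEN, text)          # replace links with a placeholder
    text = _MENTION_RE.sub(USER_TOKEN, text)     # replace @mentions with a placeholder
    # Keep hashtag words but drop '#' and split on underscores.
    text = _HASHTAG_RE.sub(lambda m: m.group(1).replace("_", " "), text)
    text = _DIACRITICS_RE.sub("", text)          # strip Arabic diacritics
    text = _TATWEEL_RE.sub("", text)             # strip tatweel
    text = _REPEAT_RE.sub(r"\1\1", text)         # "sooooo" -> "soo"
    return _WS_RE.sub(" ", text).strip()         # collapse whitespace


if __name__ == "__main__":
    print(normalize_tweet("RT @user1: sooooo cool!!! http://t.co/abc #machine_translation"))
```

The same function can be run over both sides of a corpus before tokenization. Morphological segmentation of Arabic, which the abstract hints at with "morphologically rich", is a separate step that this sketch does not attempt.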

Notes

1. https://about.twitter.com/fr/company
2. https://support.twitter.com/articles/20172133
3. http://www.isi.edu/licensed-sw/pharaoh/
4. http://www.statmt.org/moses
5. https://dev.twitter.com/streaming/public
6. http://twitter4j.org/en/index.html
7. The dictionary contains 44983 pairs of words and is available online at: http://noisy-text.github.io/2015/norm-shared-task.html
8. The UN parallel corpus is available at: http://www.un.org/en/documents/ods/
9. http://www.statmt.org/moses/?n=FactoredTraining.PrepareTraining

Author information

Authors and Affiliations

Université du Québec à Montréal, Montreal, Canada
Fatma Mallek, Ngoc Tan Le & Fatiha Sadat

Corresponding author

Correspondence to Fatma Mallek.

Editor information

Editors and Affiliations

The British University in Dubai, Dubai, United Arab Emirates
Khaled Shaalan
Faculty of Computers and Information Technology, Cairo University, Giza, Egypt
Aboul Ella Hassanien
Faculty of Computers and Information, Ain Shams University, Cairo, Egypt
Fahmy Tolba

About this chapter

Cite this chapter

Mallek, F., Tan Le, N., Sadat, F. (2018). Automatic Machine Translation for Arabic Tweets. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_6

DOI: https://doi.org/10.1007/978-3-319-67056-0_6
Published: 18 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67055-3
Online ISBN: 978-3-319-67056-0
eBook Packages: Engineering, Engineering (R0)
