Studying the history of the Arabic language: language technology and a large-scale historical corpus

  • Yonatan Belinkov (email author)
  • Alexander Magidow
  • Alberto Barrón-Cedeño
  • Avi Shmidman
  • Maxim Romanov
Original Paper

Abstract

Arabic is a widely spoken language with a long and rich history, but existing corpora and language technology focus mostly on modern Arabic and its varieties. As a result, the study of the language's history has so far been mostly limited to small-scale manual analyses. In this work, we present a large-scale historical corpus of the written Arabic language, spanning 1400 years. We describe our efforts to clean and process this corpus using Arabic NLP tools, including the identification of reused text. We study the history of the Arabic language using a novel automatic periodization algorithm, as well as other techniques. Our findings confirm the established division of written Arabic into Modern Standard and Classical Arabic, as well as other established periodizations, while suggesting that written Arabic may be divisible into still further periods of development.

Keywords

Arabic, Corpus, Periodization, Text reuse, Historical linguistics

Notes

Acknowledgements

This research was partly supported by the HBKU Qatar Computing Research Institute (QCRI), as part of a collaboration with the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). Y.B. was also supported by the Harvard Mind, Brain, Behavior Initiative. This research was also partly supported by the Israel Science Foundation (Grant No. 977/16), and by DICTA: The Israel Center for Text Analysis.

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  • Yonatan Belinkov (1), email author
  • Alexander Magidow (2)
  • Alberto Barrón-Cedeño (3)
  • Avi Shmidman (4, 5)
  • Maxim Romanov (6)

  1. MIT Computer Science and Artificial Intelligence Laboratory and Harvard John A. Paulson School of Engineering and Applied Sciences, Cambridge, USA
  2. Department of Modern and Classical Languages and Literatures, University of Rhode Island, Kingston, USA
  3. Qatar Computing Research Institute, HBKU, Doha, Qatar
  4. Department of Hebrew Literature, Bar-Ilan University, Ramat Gan, Israel
  5. Dicta: The Israel Center for Text Analysis, Jerusalem, Israel
  6. Department of History, University of Vienna, Vienna, Austria
