Automatic Preparation of Standard Arabic Phonetically Rich Written Corpora with Different Linguistic Units

  • Fadi Sindran
  • Firas Mualla
  • Tino Haderlein
  • Khaled Daqrouq
  • Elmar Nöth
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10415)

Abstract

Phonetically rich and balanced speech corpora are essential components of state-of-the-art automatic speech recognition (ASR) and text-to-speech (TTS) systems. The written form of a speech corpus must be prepared carefully so that it represents the richness and balance of the linguistic content. Such spoken and written corpora are scarce for Standard Arabic (SA); the only one available was prepared manually by expert linguists and phoneticians. In this work, we address the task of automatically preparing written corpora with rich linguistic units. Our work builds on a comprehensive statistical linguistic study of SA based on the automatic phonetic transcription of texts totaling more than 5 million words. We prepared two written corpora. The first contains all allophones in SA, with at least 3 occurrences of each allophone and 17 occurrences of each phoneme. The second contains, in addition to all allophones, 90.72% of the diphones in SA.
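
The abstract does not say how sentences are selected, so the following is only an illustrative sketch of one common way to build a unit-rich written corpus: a greedy, set-cover-style selection that repeatedly picks the sentence contributing the most still-missing occurrences. It is not presented as the authors' procedure. The function name `select_sentences`, the `min_count` parameter, and the toy unit symbols are all assumptions made for this example; in practice the units would come from a phonetic transcription of each sentence.

```python
from collections import Counter


def select_sentences(sentences, min_count=3):
    """Greedily choose sentence ids so each unit occurs at least min_count times.

    sentences: dict mapping a sentence id to its list of linguistic units
    (hypothetical input, e.g. allophone symbols from a transcription).
    A unit that occurs fewer than min_count times in the whole pool can
    only be covered as often as it is available.
    """
    # Cap each unit's target by how often it occurs in the pool at all.
    totals = Counter()
    for units in sentences.values():
        totals.update(units)
    need = {unit: min(min_count, n) for unit, n in totals.items()}

    remaining = dict(sentences)
    chosen = []

    def gain(sentence_id):
        # Number of still-missing occurrences this sentence would supply.
        counts = Counter(remaining[sentence_id])
        return sum(min(c, need[unit]) for unit, c in counts.items())

    while remaining and any(v > 0 for v in need.values()):
        best = max(remaining, key=gain)
        if gain(best) == 0:  # safeguard: nothing left can help
            break
        for unit, c in Counter(remaining.pop(best)).items():
            need[unit] = max(0, need[unit] - c)
        chosen.append(best)
    return chosen


# Toy pool with made-up unit symbols (assumption, not data from the paper).
pool = {
    "s1": ["a", "b", "a"],
    "s2": ["b", "c"],
    "s3": ["a", "b", "c", "c"],
    "s4": ["a"],
}
picked = select_sentences(pool, min_count=2)
```

The same sketch would apply to diphones by passing pairs of adjacent units instead of single units. A greedy choice does not guarantee the smallest possible corpus, but it is simple and tends to keep the selection compact.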

Keywords

Phonetically rich SA written corpora; Linguistic content; Allophones; Diphones

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Fadi Sindran (1), corresponding author
  • Firas Mualla (1)
  • Tino Haderlein (1)
  • Khaled Daqrouq (2)
  • Elmar Nöth (1)

  1. Lehrstuhl für Informatik 5 (Mustererkennung), Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany
  2. Department of Electrical and Computer Engineering, King Abdulaziz University, Jeddah, Saudi Arabia
