
Automatic Preparation of Standard Arabic Phonetically Rich Written Corpora with Different Linguistic Units

  • Conference paper
Text, Speech, and Dialogue (TSD 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10415)

Abstract

Phonetically rich and balanced speech corpora are essential components of state-of-the-art automatic speech recognition (ASR) and text-to-speech (TTS) systems. The written form of such a corpus must be prepared carefully so that its linguistic content is both rich and balanced. Spoken and written corpora of this type are scarce for Standard Arabic (SA): the only one available was prepared manually by expert linguists and phoneticians. In this work, we address the automatic preparation of written corpora that are rich in linguistic units. Our work builds on a comprehensive statistical linguistic study of SA, based on the automatic phonetic transcription of texts totalling more than 5 million words. We prepared two written corpora. The first contains all allophones of SA, with at least 3 occurrences of each allophone and 17 occurrences of each phoneme. The second contains, in addition to all allophones, 90.72% of the diphones of SA.
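
The abstract does not spell out the selection procedure, so the following is only an illustrative sketch of one common way to build a unit-rich text set: a greedy, set-cover-style selection that repeatedly picks the sentence contributing the most still-missing unit occurrences. It is not the authors' method. The function names `greedy_select` and `diphones`, the input format (sentences already transcribed into lists of unit labels), and the per-unit minimum counts are all assumptions made for the example.

```python
from collections import Counter


def diphones(phones):
    """Return the adjacent unit pairs (diphones) of one transcribed sentence."""
    return list(zip(phones, phones[1:]))


def greedy_select(sentences, min_count):
    """Greedily pick sentences until every unit reaches its minimum count.

    sentences: list of lists of unit labels (hypothetical, pre-transcribed).
    min_count: dict mapping unit -> required number of occurrences.
    Returns (chosen sentence indices, requirements that could not be met).
    """
    remaining = dict(min_count)              # occurrences still needed per unit
    counts = [Counter(s) for s in sentences]
    available = set(range(len(sentences)))
    chosen = []
    while any(need > 0 for need in remaining.values()):
        best, best_gain = None, 0
        for i in sorted(available):
            # Gain = how many still-needed occurrences this sentence supplies.
            gain = sum(min(c, remaining.get(u, 0)) for u, c in counts[i].items())
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                            # pool cannot satisfy what is left
        chosen.append(best)
        available.remove(best)
        for u, c in counts[best].items():
            if u in remaining:
                remaining[u] = max(0, remaining[u] - c)
    unmet = {u: need for u, need in remaining.items() if need > 0}
    return chosen, unmet


# Toy usage with made-up unit labels (not real SA allophones).
pool = [["a", "b"], ["a", "a", "c"], ["b", "c", "c"], ["a"]]
selected, unmet = greedy_select(pool, {"a": 2, "b": 1, "c": 2})
```

The same routine could be run over diphones by passing `[diphones(s) for s in pool]` as the sentence list. A greedy choice does not guarantee the smallest possible corpus, but it is simple and usually yields a compact one.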



Corresponding author

Correspondence to Fadi Sindran.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sindran, F., Mualla, F., Haderlein, T., Daqrouq, K., Nöth, E. (2017). Automatic Preparation of Standard Arabic Phonetically Rich Written Corpora with Different Linguistic Units. In: Ekštein, K., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science, vol 10415. Springer, Cham. https://doi.org/10.1007/978-3-319-64206-2_23

  • DOI: https://doi.org/10.1007/978-3-319-64206-2_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64205-5

  • Online ISBN: 978-3-319-64206-2

  • eBook Packages: Computer Science (R0)
