The Hebrew CHILDES corpus: transcription and morphological analysis
We present a corpus of transcribed spoken Hebrew that reflects spoken interactions between children and adults. The corpus is an integral part of the CHILDES database, which distributes similar corpora for over 25 languages. We introduce a dedicated transcription scheme for the spoken Hebrew data that is sensitive to both the phonology and the standard orthography of the language. We also introduce a morphological analyzer that was developed specifically for this corpus. The analyzer adequately covers the entire corpus, producing detailed, correct analyses for all tokens. Evaluation on a new corpus reveals high coverage as well. Finally, we describe a morphological disambiguation module that selects the correct analysis of each token in context. The result is a high-quality, morphologically annotated CHILDES corpus of Hebrew, along with a set of tools that can be applied to new corpora.
Keywords: CHILDES; Hebrew; Transcription of spoken language; Morphological analysis; Morphological disambiguation
This research was supported by Grant No. 2007241 from the United States-Israel Binational Science Foundation (BSF). We are grateful to Hadass Zaidenberg, Maayan Bloch and Ezer Rasin for their meticulous lexicographic work, to Arnon Lazerson for developing the conversion script, and to Shai Gretz for helping with the manual annotation.