Abstract
NLP research on low-resource African languages is often impeded by the unavailability of basic resources: tools, techniques, annotated corpora, and datasets. Besides the lack of funding for the manual development of these resources, building them from scratch would amount to reinventing the wheel. Therefore, adapting existing techniques and models from well-resourced languages is often an attractive option. Word embeddings are among the most widely applied NLP models. Embedding models often require large amounts of training data, which are not available for most African languages. In this work, we adopt an alignment-based projection method to transfer trained English embeddings to the Igbo language. Various English embedding models were projected and evaluated intrinsically on the odd-word, analogy, and word-similarity tasks, and also on the diacritic restoration task. Our results show that the projected embeddings performed very well across these tasks.
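The abstract names an alignment-based projection method without spelling it out. The following is a minimal sketch of one common form of such a projection, in which each target-language word receives the alignment-count-weighted average of the vectors of the English words it aligns to (in the style of Guo et al.). It assumes that word-alignment counts from a parallel corpus are already available. The function name `project_embeddings`, the data layout, and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def project_embeddings(src_vectors, alignment_counts):
    """Project source-language vectors onto target-language words.

    src_vectors: dict mapping an English word to a 1-D numpy array.
    alignment_counts: dict mapping a target word to a dict of
        {English word: number of times the two were aligned}.
    Returns a dict mapping each target word to its projected vector.
    (Hypothetical helper, written for illustration only.)
    """
    projected = {}
    for target_word, counts in alignment_counts.items():
        # Keep only aligned English words that have a trained vector.
        pairs = [
            (src_vectors[src_word], count)
            for src_word, count in counts.items()
            if src_word in src_vectors and count > 0
        ]
        if not pairs:
            continue  # no usable alignment, so no vector for this word
        vectors = np.stack([vec for vec, _ in pairs])
        weights = np.array([count for _, count in pairs], dtype=float)
        # Weighted average: weights are normalised alignment counts.
        projected[target_word] = weights @ vectors / weights.sum()
    return projected


# Toy data (assumed, not taken from the paper).
english = {
    "water": np.array([1.0, 0.0]),
    "river": np.array([0.0, 1.0]),
}
alignments = {
    "mmiri": {"water": 3, "river": 1},
    "unaligned": {"missing": 2},
}
igbo = project_embeddings(english, alignments)
```

In this sketch, target words whose aligned English words all lack vectors are simply skipped. A real pipeline would need a fallback for them.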
Notes
- 1.
- 2.
A wordkey is a word stripped of its diacritics if it has any. Wordkeys could have multiple diacritic variants, one of which could be the same as the wordkey itself.
- 3.
The pre-trained Igbo model from the fastText Wiki word vectors project [19] was also tested, but it performed so poorly that we dropped it.
- 4.
https://code.google.com/archive/p/word2vec/.
- 5.
Highly dominant variants or very rarely occurring wordkeys were generally excluded from the datasets.
- 6.
An alternative considered is to combine the words, e.g. nkwubi okwu \(\rightarrow\) nkwubi-okwu, and update the model with a projected vector or a combination of the vectors of the constituent words.
- 7.
We intend to implement higher-order n-gram models.
- 8.
See igbonlp.org.
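The wordkey notion defined in note 2 can be sketched as follows: a minimal example, assuming that diacritics are represented as Unicode combining marks after canonical decomposition. The helper names `wordkey` and `group_variants` and the sample words are illustrative assumptions, not the authors' code.

```python
import unicodedata
from collections import defaultdict


def wordkey(word):
    """Return the word with all combining diacritical marks removed."""
    # NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return unicodedata.normalize("NFC", stripped)


def group_variants(words):
    """Map each wordkey to the set of diacritic variants seen for it."""
    variants = defaultdict(set)
    for word in words:
        # Store variants in NFC so equal forms compare equal.
        variants[wordkey(word)].add(unicodedata.normalize("NFC", word))
    return dict(variants)


# Illustrative input: three variants sharing one wordkey,
# one of which is identical to the wordkey itself.
sample = ["akwa", "\u00e1kw\u00e0", "\u00e0kw\u00e1"]
groups = group_variants(sample)
```

Here `groups` has the single key `akwa` with three variants. This matches the remark in note 2 that one variant of a wordkey may be the wordkey itself.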
References
Crandall, D.: Automatic accent restoration in Spanish text (2005). http://www.cs.indiana.edu/~djcran/projects/674_final.pdf. Accessed 7 Jan 2016
De Pauw, G., De Schryver, G.M., Pretorius, L., Levin, L.: Introduction to the special issue on African language technology. Lang. Resour. Eval. 45, 263–269 (2011)
Ezeani, I., Hepple, M., Onyenwe, I.: Automatic restoration of diacritics for Igbo language. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 198–205. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_23
Ezeani, I., Hepple, M., Onyenwe, I.: Lexical disambiguation of Igbo using diacritic restoration. In: Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and Their Applications, pp. 53–60 (2017)
Finkelstein, L., et al.: Placing search in context: the concept revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. ACM (2001)
Francom, J., Hulden, M.: Diacritic error detection and restoration via POS tags. In: Proceedings of the 6th Language and Technology Conference (2013)
Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1234–1244 (2015)
Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_35
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Onyenwe, I.E., Hepple, M., Chinedu, U., Ezeani, I.: A basic language resource kit implementation for the IgboNLP project. ACM Trans. Asian Low-Resource Lang. Inf. Process. 17(2), 10:1–10:23 (2018)
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 22 May, pp. 45–50. ELRA, Valletta (2010). http://is.muni.cz/publication/884893/en
Cocks, J., Keegan, T.-T.: A word-based approach for diacritic restoration in Māori. In: Proceedings of the Australasian Language Technology Association Workshop 2011, Canberra, Australia, pp. 126–130, December 2011. http://www.aclweb.org/anthology/U/U11/U11-2016
Tufiş, D., Chiţu, A.: Automatic diacritics insertion in Romanian texts. In: Proceedings of the International Conference on Computational Lexicography, Pecs, Hungary, pp. 185–194 (1999)
Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)
Simard, M.: Automatic insertion of accents in French text. In: Proceedings of the Third Conference on Empirical Methods for Natural Language Processing, pp. 27–35 (1998)
Wagacha, P.W., De Pauw, G., Githinji, P.W.: A grapheme-based approach to accent restoration in Gĩkũyũ. In: Fifth International Conference on Language Resources and Evaluation (2006)
Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Proceedings of 2nd Annual Workshop on Very Large Corpora, Kyoto, pp. 19–32 (1994)
Yarowsky, D.: Corpus-based techniques for restoring accents in Spanish and French text. In: Natural Language Processing Using Very Large Corpora, pp. 99–120. Kluwer Academic Publishers (1999)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ezeani, I., Hepple, M., Onyenwe, I., Enemuo, C. (2018). Multi-task Projected Embedding for Igbo. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_31
DOI: https://doi.org/10.1007/978-3-030-00794-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science, Computer Science (R0)