Multi-task Projected Embedding for Igbo

Ezeani, Ignatius; Hepple, Mark; Onyenwe, Ikechukwu; Enemuo, Chioma

doi:10.1007/978-3-030-00794-2_31


Conference paper
First Online: 08 September 2018


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

NLP research on low-resource African languages is often impeded by the unavailability of basic resources: tools, techniques, annotated corpora, and datasets. Besides the lack of funding for the manual development of these resources, building them from scratch would amount to reinventing the wheel. Adapting existing techniques and models from well-resourced languages is therefore often an attractive option. Word embeddings are among the most widely applied NLP models, but training them usually requires large amounts of data, which are not available for most African languages. In this work, we adopt an alignment-based projection method to transfer trained English embeddings to the Igbo language. Various English embedding models were projected and evaluated intrinsically on the odd-word, analogy, and word-similarity tasks, and also on the diacritic restoration task. Our results show that the projected embeddings performed very well across these tasks.
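The abstract does not spell out the projection formula. As a rough illustration only, the sketch below assumes a scheme in the style of Guo et al. (listed in the references), in which each Igbo word receives the alignment-count-weighted average of the English vectors it was aligned to. The function name `project_embeddings`, the toy vectors, and the toy alignment pairs are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def project_embeddings(alignments, source_vectors):
    """Project source-language vectors onto target-language words.

    alignments: iterable of (target_word, source_word) pairs, one pair per
        aligned token occurrence in a parallel corpus (assumed input format).
    source_vectors: dict mapping source_word -> list of floats.
    Returns a dict mapping target_word -> weighted average vector.
    """
    # Count how often each target word is aligned to each known source word.
    counts = defaultdict(lambda: defaultdict(int))
    for target, source in alignments:
        if source in source_vectors:  # skip source words without a vector
            counts[target][source] += 1

    projected = {}
    for target, source_counts in counts.items():
        total = sum(source_counts.values())
        dim = len(source_vectors[next(iter(source_counts))])
        vector = [0.0] * dim
        for source, count in source_counts.items():
            weight = count / total  # relative alignment frequency
            for i, value in enumerate(source_vectors[source]):
                vector[i] += weight * value
        projected[target] = vector
    return projected

# Toy data (hypothetical): 2-dimensional "English" vectors and word alignments.
english = {"water": [1.0, 0.0], "river": [0.0, 1.0], "house": [0.5, 0.5]}
pairs = (
    [("mmiri", "water")] * 3
    + [("mmiri", "river")]
    + [("ụlọ", "house")]
    + [("osimiri", "stream")]  # "stream" has no vector, so this pair is skipped
)
projected = project_embeddings(pairs, english)
```

In this toy setup, a target word aligned three times to one source word and once to another ends up three quarters of the way toward the first vector. Target words whose aligned source words all lack vectors receive no projection at all.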


Notes

1. jw.org.
2. A wordkey is a word stripped of its diacritics, if it has any. A wordkey may have multiple diacritic variants, one of which may be identical to the wordkey itself.
3. The pre-trained Igbo model from the fastText Wiki word vectors project [19] was also tested, but it performed so poorly that we dropped it.
4. https://code.google.com/archive/p/word2vec/.
5. Highly dominant variants and very rarely occurring wordkeys were generally excluded from the datasets.
6. An alternative considered is to combine the words, e.g. nkwubi okwu → nkwubi-okwu, and update the model with a projected vector or a combination of the vectors of the constituent words.
7. We intend to implement higher-order n-gram models.
8. See igbonlp.org.
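The wordkey definition in note 2 can be made concrete with a short sketch. It assumes that "stripped of its diacritics" means removing every Unicode combining mark (tone marks as well as dots above and below) after canonical decomposition; the helper names `wordkey` and `group_variants` and the sample tokens are illustrative and do not come from the paper.

```python
import unicodedata
from collections import defaultdict

def wordkey(word):
    """Return the word with all combining diacritical marks removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

def group_variants(tokens):
    """Map each wordkey to the set of diacritic variants seen in tokens."""
    variants = defaultdict(set)
    for token in tokens:
        variants[wordkey(token)].add(unicodedata.normalize("NFC", token))
    return dict(variants)

# Sample tokens (illustrative): three variants share the wordkey "akwa",
# and one of them is identical to the wordkey itself.
tokens = ["ákwà", "àkwà", "akwa", "ọnụ"]
variants = group_variants(tokens)
```

A diacritic restoration system along these lines would treat each wordkey with more than one variant as an ambiguous item and choose among its variants from context.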

References

1. Crandall, D.: Automatic Accent Restoration in Spanish text (2005). http://www.cs.indiana.edu/~djcran/projects/674_final.pdf. Accessed 7 Jan 2016
2. De Pauw, G., De Schryver, G.M., Pretorius, L., Levin, L.: Introduction to the special issue on African language technology. Lang. Resour. Eval. 45, 263–269 (2011)
3. Ezeani, I., Hepple, M., Onyenwe, I.: Automatic restoration of diacritics for Igbo language. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 198–205. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_23
4. Ezeani, I., Hepple, M., Onyenwe, I.: Lexical disambiguation of Igbo using diacritic restoration. In: Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and Their Applications, pp. 53–60 (2017)
5. Finkelstein, L., et al.: Placing search in context: the concept revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. ACM (2001)
6. Francom, J., Hulden, M.: Diacritic error detection and restoration via POS tags. In: Proceedings of the 6th Language and Technology Conference (2013)
7. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1234–1244 (2015)
8. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_35
9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013)
10. Onyenwe, I.E., Hepple, M., Chinedu, U., Ezeani, I.: A basic language resource kit implementation for the IgboNLP project. ACM Trans. Asian Low-Resource Lang. Inf. Process. 17(2), 10:1–10:23 (2018)
11. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 22 May, pp. 45–50. ELRA, Valletta (2010). http://is.muni.cz/publication/884893/en
12. Cocks, J., Keegan, T.-T.: A word-based approach for diacritic restoration in Māori. In: Proceedings of the Australasian Language Technology Association Workshop 2011, Canberra, Australia, pp. 126–130, December 2011. http://www.aclweb.org/anthology/U/U11/U11-2016
13. Tufiş, D., Chiţu, A.: Automatic diacritics insertion in Romanian texts. In: Proceedings of the International Conference on Computational Lexicography, Pecs, Hungary, pp. 185–194 (1999)
14. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)
15. Simard, M.: Automatic insertion of accents in French text. In: Proceedings of the Third Conference on Empirical Methods for Natural Language Processing, pp. 27–35 (1998)
16. Wagacha, P.W., De Pauw, G., Githinji, P.W.: A grapheme-based approach to accent restoration in Gĩkũyũ. In: Fifth International Conference on Language Resources and Evaluation (2006)
17. Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Proceedings of 2nd Annual Workshop on Very Large Corpora, Kyoto, pp. 19–32 (1994)
18. Yarowsky, D.: Corpus-Based Techniques for Restoring Accents in Spanish and French Text. In: Natural Language Processing Using Very Large Corpora, pp. 99–120. Kluwer Academic Publishers (1999)
19. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606 (2016)


Author information

Authors and Affiliations

Department of Computer Science, The University of Sheffield, Sheffield, UK
Ignatius Ezeani, Mark Hepple, Ikechukwu Onyenwe & Chioma Enemuo


Corresponding author

Correspondence to Ignatius Ezeani.

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Aleš Horák
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Masaryk University, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Ezeani, I., Hepple, M., Onyenwe, I., Enemuo, C. (2018). Multi-task Projected Embedding for Igbo. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science (LNAI), vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_31

DOI: https://doi.org/10.1007/978-3-030-00794-2_31
Published: 08 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00793-5
Online ISBN: 978-3-030-00794-2
eBook Packages: Computer Science, Computer Science (R0)
