Multi-task Projected Embedding for Igbo

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11107)

Abstract

NLP research on low-resource African languages is often impeded by the unavailability of basic resources: tools, techniques, annotated corpora, and datasets. Funding for the manual development of these resources is scarce, and building them from scratch would amount to reinventing the wheel. Adapting existing techniques and models from well-resourced languages is therefore often an attractive option. Word embeddings are among the most widely applied NLP models, but training them typically requires large amounts of data, which are not available for most African languages. In this work, we adopt an alignment-based projection method to transfer trained English embeddings to the Igbo language. Various English embedding models were projected and evaluated intrinsically on the odd-word, analogy, and word-similarity tasks, and also on the diacritic restoration task. Our results show that the projected embeddings performed very well across these tasks.
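To make the idea of alignment-based projection concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes one common formulation, in the spirit of the cross-lingual projection in [7], in which each target-language word receives the alignment-weighted average of the vectors of the source-language words it is aligned to. The function name `project_embeddings`, the toy alignment counts, and the two-dimensional vectors are all hypothetical and chosen only for illustration.

```python
def project_embeddings(alignment_counts, source_vectors):
    """Project source-language vectors onto target-language words.

    alignment_counts: {target_word: {source_word: count}}, e.g. obtained
        from a word-aligned parallel corpus (assumed input format).
    source_vectors: {source_word: [float, ...]} trained source embeddings.
    Returns {target_word: [float, ...]} for every target word that has at
    least one aligned source word with a known vector.
    """
    projected = {}
    for target_word, counts in alignment_counts.items():
        # Keep only aligned source words that actually have a vector.
        usable = {w: c for w, c in counts.items()
                  if c > 0 and w in source_vectors}
        total = sum(usable.values())
        if total == 0:
            continue  # nothing to project from for this word
        dim = len(source_vectors[next(iter(usable))])
        vector = [0.0] * dim
        for source_word, count in usable.items():
            weight = count / total  # relative alignment frequency
            for i, value in enumerate(source_vectors[source_word]):
                vector[i] += weight * value
        projected[target_word] = vector
    return projected


# Toy data (hypothetical): "mmiri" aligned 3 times to "water", once to "rain".
english_vectors = {"water": [1.0, 0.0], "rain": [0.0, 1.0]}
alignments = {"mmiri": {"water": 3, "rain": 1}}
print(project_embeddings(alignments, english_vectors)["mmiri"])  # [0.75, 0.25]
```

Under this assumption, target words with no aligned source word in the embedding vocabulary simply receive no vector; how such words are handled in practice is a separate design decision.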


Notes

  1. jw.org.

  2. A wordkey is a word stripped of its diacritics, if it has any. A wordkey can have multiple diacritic variants, one of which may be identical to the wordkey itself.

  3. The pre-trained Igbo model from the fastText Wiki word vectors project [19] was also tested, but it performed so poorly that we dropped it.

  4. https://code.google.com/archive/p/word2vec/.

  5. Highly dominant variants or very rarely occurring wordkeys were generally excluded from the datasets.

  6. An alternative considered is to combine the words, e.g. nkwubi okwu → nkwubi-okwu, and update the model with a projected vector or a combination of the vectors of the constituent words.

  7. We intend to implement higher-level n-gram models.

  8. See igbonlp.org.
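The wordkey definition in note 2 can be illustrated with a short sketch. It assumes that "stripping diacritics" means removing all Unicode combining marks after canonical decomposition, which covers both tone marks and the dot-below and dot-above letters of Igbo; the paper's actual preprocessing may differ. The function names `wordkey` and `group_variants` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def wordkey(word):
    """Return the word with all combining marks (diacritics) removed."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def group_variants(words):
    """Map each wordkey to the set of diacritic variants seen for it."""
    groups = defaultdict(set)
    for word in words:
        # Normalise to NFC so visually identical variants compare equal.
        groups[wordkey(word)].add(unicodedata.normalize("NFC", word))
    return dict(groups)


print(wordkey("ákwà"))  # akwa
variants = group_variants(["akwa", "ákwà", "àkwá", "ọ"])
print(sorted(variants))  # ['akwa', 'o']
```

Here the undiacritized form "akwa" is itself counted as one of the variants of its wordkey, matching the remark in note 2 that a variant may be identical to the wordkey.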

References

  1. Crandall, D.: Automatic Accent Restoration in Spanish text (2005). http://www.cs.indiana.edu/~djcran/projects/674_final.pdf. Accessed 7 Jan 2016

  2. De Pauw, G., De Schryver, G.M., Pretorius, L., Levin, L.: Introduction to the special issue on African language technology. Lang. Resour. Eval. 45, 263–269 (2011)


  3. Ezeani, I., Hepple, M., Onyenwe, I.: Automatic restoration of diacritics for Igbo language. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2016. LNCS (LNAI), vol. 9924, pp. 198–205. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45510-5_23


  4. Ezeani, I., Hepple, M., Onyenwe, I.: Lexical disambiguation of Igbo using diacritic restoration. In: Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and Their Applications, pp. 53–60 (2017)


  5. Finkelstein, L., et al.: Placing search in context: the concept revisited. In: Proceedings of the 10th International Conference on World Wide Web, pp. 406–414. ACM (2001)


  6. Francom, J., Hulden, M.: Diacritic error detection and restoration via POS tags. In: Proceedings of the 6th Language and Technology Conference (2013)


  7. Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1234–1244 (2015)


  8. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45715-1_35


  9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space, arXiv preprint arXiv:1301.3781 (2013)

  10. Onyenwe, I.E., Hepple, M., Chinedu, U., Ezeani, I.: A basic language resource kit implementation for the IgboNLP project. ACM Trans. Asian Low-Resource Lang. Inf. Process. 17(2), 101–1023 (2018)


  11. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 22 May, pp. 45–50. ELRA, Valletta (2010). http://is.muni.cz/publication/884893/en

  12. Cocks, J., Keegan, T.-T.: A word-based approach for diacritic restoration in Māori. In: Proceedings of the Australasian Language Technology Association Workshop 2011, Canberra, Australia, pp. 126–130, December 2011. http://www.aclweb.org/anthology/U/U11/U11-2016

  13. Tufiş, D., Chiţu, A.: Automatic diacritics insertion in Romanian texts. In: Proceedings of the International Conference on Computational Lexicography, Pecs, Hungary, pp. 185–194 (1999)


  14. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011)


  15. Simard, M.: Automatic insertion of accents in French text. In: Proceedings of the Third Conference on Empirical Methods for Natural Language Processing, pp. 27–35 (1998)


  16. Wagacha, P.W., De Pauw, G., Githinji, P.W.: A grapheme-based approach to accent restoration in Gĩkũyũ. In: Fifth International Conference on Language Resources and Evaluation (2006)


  17. Yarowsky, D.: A comparison of corpus-based techniques for restoring accents in Spanish and French text. In: Proceedings of 2nd Annual Workshop on Very Large Corpora, Kyoto, pp. 19–32 (1994)


  18. Yarowsky, D.: Corpus-Based Techniques for Restoring Accents in Spanish and French Text, Natural Language Processing Using Very Large Corpora. Kluwer Academic Publishers, pp. 99–120 (1999)


  19. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information, arXiv preprint arXiv:1607.04606 (2016)


Author information

Corresponding author

Correspondence to Ignatius Ezeani.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Ezeani, I., Hepple, M., Onyenwe, I., Enemuo, C. (2018). Multi-task Projected Embedding for Igbo. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2018. Lecture Notes in Computer Science, vol 11107. Springer, Cham. https://doi.org/10.1007/978-3-030-00794-2_31


  • DOI: https://doi.org/10.1007/978-3-030-00794-2_31


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00793-5

  • Online ISBN: 978-3-030-00794-2

  • eBook Packages: Computer Science, Computer Science (R0)
