A survey on training and evaluation of word embeddings

Abstract

Word embeddings have proven effective for many natural language processing tasks by providing word representations that integrate prior knowledge. In this article, we focus on the algorithms and models used to compute these representations, and on their methods of evaluation. Many new techniques have been developed in a short amount of time, and no unified terminology exists to emphasise their strengths and weaknesses. Based on the state of the art, we propose a thorough terminology to help classify these various models and their evaluations. We also compare these algorithms and methods, highlighting open problems and research directions, and compile popular evaluation metrics and datasets. This survey provides: (1) an exhaustive description and terminology of currently investigated word embeddings, (2) a clear segmentation of evaluation methods and their associated datasets, and (3) high-level properties indicating the pros and cons of each solution.

Figs. 1–11 (figure images not reproduced here)

Notes

  1. Using the FNV algorithm (Bojanowski et al. [8]).

  2. Source code for the tetrahedron by Ignasi: https://tex.stackexchange.com/questions/174317/creating-a-labeled-tetrahedron-with-tikzpicture.

  3. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.

  4. https://github.com/facebookresearch/fastText.
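The FNV hashing in note 1 is used by fastText-style models to map character n-grams into a fixed number of embedding buckets. A minimal Python sketch of the idea is given below; the function names and default bucket count are illustrative assumptions, not fastText's actual API, and fastText's C++ implementation differs in byte-handling details:

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash (offset basis 2166136261, prime 16777619)."""
    h = 2166136261
    for b in data:
        h ^= b                          # xor the next byte in
        h = (h * 16777619) & 0xFFFFFFFF # multiply by the FNV prime, mod 2^32
    return h

def subword_bucket(ngram: str, n_buckets: int = 2_000_000) -> int:
    """Map a character n-gram to one of n_buckets subword embedding rows."""
    return fnv1a_32(ngram.encode("utf-8")) % n_buckets
```

Hashing lets the model keep a fixed-size subword table regardless of how many distinct n-grams occur in the corpus, at the cost of occasional collisions between rare n-grams.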

References

  1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649. ACL (2018)

  2. Almuhareb, A.: Attributes in lexical acquisition. PhD thesis, University of Essex (2006)

  3. Amsaleg, L., Chelly, O., Furon, T., Girard, S., Houle, M.E., Kawarabayashi, K.-I., Nett, M.: Estimating local intrinsic dimensionality. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, New York, NY, USA, pp. 29–38. ACM (2015)

  4. Bakarov, A.: A survey of word embeddings evaluation methods. CoRR (2018). arXiv:1801.09536

  5. Baroni, M., Dinu, G., Kruszewski, G.: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 238–247. ACL (2014)

  6. Baroni, M., Lenci, A.: How we BLESSed distributional semantic evaluation. In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Edinburgh, UK, pp. 1–10. ACL (2011)

  7. Baroni, M., Murphy, B., Barbu, E., Poesio, M.: Strudel: a corpus-based semantic model based on properties and types. Cogn. Sci. 34(2), 222–254 (2010)

  8. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Ling. 5, 135–146 (2017)

  9. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 2787–2795. Curran Associates, Inc. (2013)

  10. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. OpenAI publications (2020)

  11. Bruni, E., Tran, N.K., Baroni, M.: Multimodal distributional semantics. J. Artif. Int. Res. 49(1), 1–47 (2014)

  12. Claveau, V., Kijak, E.: Direct vs. indirect evaluation of distributional thesauri. In: International Conference on Computational Linguistics, COLING, Osaka, Japan (2016)

  13. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)

  14. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp. 4171–4186 (2019). arXiv:1810.04805

  15. Houle, M.E., Kashima, H., Nett, M.: Generalized expansion dimension. In: Proceedings of the 12th IEEE International Conference on Data Mining Workshops, ICDMW 2012, pp. 587–594 (2012)

  16. Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., Ruppin, E.: Placing search in context: the concept revisited. ACM Trans. Inf. Syst. 20(1), 116–131 (2002)

  17. Ganea, O., Bécigneul, G., Hofmann, T.: Hyperbolic neural networks. CoRR (2018). arXiv:1805.09112

  18. Gerz, D., Vulić, I., Hill, F., Reichart, R., Korhonen, A.: SimVerb-3500: a large-scale evaluation set of verb similarity. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016). arXiv:1608.00869

  19. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7–12, 2018 (2018)

  20. Gu, J., Bradbury, J., Xiong, C., Li, V.O.K., Socher, R.: Non-autoregressive neural machine translation. In: International Conference on Learning Representations, ICLR (2018). arXiv:1711.02281

  21. Harris, Z.: Distributional structure. Word 10, 146–162 (1954)

  22. Hill, F., Reichart, R., Korhonen, A.: SimLex-999: evaluating semantic models with (genuine) similarity estimation. Am. J. Comput. Ling. 41(4), 665–695 (2015)

  23. Iacobacci, I., Pilehvar, M.T., Navigli, R.: SensEmbed: learning sense embeddings for word and relational similarity. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 95–105. ACL (2015)

  24. Józefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the limits of language modeling. CoRR (2016). arXiv:1602.02410

  25. Karypis, G.: CLUTO: a clustering toolkit. Technical Report 02-017, University of Minnesota, Department of Computer Science (2003)

  26. Kochurov, M., Kozlukov, S., Karimov, R., Yanush, V.: Geoopt: adaptive Riemannian optimization in PyTorch (2019). https://github.com/geoopt/geoopt

  27. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270. ACL (2016)

  28. Laub, J., Müller, K.-R.: Feature discovery in non-metric pairwise data. J. Mach. Learn. Res. 5, 801–818 (2004)

  29. Leimeister, M., Wilson, B.J.: Skip-gram word embeddings in hyperbolic space. CoRR (2018). arXiv:1809.01498

  30. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 2177–2185. Curran Associates, Inc. (2014)

  31. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Ling. 3, 211–225 (2015)

  32. Ling, W., Tsvetkov, Y., Amir, S., Fermandez, R., Dyer, C., Black, A.W., Trancoso, I., Lin, C.-C.: Not all contexts are created equal: better word representations with variable attention. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1367–1372. ACL (2015)

  33. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: a robustly optimized BERT pretraining approach (2019). arXiv:1907.11692

  34. Lu, W., Zhang, Y., Wang, S., Huang, H., Liu, Q., Luo, S.: Concept representation by learning explicit and implicit concept couplings. IEEE Intell. Syst. (2020)

  35. Luong, T., Socher, R., Manning, C.: Better word representations with recursive neural networks for morphology. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, pp. 104–113. ACL (2013)

  36. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

  37. McCann, B., Bradbury, J., Xiong, C., Socher, R.: Learned in translation: contextualized word vectors (2017). arXiv:1708.00107

  38. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings (2013)

  39. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119. Curran Associates, Inc. (2013)

  40. Mikolov, T., Yih, W.-T., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 746–751. ACL (2013)

  41. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)

  42. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6338–6347. Curran Associates, Inc. (2017)

  43. Nickel, M., Kiela, D.: Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In: Proceedings of the International Conference on Machine Learning, ICML (2018)

  44. Niven, T., Kao, H.: Probing neural network comprehension of natural language arguments. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019)

  45. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  46. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Proceedings of NAACL (2018)

  47. Poling, B., Lerman, G.: A new approach to two-view motion segmentation using global dimension minimization. Int. J. Comput. Vis. 108(3), 165–185 (2014)

  48. Radford, A.: Improving language understanding by generative pre-training. OpenAI publications (2018)

  49. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI publications (2019)

  50. Roy, O., Vetterli, M.: The effective rank: a measure of effective dimensionality. In: 2007 15th European Signal Processing Conference, pp. 606–610 (2007)

  51. Rubenstein, H., Goodenough, J.B.: Contextual correlates of synonymy. Commun. ACM 8(10), 627–633 (1965)

  52. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003)

  53. Schakel, A.M.J., Wilson, B.J.: Measuring word significance using distributed representations of words. CoRR (2015). arXiv:1508.02297

  54. Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 298–307. ACL (2015)

  55. Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C.D., Ng, A., Potts, C.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1631–1642. ACL (2013)

  56. Sun, C., Yan, H., Qiu, X., Huang, X.: Gaussian word embedding with a Wasserstein distance loss. CoRR (2018). arXiv:1808.07016

  57. Sun, K., Wang, J., Kalousis, A., Marchand-Maillet, S.: Space-time local embeddings. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 100–108. Curran Associates, Inc. (2015)

  58. Tifrea, A., Becigneul, G., Ganea, O.-E.: Poincaré GloVe: hyperbolic word embeddings. In: International Conference on Learning Representations (ICLR 2019) (2019)

  59. Torregrossa, F., Claveau, V., Kooli, N., Gravier, G., Allesiardo, R.: On the correlation of word embedding evaluation metrics. In: Proceedings of The 12th Language Resources and Evaluation Conference, Marseille, France, pp. 4789–4797. European Language Resources Association (2020)

  60. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 5998–6008. Curran Associates, Inc. (2017)

  61. Vilnis, L., McCallum, A.: Word representations via Gaussian embedding. In: International Conference on Learning Representations, ICLR 2015 (2015)

  62. Vulić, I., Gerz, D., Kiela, D., Hill, F., Korhonen, A.: HyperLex: a large-scale evaluation of graded lexical entailment. Am. J. Comput. Ling. 43(4), 781–835 (2017)

  63. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. ACL (2018)

  64. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T.L., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: HuggingFace's transformers: state-of-the-art natural language processing (2019). arXiv:1910.03771

  65. Yang, Z., Dai, Z., Yang, Y., Carbonell, J.G., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. CoRR (2019). arXiv:1906.08237

  66. You, Y., Li, J., Hseu, J., Song, X., Demmel, J., Hsieh, C.: Reducing BERT pre-training time from 3 days to 76 minutes. CoRR (2019). arXiv:1904.00962

  67. Zhai, X., Oliver, A., Kolesnikov, A., Beyer, L.: S4L: self-supervised semi-supervised learning. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)

  68. Zhou, P., Qi, Z., Zheng, S., Xu, J., Bao, H., Xu, B.: Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics (2016)

Author information

Corresponding author

Correspondence to François Torregrossa.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Nihel Kooli and Robin Allesiardo have moved to another company/organization.

About this article

Cite this article

Torregrossa, F., Allesiardo, R., Claveau, V. et al. A survey on training and evaluation of word embeddings. Int J Data Sci Anal (2021). https://doi.org/10.1007/s41060-021-00242-8

Keywords

  • Word embeddings
  • Word embedding evaluation
  • Survey
  • Contextualised embeddings
  • Non-Euclidean embeddings