Do Scaling Algorithms Preserve Word2Vec Semantics? A Case Study for Medical Entities

  • Janus Wawrzinek
  • José María González Pinto
  • Philipp Markiewka
  • Wolf-Tilo Balke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11371)


The exponential increase of scientific publications in the biomedical field makes it harder to access scientific information, which is primarily encoded by semantic relationships between medical entities such as active ingredients, diseases, or genes. Neural language models such as Word2Vec offer new ways of automatically learning semantically meaningful entity relationships, even from large text corpora. They scale well and deliver better accuracy than comparable approaches. However, the models first have to be tuned by testing different training parameters. Arguably the most critical parameter is the number of training dimensions for the neural network, and testing different numbers of dimensions individually is time-consuming: a single training iteration on a large corpus usually takes hours or even days. In this paper we show a more efficient way to determine the optimal number of dimensions with respect to quality measures such as precision and recall. We show that the quality of results obtained with simpler, easier-to-compute scaling approaches such as MDS or PCA correlates strongly with the quality expected when using the same number of Word2Vec training dimensions. The benefit is even greater if, after initial Word2Vec training, only a limited number of entities and their respective relations are of interest.
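The scaling side of this idea can be sketched in a few lines. The following is a minimal illustration, not the authors' pipeline: it assumes entity vectors are already available as a NumPy matrix (here replaced by synthetic stand-in data), reduces them with a plain SVD-based PCA, and uses top-k neighbor overlap against the original vectors as a rough proxy for the precision-style measures mentioned above. The function names, the synthetic data, and the choice of k are all assumptions made for illustration; the comparison with Word2Vec models retrained at each dimensionality is not reproduced here.

```python
import numpy as np


def pca_reduce(vectors, n_dims):
    """Project row vectors onto their first n_dims principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T


def top_k_neighbors(vectors, k):
    """For every row, return the indices of its k most cosine-similar rows."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)  # never count an entity as its own neighbor
    return np.argsort(-sims, axis=1)[:, :k]


def neighbor_overlap(reference, candidate, k=10):
    """Mean fraction of top-k reference neighbors that the candidate keeps."""
    ref = top_k_neighbors(reference, k)
    cand = top_k_neighbors(candidate, k)
    hits = [len(set(r) & set(c)) for r, c in zip(ref, cand)]
    return float(np.mean(hits)) / k


# Synthetic stand-in for trained entity embeddings (assumption: 200 entities,
# 50 dimensions, with most of the structure living in 10 latent directions).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 50))
embeddings = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Center once so the reference and the PCA projections share the same origin.
reference = embeddings - embeddings.mean(axis=0)

# Neighbor overlap for several candidate target dimensionalities.
overlap_by_dim = {
    d: neighbor_overlap(reference, pca_reduce(reference, d), k=10)
    for d in (2, 5, 10, 25, 50)
}

for d, score in overlap_by_dim.items():
    print(f"{d:>2} dims: overlap@10 = {score:.3f}")
```

Because the reduction is a single matrix decomposition rather than a new training run, sweeping many candidate dimensionalities this way is cheap. Restricting `reference` to the rows of the entities of interest makes it cheaper still, which corresponds to the case noted at the end of the abstract.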


Keywords: Information extraction · Neural language models · Scaling approaches



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. IFIS TU-Braunschweig, Brunswick, Germany
