Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study

  • Kazi Lutful Kabir
  • Fardina Fathmiul Alam
  • Anika Binte Islam
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 141)


Extracting semantic resemblance from text data is an important task in text mining. Among the several approaches to this task, strategies based on distributional semantics have been found to be reasonably effective, and a number of high-quality semantic word embeddings of this kind are publicly available. This article compares a few of them, both qualitatively and quantitatively, to determine which is most suitable for handling large amounts of text data. The techniques considered are also contrasted with traditional semantic analysis and shown to be superior to it.


Keywords: Centroid approach, Distributional semantics, Semantic analysis, Text data, Word embedding


Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Kazi Lutful Kabir (1, corresponding author)
  • Fardina Fathmiul Alam (1)
  • Anika Binte Islam (2)
  1. Department of Computer Science, George Mason University, Fairfax, USA
  2. Department of Computer Science and Engineering, Military Institute of Science and Technology, Dhaka, Bangladesh
