Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study

  • Conference paper

Part of the book series: Smart Innovation, Systems and Technologies ((SIST,volume 141))

Abstract

Extracting semantic resemblance from text data is an important task in text mining. Among the several approaches to this task, strategies based on distributional semantics have been found to be reasonably effective, and a number of high-quality semantic word embeddings of this kind are publicly available. This article compares a few of them, both qualitatively and quantitatively, to determine which is most suitable for handling a large amount of text data. The techniques considered are also contrasted with traditional semantic analysis methods and are shown to be superior.

Author information

Corresponding author

Correspondence to Kazi Lutful Kabir.

Copyright information

© 2020 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kabir, K.L., Alam, F.F., Islam, A.B. (2020). Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study. In: Somani, A.K., Shekhawat, R.S., Mundra, A., Srivastava, S., Verma, V.K. (eds) Smart Systems and IoT: Innovations in Computing. Smart Innovation, Systems and Technologies, vol 141. Springer, Singapore. https://doi.org/10.1007/978-981-13-8406-6_30
