Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study

Kabir, Kazi Lutful; Alam, Fardina Fathmiul; Islam, Anika Binte

doi:10.1007/978-981-13-8406-6_30

Conference paper
First Online: 27 October 2019

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 141)

Abstract

Extracting semantic resemblance from text data is an important task in text mining. Among the several approaches to this task, strategies based on distributional semantics have been found to be reasonably effective. A number of high-quality semantic word embeddings of this kind are publicly available. The aim of this article is to compare a few of them, both qualitatively and quantitatively, and to determine which is more suitable for handling a large amount of text data. The techniques considered are also contrasted with traditional semantic analyses and shown to be superior.
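As a rough illustration of what measuring semantic resemblance with word embeddings can look like, the sketch below averages the word vectors of each text and compares the averages with cosine similarity. This is only one common baseline and is not claimed to be the procedure used in the paper. The embedding table `TOY_EMBEDDINGS` and the function names are hypothetical and invented for this example; in practice the vectors would come from a pre-trained model.

```python
import math

# Hypothetical 3-dimensional toy embeddings, invented for illustration only.
# Real pre-trained word vectors usually have tens to hundreds of dimensions.
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "pet": [0.7, 0.3, 0.0],
    "stock": [0.0, 0.1, 0.9],
    "market": [0.1, 0.0, 0.8],
}


def document_vector(tokens, embeddings):
    """Average the vectors of all tokens present in the embedding table."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return None  # no token of this text has a vector
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_resemblance(text_a, text_b, embeddings=TOY_EMBEDDINGS):
    """Similarity of two texts via averaged word vectors (assumed baseline)."""
    vec_a = document_vector(text_a.lower().split(), embeddings)
    vec_b = document_vector(text_b.lower().split(), embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    print(text_resemblance("cat kitten", "pet"))
    print(text_resemblance("cat kitten", "stock market"))
```

With these toy vectors the first pair scores noticeably higher than the second, which is the behaviour one would hope for from embeddings that capture distributional similarity.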

Author information

Authors and Affiliations

Department of Computer Science, George Mason University, 4400 University Drive, MSN 4A5, Fairfax, VA, 22030, USA
Kazi Lutful Kabir & Fardina Fathmiul Alam
Department of Computer Science and Engineering, Military Institute of Science and Technology, Mirpur Cantonment, Dhaka, 1216, Bangladesh
Anika Binte Islam

Corresponding author

Correspondence to Kazi Lutful Kabir.

Editor information

Editors and Affiliations

College of Engineering, Iowa State University, Ames, IA, USA
Arun K. Somani
School of Computing and Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India
Rajveer Singh Shekhawat
Department of Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India
Ankit Mundra
Department of Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India
Sumit Srivastava
School of Computing and Information Technology, Manipal University Jaipur, Jaipur, Rajasthan, India
Vivek Kumar Verma

30.1 Electronic supplementary material

Supplementary material 1 (pdf 664 KB)

Cite this paper

Kabir, K.L., Alam, F.F., Islam, A.B. (2020). Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study. In: Somani, A.K., Shekhawat, R.S., Mundra, A., Srivastava, S., Verma, V.K. (eds) Smart Systems and IoT: Innovations in Computing. Smart Innovation, Systems and Technologies, vol 141. Springer, Singapore. https://doi.org/10.1007/978-981-13-8406-6_30

DOI: https://doi.org/10.1007/978-981-13-8406-6_30
Published: 27 October 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-8405-9
Online ISBN: 978-981-13-8406-6
eBook Packages: Intelligent Technologies and Robotics (R0)
