Abstract
Extracting semantic resemblance from text data is an important task in text mining. Among the several approaches to this task, strategies based on distributional semantics have been found to be reasonably effective, and a number of high-quality semantic word embeddings built on them are publicly available. This article compares a few of these embeddings both qualitatively and quantitatively to determine which is most suitable for handling large amounts of text data. The techniques considered are also contrasted with traditional semantic analysis methods and shown to be superior to them.
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kabir, K.L., Alam, F.F., Islam, A.B. (2020). Word Embeddings for Semantic Resemblance of Substantial Text Data: A Comparative Study. In: Somani, A.K., Shekhawat, R.S., Mundra, A., Srivastava, S., Verma, V.K. (eds) Smart Systems and IoT: Innovations in Computing. Smart Innovation, Systems and Technologies, vol 141. Springer, Singapore. https://doi.org/10.1007/978-981-13-8406-6_30
DOI: https://doi.org/10.1007/978-981-13-8406-6_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-8405-9
Online ISBN: 978-981-13-8406-6
eBook Packages: Intelligent Technologies and Robotics (R0)