Word Mover’s Distance for Agglomerative Short Text Clustering

Franciscus, Nigel; Ren, Xuguang; Wang, Junhu; Stantic, Bela

doi:10.1007/978-3-030-14799-0_11

Conference paper
First Online: 07 March 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11431)

Abstract

In the era of information overload, text clustering plays an important part in the analysis pipeline. Partitioning high-quality texts into unseen categories greatly helps applications in information retrieval, databases, and business intelligence. Short texts from social media, such as tweets, nevertheless remain difficult to interpret because of the breadth of their contexts. Traditional text similarity approaches rely only on lexical matching and ignore the semantic meaning of words. Recent advances in distributional semantics offer an alternative: using high-quality word embeddings to aid the interpretation of text semantics. In this paper, we investigate the word mover’s distance metric for automatically clustering short texts using word-level semantic information. We use an agglomerative strategy as the clustering method to efficiently group texts by similarity. Our experiments indicate that the word mover’s distance outperforms other standard metrics in the short text clustering task.
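
The pipeline outlined in the abstract (an embedding-based document distance fed into agglomerative clustering) can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the 2-D word vectors in `EMB` and the four toy documents in `DOCS` are invented for illustration, average linkage is an arbitrary choice because the abstract does not name a linkage criterion, and the exact word mover's distance is replaced by the relaxed variant (a lower bound in which every word moves all of its mass to its nearest counterpart) so that the example needs only the standard library.

```python
import math
from itertools import combinations

# Hypothetical 2-D word vectors, invented for illustration only.
# A real experiment would load pretrained embeddings instead.
EMB = {
    "obama": (1.0, 0.0), "president": (0.9, 0.1), "speaks": (0.8, 0.2),
    "greets": (0.8, 0.3), "media": (0.9, 0.2), "press": (1.0, 0.2),
    "band": (0.0, 1.0), "singer": (0.1, 0.9), "plays": (0.2, 0.8),
    "performs": (0.3, 0.8), "concert": (0.1, 1.0), "show": (0.2, 0.9),
}

# Toy short texts (already tokenized, stop words removed).
DOCS = [
    ["obama", "speaks", "media"],
    ["president", "greets", "press"],
    ["band", "plays", "concert"],
    ["singer", "performs", "show"],
]


def rwmd(doc_a, doc_b):
    """Relaxed word mover's distance with uniform word weights."""
    def one_way(src, dst):
        # Each source word sends all of its mass to its nearest target word.
        return sum(min(math.dist(EMB[w], EMB[v]) for v in dst) for w in src) / len(src)
    # Both one-sided values are lower bounds on WMD; keep the tighter one.
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))


def agglomerate(docs, k):
    """Average-linkage agglomerative clustering down to k clusters."""
    pair = {(i, j): rwmd(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}

    def linkage(c1, c2):
        # Mean pairwise document distance between two clusters.
        total = sum(pair[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > k:
        # Merge the two closest clusters.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return [sorted(c) for c in clusters]


print(agglomerate(DOCS, 2))
```

Although no document shares a word with any other, the two politics-themed texts and the two music-themed texts end up in separate clusters, which is the behaviour a purely lexical measure cannot provide. For the exact distance, the one-sided nearest-neighbour step would be replaced by an optimal transport solver.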

Author information

Authors and Affiliations

Institute for Integrated and Intelligent Systems, Brisbane, QLD, Australia
Nigel Franciscus, Xuguang Ren, Junhu Wang & Bela Stantic

Corresponding author

Correspondence to Nigel Franciscus.

Editor information

Editors and Affiliations

Ton Duc Thang University, Ho Chi Minh City, Vietnam
Ngoc Thanh Nguyen
Bina Nusantara University, Jakarta, Indonesia
Ford Lumban Gaol
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński

About this paper

Cite this paper

Franciscus, N., Ren, X., Wang, J., Stantic, B. (2019). Word Mover’s Distance for Agglomerative Short Text Clustering. In: Nguyen, N., Gaol, F., Hong, TP., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2019. Lecture Notes in Computer Science, vol 11431. Springer, Cham. https://doi.org/10.1007/978-3-030-14799-0_11

DOI: https://doi.org/10.1007/978-3-030-14799-0_11
Published: 07 March 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14798-3
Online ISBN: 978-3-030-14799-0
eBook Packages: Computer Science, Computer Science (R0)
