Word Mover’s Distance for Agglomerative Short Text Clustering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11431)

Abstract

In the era of information overload, text clustering plays an important part in the analysis pipeline. Partitioning high-quality texts into previously unseen categories greatly helps applications in information retrieval, databases, and business intelligence. Short texts from social media environments, such as tweets, however, remain difficult to interpret because of the broad range of contexts in which they appear. Traditional text similarity approaches rely only on lexical matching and ignore the semantic meaning of words. Recent advances in distributional semantics have opened an alternative approach that uses high-quality word embeddings to aid the interpretation of text semantics. In this paper, we investigate the word mover’s distance metric for automatically clustering short texts using word-level semantic information. We use an agglomerative strategy as the clustering method to efficiently group texts based on their similarity. The experiments indicate that the word mover’s distance outperforms other standard metrics in the short text clustering task.
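The combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two-dimensional embeddings in `EMB`, the four toy documents, the helper names `nbow` and `wmd`, and the choice of average linkage are all assumptions made for illustration. The word mover’s distance is computed here as a small transport linear program over normalized bag-of-words weights, and the resulting pairwise distance matrix is passed to SciPy's hierarchical clustering.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical 2-D "embeddings"; a real setup would load pretrained vectors.
EMB = {
    "obama": np.array([1.0, 0.0]),
    "president": np.array([0.9, 0.1]),
    "speaks": np.array([0.0, 1.0]),
    "talks": np.array([0.1, 0.9]),
    "pizza": np.array([-1.0, -1.0]),
    "pasta": np.array([-0.9, -1.1]),
    "tasty": np.array([-1.0, 0.5]),
    "delicious": np.array([-0.9, 0.6]),
}


def nbow(tokens):
    """Return the distinct words of a document and their normalized counts."""
    words = sorted(set(tokens))
    weights = np.array([tokens.count(w) for w in words], dtype=float)
    return words, weights / weights.sum()


def wmd(doc_a, doc_b):
    """Word mover's distance: minimum cost of moving the word mass of doc_a to doc_b."""
    words_a, d_a = nbow(doc_a)
    words_b, d_b = nbow(doc_b)
    n, m = len(words_a), len(words_b)
    # Cost of moving one unit of mass between two words: Euclidean distance.
    cost = np.array([[np.linalg.norm(EMB[a] - EMB[b]) for b in words_b]
                     for a in words_a])
    # Flow matrix T (n x m) is flattened row-major into the LP variables.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0   # outgoing mass of word i equals d_a[i]
    for j in range(m):
        a_eq[n + j, j::m] = 1.0            # incoming mass of word j equals d_b[j]
    b_eq = np.concatenate([d_a, d_b])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun


docs = [
    ["obama", "speaks"],
    ["president", "talks"],
    ["pizza", "tasty"],
    ["pasta", "delicious"],
]

# Symmetric pairwise WMD matrix with a zero diagonal.
n_docs = len(docs)
dist = np.zeros((n_docs, n_docs))
for i in range(n_docs):
    for j in range(i + 1, n_docs):
        dist[i, j] = dist[j, i] = wmd(docs[i], docs[j])

# Agglomerative clustering on the precomputed distances (average linkage assumed).
Z = linkage(squareform(dist), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

In this toy setup the two politics-related documents share no words, and neither do the two food-related ones, so purely lexical matching would find no overlap within either pair. The embedding-based transport cost within each pair is small, which is what allows the agglomerative step to merge them.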

Corresponding author

Correspondence to Nigel Franciscus.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Franciscus, N., Ren, X., Wang, J., Stantic, B. (2019). Word Mover’s Distance for Agglomerative Short Text Clustering. In: Nguyen, N., Gaol, F., Hong, TP., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2019. Lecture Notes in Computer Science, vol 11431. Springer, Cham. https://doi.org/10.1007/978-3-030-14799-0_11

  • DOI: https://doi.org/10.1007/978-3-030-14799-0_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14798-3

  • Online ISBN: 978-3-030-14799-0

  • eBook Packages: Computer Science, Computer Science (R0)
