BOWL: Bag of Word Clusters Text Representation Using Word Embeddings

Rui, Weikang; Xing, Kai; Jia, Yawei

doi:10.1007/978-3-319-47650-6_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Text representation is fundamental to text mining and information retrieval. Bag of Words (BOW) and its variants (e.g., TF-IDF) are very basic text representation methods. Although BOW and TF-IDF are simple and perform well in tasks such as classification and clustering, their representation efficiency is extremely low. Moreover, they do not capture word-level semantic similarity, and as a result they fail to capture text-level similarity in many situations. In this paper, we propose a straightforward Bag Of Word cLusters (BOWL) representation that describes texts in a higher-level, much lower-dimensional space. We exploit word embeddings to group semantically close words and treat each group as a whole. The word embeddings are trained on a large corpus and incorporate extensive knowledge. We demonstrate, on three benchmark datasets and two tasks, that the BOWL representation shows significant advantages in representation accuracy and efficiency.
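The idea in the abstract (group semantically close words using their embeddings, then count clusters instead of individual words) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means is swapped in; the 2-D vectors in `EMBEDDINGS` are invented toy values standing in for embeddings trained on a large corpus; and the helper names (`kmeans`, `build_word_clusters`, `bowl_vector`) are hypothetical.

```python
from collections import Counter

# Toy 2-D "embeddings", invented for illustration only. Real embeddings
# would be pretrained on a large corpus and have far more dimensions.
EMBEDDINGS = {
    "cat": (0.9, 0.1), "dog": (1.0, 0.2), "kitten": (0.8, 0.0),
    "car": (0.1, 0.9), "truck": (0.0, 1.0), "bus": (0.2, 0.8),
}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Index of the centroid closest to `vector`."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(vector, centroids[i]))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain Lloyd's k-means (an assumed choice); returns final centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in vectors:
            groups[nearest_centroid(v, centroids)].append(v)
        # Recompute each centroid as the mean of its group; an empty
        # group keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def build_word_clusters(embeddings, initial_words):
    """Map every word to the id of its nearest cluster centroid."""
    centroids = kmeans(list(embeddings.values()),
                       [embeddings[w] for w in initial_words])
    return {w: nearest_centroid(v, centroids) for w, v in embeddings.items()}


def bowl_vector(tokens, word_to_cluster, n_clusters):
    """Count how often each word cluster occurs in a tokenized text.

    Tokens without an embedding are ignored in this sketch.
    """
    counts = Counter(word_to_cluster[t] for t in tokens if t in word_to_cluster)
    return [counts.get(i, 0) for i in range(n_clusters)]


# Seed the two clusters deterministically with one word each.
WORD_TO_CLUSTER = build_word_clusters(EMBEDDINGS, initial_words=["cat", "car"])
doc_a = bowl_vector("the cat chased the dog".split(), WORD_TO_CLUSTER, 2)
doc_b = bowl_vector("a kitten sat on the bus".split(), WORD_TO_CLUSTER, 2)
```

In this toy setting a BOW vector would need one dimension per vocabulary word (six here), while the cluster-count vector needs only one per cluster (two). The two documents share no words, yet both have a nonzero count in the cluster that contains "cat", "dog", and "kitten", which is the kind of word-level similarity a plain BOW representation misses.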

Author information

Authors and Affiliations

School of Computer Science, University of Science and Technology of China, Hefei, Anhui, China
Weikang Rui, Kai Xing & Yawei Jia

Corresponding author

Correspondence to Weikang Rui.

Editor information

Editors and Affiliations

University of Passau, Passau, Germany
Franz Lehner
University of Passau, Passau, Germany
Nora Fteimi

Cite this paper

Rui, W., Xing, K., Jia, Y. (2016). BOWL: Bag of Word Clusters Text Representation Using Word Embeddings. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_1

DOI: https://doi.org/10.1007/978-3-319-47650-6_1
Published: 05 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)
