Taxonomy-Augmented Features for Document Clustering

Seifollahi, Sattar; Piccardi, Massimo; Borzeshi, Ehsan Zare; Kruger, Bernie

doi:10.1007/978-981-13-6661-1_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 996)

Included in the following conference series:

Australasian Conference on Data Mining

Abstract

In document clustering, individual documents are typically represented by feature vectors based on term-frequency or bag-of-words models. However, such feature vectors intrinsically disregard the order of the words in the document and suffer from very high dimensionality. For these reasons, in this paper we present novel taxonomy-augmented features that enjoy two promising characteristics: (1) they leverage semantic word embeddings to take the word order into account, and (2) they reduce the feature dimensionality to a very manageable size. Our feature extraction approach consists of three main steps. First, we apply a word embedding technique to represent the words in a word embedding space. Second, we partition the word vocabulary into a hierarchy of clusters by applying k-means hierarchically. Lastly, the individual documents are projected onto the hierarchy and a compact feature vector is extracted. We propose two methods for generating the features. The first uses all the clusters in the hierarchy and results in a feature vector whose dimensionality is equal to the number of clusters. The second uses a small set of user-defined words and results in an even smaller feature vector whose dimensionality is equal to the size of the set. Numerical experiments on document clustering show that the proposed approach can achieve comparable or even higher accuracy than conventional feature vectors with a much more compact representation.
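
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional embeddings, the helper names (kmeans, build_hierarchy, doc_features), the branching factor, the depth, and the choice of normalised word counts as feature values are all assumptions made for illustration. It covers only the first feature variant (one dimension per cluster in the hierarchy).

```python
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means over tuples of floats; returns one label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty cluster's centre to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return labels

def build_hierarchy(embeddings, k=2, depth=2):
    """Step 2: recursively split the vocabulary; returns (level, word set) per cluster."""
    clusters = []

    def split(words, level):
        if level == depth or len(words) < k:
            return
        labels = kmeans([embeddings[w] for w in words], k)
        for c in range(k):
            child = [w for w, lab in zip(words, labels) if lab == c]
            if child:
                clusters.append((level, frozenset(child)))
                split(child, level + 1)

    split(sorted(embeddings), 0)
    return clusters

def doc_features(tokens, clusters):
    """Step 3: one value per cluster, the fraction of tokens falling in it (assumed weighting)."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return [sum(counts[w] for w in members) / total for _, members in clusters]

# Step 1 stand-in: hypothetical 2-D embeddings; a real pipeline would use trained vectors.
embeddings = {
    "car": (0.0, 0.1), "truck": (0.1, 0.0), "road": (0.2, 0.1),
    "doctor": (5.0, 5.1), "nurse": (5.1, 5.0), "clinic": (5.2, 5.1),
}
clusters = build_hierarchy(embeddings, k=2, depth=2)
features = doc_features("car truck road car nurse".split(), clusters)
print(len(clusters), features)
```

The feature vector has one entry per cluster, so its length depends on the branching factor and depth of the hierarchy rather than on the vocabulary size, which is the dimensionality reduction the abstract describes.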

Acknowledgement

This project was funded by the Capital Markets Cooperative Research Centre in combination with the Transport Accident Commission of Victoria. Acknowledgements and thanks to industry partner David Attwood (Lead Operational Management and Data Research). This research has received ethics approval from the University of Technology Sydney (UTS HREC REF NO. ETH16-0968).

Author information

Authors and Affiliations

University of Technology Sydney, Sydney, NSW, Australia
Sattar Seifollahi & Massimo Piccardi
Capital Markets Cooperative Research Centre (CMCRC), Sydney, NSW, Australia
Sattar Seifollahi & Ehsan Zare Borzeshi
Transport Accident Commission (TAC), Geelong, VIC, Australia
Sattar Seifollahi & Bernie Kruger

Corresponding author

Correspondence to Sattar Seifollahi.

Editor information

Editors and Affiliations

School of Computing and Mathematics, Charles Sturt University, Albury, NSW, Australia
Rafiqul Islam
University of Auckland, Auckland, New Zealand
Yun Sing Koh
CSIRO Scientific Computing, Canberra, Australia
Yanchang Zhao
Data Science and Engineering, Australian Taxation Office, Canberra, Australia
Graco Warwick
Department of Information Technology, University of Wollongong, Wollongong, NSW, Australia
David Stirling
School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, Australia
Chang-Tsun Li
School of Computing and Mathematics, Charles Sturt University, Bathurst, Australia
Zahidul Islam

Cite this paper

Seifollahi, S., Piccardi, M., Borzeshi, E.Z., Kruger, B. (2019). Taxonomy-Augmented Features for Document Clustering. In: Islam, R., et al. Data Mining. AusDM 2018. Communications in Computer and Information Science, vol 996. Springer, Singapore. https://doi.org/10.1007/978-981-13-6661-1_19

DOI: https://doi.org/10.1007/978-981-13-6661-1_19
Published: 16 February 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6660-4
Online ISBN: 978-981-13-6661-1
eBook Packages: Computer Science, Computer Science (R0)