Cluster Summarization with Dense Region Detection

Bigdeli, Elnaz; Mohammadi, Mahdi; Raahemi, Bijan; Matwin, Stan

doi:10.1007/978-3-319-25840-9_5

Part of the book series: Communications in Computer and Information Science (CCIS, volume 553)

Included in the following conference series:

International Joint Conference on Knowledge Discovery, Knowledge Engineering, and Knowledge Management

Abstract

This paper introduces a new approach to summarizing clusters: it finds the dense regions of each cluster and represents the cluster as a Gaussian Mixture Model (GMM). The GMM summary is compact, yet the original data can be regenerated from it with high accuracy. Unlike the classical representation of a cluster by a center and a radius, the proposed approach preserves the shape of the cluster as well as the distribution of its samples. Because a GMM is a parametric model whose key parameter is the number of Gaussian components, we propose a method to determine that number automatically. A GMM can summarize a cluster generated by any kind of clustering algorithm. Moreover, when a new sample is presented to the GMMs of the clusters, a membership value is calculated for each cluster, and the sample is assigned to the closest cluster according to these values. Summarizing clusters with GMMs offers several advantages with regard to accuracy, detection rate, memory efficiency, and time complexity. We evaluate the proposed method on a variety of datasets, both synthetic datasets and real datasets from the UCI repository. We examine the quality of the summarized clusters in terms of the Dunn, DB, SD, and SSD indexes, and compare it with that of the well-known ABACUS method. We also apply the proposed algorithm to anomaly detection, studying its false alarm and detection rates and comparing them with those of Negative Selection, Naïve models, and ABACUS. Furthermore, we compare the memory usage and processing time of the proposed algorithm with those of the other algorithms. The results show that our algorithm outperforms other well-known anomaly detection algorithms in terms of accuracy, detection rate, memory usage, and processing time.
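
The pipeline described in the abstract (summarize each cluster as a GMM, score a new sample against every summary, regenerate data from a summary) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the dense regions of each cluster have already been found, so each cluster is passed in as a list of regions; it also assumes diagonal covariances, and it takes "closest cluster" to mean the cluster with the highest mixture density. The names Component, summarize, membership, assign, and regenerate are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Component:
    weight: float  # share of the cluster's points that fall in this dense region
    mean: list     # per-dimension mean
    var: list      # per-dimension variance (diagonal covariance is an assumption)


def summarize(regions, min_var=1e-6):
    """Summarize one cluster, given as a list of dense regions (each a list
    of points), as a list of Gaussian components."""
    total = sum(len(pts) for pts in regions)
    gmm = []
    for pts in regions:
        n, dim = len(pts), len(pts[0])
        mean = [sum(p[d] for p in pts) / n for d in range(dim)]
        var = [max(sum((p[d] - mean[d]) ** 2 for p in pts) / n, min_var)
               for d in range(dim)]
        gmm.append(Component(n / total, mean, var))
    return gmm


def membership(gmm, x):
    """Mixture density of sample x under one cluster summary."""
    density = 0.0
    for c in gmm:
        log_pdf = 0.0
        for xi, m, v in zip(x, c.mean, c.var):
            log_pdf += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        density += c.weight * math.exp(log_pdf)
    return density


def assign(summaries, x):
    """Index of the cluster whose summary gives x the highest membership."""
    return max(range(len(summaries)),
               key=lambda k: membership(summaries[k], x))


def regenerate(gmm, n, rng):
    """Draw n synthetic points from a cluster summary."""
    weights = [c.weight for c in gmm]
    points = []
    for _ in range(n):
        c = rng.choices(gmm, weights=weights)[0]  # pick a component by weight
        points.append([rng.gauss(m, math.sqrt(v))
                       for m, v in zip(c.mean, c.var)])
    return points


# Toy data: cluster A has two dense regions, cluster B has one.
rng = random.Random(0)
cluster_a = [
    [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(50)],
    [[rng.gauss(2, 0.3), rng.gauss(0, 0.3)] for _ in range(50)],
]
cluster_b = [[[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(80)]]
summaries = [summarize(cluster_a), summarize(cluster_b)]

label = assign(summaries, [0.1, 0.0])        # sample near cluster A
synthetic = regenerate(summaries[0], 20, rng)  # 20 regenerated points for A
```

Each summary stores only a weight, a mean, and a variance vector per dense region, which is where the memory saving over keeping the raw points would come from in a scheme of this kind. A threshold on the largest membership value would be one hypothetical way to flag a sample as anomalous, though the abstract does not say how the authors do this.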

Acknowledgement

This research was supported by NSERC Canada, Grant No. RGPIN/341811-2012.

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
Elnaz Bigdeli
Knowledge Discovery and Data Mining Lab, Telfer School of Management, University of Ottawa, Ottawa, Canada
Mahdi Mohammadi & Bijan Raahemi
Department of Computing, Dalhousie University, Halifax, Canada
Stan Matwin
Polish Academy of Sciences, Warsaw, Poland
Stan Matwin

Corresponding author

Correspondence to Elnaz Bigdeli.

Editor information

Editors and Affiliations

Instituto de Telecomunicações, Lisboa, Portugal
Ana Fred
Delft University of Technology, Delft, Zuid-Holland, The Netherlands
Jan L. G. Dietz
University of Madeira, Funchal, Portugal
David Aveiro
Henley Business School, University of Reading, Reading, United Kingdom
Kecheng Liu
INSTICC, Setubal, Portugal
Joaquim Filipe

About this paper

Cite this paper

Bigdeli, E., Mohammadi, M., Raahemi, B., Matwin, S. (2015). Cluster Summarization with Dense Region Detection. In: Fred, A., Dietz, J., Aveiro, D., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2014. Communications in Computer and Information Science, vol 553. Springer, Cham. https://doi.org/10.1007/978-3-319-25840-9_5

DOI: https://doi.org/10.1007/978-3-319-25840-9_5
Published: 28 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25839-3
Online ISBN: 978-3-319-25840-9
eBook Packages: Computer Science, Computer Science (R0)
