A-BIRCH: Automatic Threshold Estimation for the BIRCH Clustering Algorithm

Lorbeer, Boris; Kosareva, Ana; Deva, Bersant; Softić, Dženan; Ruppel, Peter; Küpper, Axel

doi:10.1007/978-3-319-47898-2_18

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 529)

Included in the following conference series:

INNS Conference on Big Data

Abstract

Clustering algorithms have recently regained attention with the availability of large datasets and the rise of parallelized computing architectures. However, most clustering algorithms do not scale well with increasing dataset size and require proper parametrization to produce correct results. In this paper we present A-BIRCH, an approach for automatic threshold estimation for the BIRCH clustering algorithm using the Gap Statistic. This approach renders the global clustering step of BIRCH unnecessary and does not require prior knowledge of the expected number of clusters. This is achieved by analyzing a small representative subset of the data to extract attributes such as the cluster radius and the minimal cluster distance. These attributes are then used to compute a threshold that results, with high probability, in the correct clustering of elements. To analyze the representative subset, we parallelized the Gap Statistic to improve performance and ensure scalability.
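To make the role of the threshold concrete, the following is a minimal sketch of the threshold test at the core of BIRCH. It is a simplification, not the A-BIRCH estimation procedure: the CF-tree is replaced by a flat list of clusters, the threshold is applied to the cluster radius, and the names `ClusteringFeature` and `cluster_sequentially` are hypothetical. It relies only on the standard BIRCH clustering feature (point count, linear sum, sum of squared norms), from which the radius follows as R^2 = SS/N - ||LS/N||^2.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """BIRCH clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    linear_sum: List[float]
    squared_sum: float

    @classmethod
    def from_point(cls, point: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(point), sum(x * x for x in point))

    def merged(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Clustering features are additive, so merging is component-wise addition.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.linear_sum, other.linear_sum)],
            self.squared_sum + other.squared_sum,
        )

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]

    def radius(self) -> float:
        # R^2 = SS/N - ||LS/N||^2: mean squared distance of the points to the centroid.
        c = self.centroid()
        r2 = self.squared_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(r2, 0.0))  # clamp tiny negative values caused by rounding


def cluster_sequentially(points: Sequence[Sequence[float]],
                         threshold: float) -> List[ClusteringFeature]:
    """Flat simplification of BIRCH insertion (assumption: no CF-tree).

    Each point is absorbed by the cluster with the nearest centroid if the
    merged cluster's radius stays within the threshold; otherwise the point
    starts a new cluster.
    """
    clusters: List[ClusteringFeature] = []
    for p in points:
        cf = ClusteringFeature.from_point(p)
        if clusters:
            best = min(range(len(clusters)),
                       key=lambda i: math.dist(clusters[i].centroid(), p))
            candidate = clusters[best].merged(cf)
            if candidate.radius() <= threshold:
                clusters[best] = candidate
                continue
        clusters.append(cf)
    return clusters


# Two tight groups, one near (0, 0) and one near (10, 10).
points = [(0, 0), (0.2, 0), (10, 10), (0, 0.2), (10.2, 10), (10, 10.2)]
for t in (0.01, 1.0, 10.0):
    print(t, len(cluster_sequentially(points, t)))
```

On this toy data, a threshold that is too small splits every point into its own cluster, and one that is too large merges both groups, which is why a threshold derived from the cluster radius and the minimal cluster distance can make a separate global clustering step unnecessary.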

Author information

Authors and Affiliations

Service-centric Networking, Telekom Innovation Laboratories, Technische Universität Berlin, Berlin, Germany
Boris Lorbeer, Ana Kosareva, Bersant Deva, Dženan Softić, Peter Ruppel & Axel Küpper

Corresponding authors

Correspondence to Boris Lorbeer or Ana Kosareva.

Editor information

Editors and Affiliations

School of Computing and Communications, Lancaster University, Lancaster, United Kingdom
Plamen Angelov
Data Engineering Lab, Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
Lab of Forest Informatics (FiLAB), Democritus University of Thrace, Orestiada, Greece
Lazaros Iliadis
WPC Information Systems Faculty, Arizona State University, Tempe, Arizona, USA
Asim Roy
Electrical Engineering Dept, (ICA), Pontifical Catholic Univ of Rio de Janei, Rio de Janeiro, Rio de Janeiro, Brazil
Marley Vellasco

About this paper

Cite this paper

Lorbeer, B., Kosareva, A., Deva, B., Softić, D., Ruppel, P., Küpper, A. (2017). A-BIRCH: Automatic Threshold Estimation for the BIRCH Clustering Algorithm. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds) Advances in Big Data. INNS 2016. Advances in Intelligent Systems and Computing, vol 529. Springer, Cham. https://doi.org/10.1007/978-3-319-47898-2_18

DOI: https://doi.org/10.1007/978-3-319-47898-2_18
Published: 08 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47897-5
Online ISBN: 978-3-319-47898-2
eBook Packages: Engineering, Engineering (R0)
