Skip to main content

A-BIRCH: Automatic Threshold Estimation for the BIRCH Clustering Algorithm

  • Conference paper
  • First Online:
Advances in Big Data (INNS 2016)

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 529))

Included in the following conference series:

Abstract

Clustering algorithms are recently regaining attention with the availability of large datasets and the rise of parallelized computing architectures. However, most clustering algorithms do not scale well with increasing dataset sizes and require proper parametrization for correct results. In this paper we present A-BIRCH, an approach for automatic threshold estimation for the BIRCH clustering algorithm using Gap Statistic. This approach renders the global clustering step of BIRCH unnecessary and does not require knowledge on the expected number of clusters beforehand. This is achieved by analyzing a small representative subset of the data to extract attributes such as the cluster radius and the minimal cluster distance. These attributes are then used to compute a threshold that results, with high probability, in the correct clustering of elements. For the analysis of the representative subset we parallelized Gap Statistic to improve performance and ensure scalability.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 129.00
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 169.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Parzen, E., Tanabe, K., Kitagawa, G. (eds.) Selected Papers of Hirotugu Akaike, pp. 199–213. Springer, New York (1998)

    Chapter  Google Scholar 

  2. Burbeck, K., Nadjm-Tehrani, S.: Adaptive real-time anomaly detection with incremental clustering. Inf. Secur. Tech. Rep. 12(1), 56–67 (2007)

    Article  Google Scholar 

  3. Dash, M., Liu, H., Xu, X.: ‘\(1+1 > 2\)’: Merging distance and density based clustering. In: Proceedings of Seventh International Conference on Database Systems for Advanced Applications, 2001, pp. 32–39 (2001)

    Google Scholar 

  4. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.) 39(1), 1–38 (1977)

    MathSciNet  MATH  Google Scholar 

  5. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Second International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press (1996)

    Google Scholar 

  6. Ismael, N., Alzaalan, M., Ashour, W.: Improved multi threshold birch clustering algorithm. Int. J. Artif. Intell. Appl. Smart Devices 2(1), 1–10 (2014)

    Article  Google Scholar 

  7. Jordan, M.I., Bach, F.R.: Learning spectral clustering. In: Advances in Neural Information Processing Systems 16. MIT Press (2003)

    Google Scholar 

  8. Kumar, N.S.L.P., Satoor, S., Buck, I.: Fast parallel expectation maximization for gaussian mixture models on gpus using cuda. In: 11th IEEE International Conference on High Performance Computing and Communications, pp. 103–109 (2009)

    Google Scholar 

  9. Macqueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Math, Statistics, and Probability, vol. 1, pp. 281–297. University of California Press (1967)

    Google Scholar 

  10. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., Talwalkar, A.: MLlib: machine learning in apache spark. CoRR (2015)

    Google Scholar 

  11. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications Co., Shelter Island (2011)

    Google Scholar 

  12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  13. Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

    Article  MathSciNet  MATH  Google Scholar 

  14. Sugar, C.A.: Techniques for Clustering and Classification with Applications to Medical Problems. Ph.D. Dissertation, Department of Statistics, Stanford University (1998)

    Google Scholar 

  15. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. Roy. Stat. Soc. B (Stat. Methodol.) 63(2), 411–423 (2001)

    Article  MathSciNet  MATH  Google Scholar 

  16. Zechner, M., Granitzer, M.: Accelerating k-means on the graphics processor via cuda. In: First International Conference on Intensive Applications and Services, pp. 7–15 (2009)

    Google Scholar 

  17. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: a new data clustering algorithm and its applications. Data Min. Knowl. Disc. 1(2), 141–182 (1997)

    Article  Google Scholar 

  18. Zhou, B., Hansen, J.: Unsupervised audio stream segmentation and clustering via the bayesian information criterion. In: Proceedings of ISCLP-2000: International Conference of Spoken Language Processing, pp. 714–717 (2000)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Boris Lorbeer or Ana Kosareva .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Lorbeer, B., Kosareva, A., Deva, B., Softić, D., Ruppel, P., Küpper, A. (2017). A-BIRCH: Automatic Threshold Estimation for the BIRCH Clustering Algorithm. In: Angelov, P., Manolopoulos, Y., Iliadis, L., Roy, A., Vellasco, M. (eds) Advances in Big Data. INNS 2016. Advances in Intelligent Systems and Computing, vol 529. Springer, Cham. https://doi.org/10.1007/978-3-319-47898-2_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-47898-2_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47897-5

  • Online ISBN: 978-3-319-47898-2

  • eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics