Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Dong, Jianqiang; Wang, Fei; Yuan, Bo

doi:10.1007/978-3-642-41278-3_50

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8206)

Included in the following conference series:

International Conference on Intelligent Data Engineering and Automated Learning

Abstract

In this big data era, the capability of mining and analyzing large-scale datasets is imperative. As data become more abundant than ever before, data-driven methods are playing a critical role in areas such as decision support and business intelligence. In this paper, we demonstrate how state-of-the-art GPUs and the Dynamic Parallelism feature of the latest CUDA platform can bring significant benefits to BIRCH, one of the best-known clustering techniques for streaming data. Experimental results show that, on a number of benchmark problems, the GPU-accelerated BIRCH can be made up to 154 times faster than the CPU version, with good scalability and high accuracy. Our work suggests that massively parallel GPU computing is a promising and effective solution to the challenges of big data.
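The abstract does not describe the internals of the accelerated algorithm, so the following is only a minimal sketch of the standard BIRCH building block, the clustering feature (CF) summary, and not the authors' GPU implementation. It is written in plain Python rather than CUDA, and the names `ClusteringFeature` and `nearest_entry` are hypothetical, chosen for illustration. It shows why BIRCH suits streaming data: a cluster is summarized by a point count, a linear sum, and a sum of squared norms, so absorbing a point or merging two clusters is a constant-size update. The search for the closest entry evaluates each candidate independently, which is the kind of step a GPU could plausibly parallelize.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Standard BIRCH summary of a cluster: (N, LS, SS)."""
    n: int            # number of points absorbed
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared Euclidean norms of the points

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # CFs are additive: merging clusters just adds the three components.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root mean squared distance of the points to the centroid,
        # computed from the summary alone: sqrt(SS/N - ||LS/N||^2).
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


def nearest_entry(entries: Sequence[ClusteringFeature], x: Sequence[float]) -> int:
    """Index of the entry whose centroid is closest to x.

    Each distance is independent of the others, so this loop is
    data-parallel (an assumption about where a GPU could help).
    """
    def dist2(cf: ClusteringFeature) -> float:
        return sum((c - v) ** 2 for c, v in zip(cf.centroid(), x))

    return min(range(len(entries)), key=lambda i: dist2(entries[i]))


if __name__ == "__main__":
    a = ClusteringFeature.from_point([0.0, 0.0]).merge(
        ClusteringFeature.from_point([2.0, 0.0])
    )
    b = ClusteringFeature.from_point([10.0, 10.0])
    print(a.centroid(), a.radius(), nearest_entry([a, b], [9.0, 9.0]))
```

In a full BIRCH implementation these summaries are stored in a height-balanced CF tree and a point is absorbed only if the resulting radius stays under a threshold; both are omitted here to keep the sketch short.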

Author information

Authors and Affiliations

Intelligent Computing Lab, Division of Informatics, Graduate School at Shenzhen, Tsinghua University, Shenzhen, 518055, P.R. China
Jianqiang Dong, Fei Wang & Bo Yuan

Editor information

Editors and Affiliations

School of Electrical and Electronic Engineering, University of Manchester, UK
Hujun Yin
University of Science and Technology of China, Hefei, China
Ke Tang
Nanjing University, Nanjing, China
Yang Gao
Ostfalia University of Applied Sciences, 38302, Wolfenbüttel, Germany
Frank Klawonn
Kyungpook National University, 702-701, Buk-Gu, Daegu, Korea
Minho Lee
Nature Inspired Computational and Applications Laboratory, School of Computer Science and Technology, University of Science and Technology of China, 230027, Hefei, China
Thomas Weise
University of Science and Technology of China, 230017, Hefei, China
Bin Li
CERCIA, School of Computer Science, University of Birmingham, B15 2TT, Edgbaston, Birmingham, UK
Xin Yao

Cite this paper

Dong, J., Wang, F., Yuan, B. (2013). Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism. In: Yin, H., et al. (eds.) Intelligent Data Engineering and Automated Learning – IDEAL 2013. Lecture Notes in Computer Science, vol 8206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41278-3_50

DOI: https://doi.org/10.1007/978-3-642-41278-3_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41277-6
Online ISBN: 978-3-642-41278-3
eBook Packages: Computer Science (R0)