An SNN-DBSCAN Based Clustering Algorithm for Big Data

Pandey, Sriniwas; Samal, Mamata; Mohanty, Sraban Kumar

doi:10.1007/978-981-15-1081-6_11

Sriniwas Pandey¹⁸,
Mamata Samal¹⁹ &
Sraban Kumar Mohanty¹⁸

Part of the book series: Advances in Intelligent Systems and Computing ((AISC,volume 1082))

780 Accesses
3 Citations

Abstract

Clustering is a technique to partition data into different groups in such a way that data items in a group are more similar to each other than the data points in any other group. The assumption of infinite main memory is very usual while designing most of the clustering algorithms but this assumption fails when the size of data set starts increasing. In this scenario, data needs to be stored in the secondary memory and time spent in the input/outputs (I/O) dominates the actual computational time. Therefore by reducing the I/O, the efficiency of the clustering techniques can be improved. In this paper, one shared near neighbor based algorithm is devised by minimizing its I/O complexity to make it suitable for the Big Data in external memory model proposed by Aggarwal and Vitter. There is no change in the computational steps, hence cluster quality remains the same. We implement the algorithm in the STXXL library to show its efficacy for Big Data sets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 129.00; Price excludes VAT (USA)

Softcover Book: USD 169.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
https://en.wikipedia.org/?title=Binary_prefix.

References

Aggarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(9), 1116–1127 (1988)
Article MathSciNet Google Scholar
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012)
Article Google Scholar
Ball, G.H., Hall, D.J.: A clustering technique for summarizing multivariate data. Behav. Sci. 12(2), 153–155 (1967)
Article Google Scholar
Bezdek, J.C., Ehrlich, R., Full, W.: Fcm: the fuzzy c-means clustering algorithm. Comput. Geosci. 10(2), 191–203 (1984)
Article Google Scholar
Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic information retrieval. In: Proceedings of the Twenty-ninth Annual ACM Symposium on Theory of Computing, pp. 626–635. ACM (1997)
Google Scholar
Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for xxl data sets. Softw. Pract. Exp. 38(6), 589–637 (2008)
Google Scholar
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. R. Stat. Soc. Ser B (Methodol), 1–38 (1977)
Google Scholar
Ertoz, L., Steinbach, M., Kumar, V.: A new shared nearest neighbor clustering algorithm and its applications. In: 2nd International Conference on Data Mining, Clustering High Dimensional Data and its Applications, pp. 105–115. SIAM (2002)
Google Scholar
Ertöz, L., Steinbach, M., Kumar, V.: Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: SDM, pp. 47–58. SIAM (2003)
Google Scholar
Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: ACM SIGMOD Record, vol. 27, pp. 73–84. ACM (1998)
Google Scholar
Hinneburg, A., Keim, D.A.: An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of KDD’98, pp. 58–65 (1998)
Google Scholar
Hinneburg, A., Keim, D.A.: Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proceedings of the 25th International Conference on Very Large Data Bases VLDB’99, pp. 506–517 (1999)
Google Scholar
Januzaj, E., Kriegel, H.P., Pfeifle, M.: Dbdc: density based distributed clustering. In: Advances in Database Technology—EDBT 2004, Lecture Notes in Computer Science, vol. 2992, pp. 88–105 (2004)
Google Scholar
Jarvis, R.A., Patrick, E.A.: Clustering using a similarity measure based on shared near neighbors. IEEE Trans. Comput. C 22(11), 1025–1034 (1973)
Google Scholar
Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: A local search approximation algorithm for k-means clustering. In: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pp. 10–18. ACM (2002)
Google Scholar
Kim, W.: Parallel clustering algorithms: survey (2009). http://www.cs.gsu.edu/~wkim/indexfiles/SurveyParallelClustering.pdf
Liu, Y., Guo, Q., Yang, L., Li, Y.: Research on incremental clustering. In: 2nd International Conference on Consumer Electronics, Communications and Networks (CECNet), 2012, pp. 2803–2806 (April 2012)
Google Scholar
Moreira, G., Santos, M.Y., Moura-Pires, J.: SNN input parameters: how are they related? In: International Conference on Parallel and Distributed Systems (ICPADS), pp. 492–497. IEEE (2013)
Google Scholar
Ng, R.T., Jiawei, H.: CLARANS: a method for clustering objects for spatial data mining. IEEE Trans. Knowl. Data Eng. 14(5), 1003–1016 (2002)
Article Google Scholar
Olson, C.F.: Parallel algorithms for hierarchical clustering. Parallel Comput. 21(8), 1313–1325 (1995)
Article MathSciNet Google Scholar
Rokach, L., Maimon, O.: Clustering methods. In: Data Mining and Knowledge Discovery Handbook, pp. 321–352. Springer (2005)
Google Scholar
Wikipedia.: Approximation algorithm, online (2015). Accessed June 2015
Google Scholar
Xu, X., Ester, M., Kriegel, H.P., Sander, J.: A distribution-based clustering algorithm for mining in large spatial databases. In: Proceedings of 14th International Conference on Data Engineering, 1998, pp. 324–331. IEEE (1998)
Google Scholar
Yadav, P.K., Pandey, S., Samal, M., Mohanty, S.K.: Nearest neighbor-based clustering algorithm for large data sets. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds.) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, vol. 760. Springer, Singapore (2018)
Google Scholar
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: ACM SIGMOD Record, vol. 25, pp. 103–114. ACM (1996)
Google Scholar

Download references

Author information

Authors and Affiliations

Computer Science and Engineering, PDPM Indian Institute of Information Technology, Design and Manufacturing, Jabalpur, Jabalpur, 482005, Madhya Pradesh, India
Sriniwas Pandey & Sraban Kumar Mohanty
Jabalpur, India
Mamata Samal

Authors

Sriniwas Pandey
View author publications
You can also search for this author in PubMed Google Scholar
Mamata Samal
View author publications
You can also search for this author in PubMed Google Scholar
Sraban Kumar Mohanty
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sraban Kumar Mohanty .

Editor information

Editors and Affiliations

Department of Computer Science, Rama Devi Women’s University, Bhubaneswar, Odisha, India
Bibudhendu Pati
Department of Computer Science, Rama Devi Women’s University, Bhubaneswar, Odisha, India
Chhabi Rani Panigrahi
Cloud Computing, The University of Melbourne, Melbourne, VIC, Australia
Rajkumar Buyya
Department of Computer Science and Information Engineering, Providence University, Taichung, Taiwan
Kuan-Ching Li

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Pandey, S., Samal, M., Mohanty, S.K. (2020). An SNN-DBSCAN Based Clustering Algorithm for Big Data. In: Pati, B., Panigrahi, C., Buyya, R., Li, KC. (eds) Advanced Computing and Intelligent Engineering. Advances in Intelligent Systems and Computing, vol 1082. Springer, Singapore. https://doi.org/10.1007/978-981-15-1081-6_11

Download citation

DOI: https://doi.org/10.1007/978-981-15-1081-6_11
Published: 19 February 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-1080-9
Online ISBN: 978-981-15-1081-6
eBook Packages: EngineeringEngineering (R0)

Publish with us

Policies and ethics