An Efficient Partition-Repetition Approach in Clustering of Big Data

Karmakar, Bikram; Mukhopadhayay, Indranil

doi:10.1007/978-81-322-3628-3_5

Abstract

Clustering, i.e. splitting data into homogeneous groups in an unsupervised way, is one of the major challenges in big data analytics. The volume, variety and velocity associated with big data make this problem even more complex. Standard clustering techniques might fail due to this inherent complexity of the data cloud. Either existing methods must be adapted or novel methods developed to reach a reasonable solution without compromising performance, at least beyond a certain limit. In this article we discuss the salient features, major challenges and prospective solution paths for the problem of clustering big data. A discussion of the current state of the art reveals the existing problems and some solutions to them. The current paradigm and research work specific to the complexities in this area are outlined, with special emphasis on the characteristics of big data in this context. We develop an adaptation of a standard method that is more suitable for big data clustering when the data cloud is relatively regular with respect to its inherent features. We also discuss a novel method for some special types of data where it is more plausible and realistic to leave some data points as noise or scattered in the domain of the whole data cloud while a major portion forms different clusters. Our demonstration through simulations reveals the strength and feasibility of applying the proposed algorithm for practical purposes with very low computation time.

Author information

Authors and Affiliations

Department of Statistics, The Wharton School, University of Pennsylvania, Jon M. Huntsman Hall, Philadelphia, PA, 19104, USA
Bikram Karmakar
Human Genetics Unit, Indian Statistical Institute, Kolkata, 700 108, India
Indranil Mukhopadhayay

Corresponding author

Correspondence to Indranil Mukhopadhayay.

Editor information

Editors and Affiliations

Indian Institute of Public Health, Hyderabad, India
Saumyadipta Pyne
CRRao AIMSCS, University of Hyderabad Campus CRRao AIMSCS, Hyderabad, India
B.L.S. Prakasa Rao
CRRao AIMSCS, University of Hyderabad Campus CRRao AIMSCS, Hyderabad, India
S.B. Rao

About this chapter

Cite this chapter

Karmakar, B., Mukhopadhayay, I. (2016). An Efficient Partition-Repetition Approach in Clustering of Big Data. In: Pyne, S., Rao, B., Rao, S. (eds) Big Data Analytics. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3628-3_5

DOI: https://doi.org/10.1007/978-81-322-3628-3_5
Published: 13 October 2016
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-3626-9
Online ISBN: 978-81-322-3628-3
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)