An Efficient Partition-Repetition Approach in Clustering of Big Data

  • Chapter
Big Data Analytics

Abstract

Clustering, i.e. splitting data into homogeneous groups in an unsupervised way, is one of the major challenges in big data analytics. The volume, variety and velocity associated with big data make the problem even more complex, and standard clustering techniques may fail because of this inherent complexity of the data cloud. Reaching a reasonable solution without compromising performance beyond a certain limit requires either adapting existing methods or developing novel ones. In this article we discuss the salient features, major challenges and prospective solution paths for clustering big data. A discussion of the current state of the art reveals the existing problems and some solutions to them. We outline the current paradigm and the research specific to the complexities of this area, with special emphasis on the characteristics of big data in this context. We develop an adaptation of a standard method that is better suited to big data clustering when the data cloud is relatively regular with respect to its inherent features. We also discuss a novel method for special types of data in which it is more plausible and realistic to leave some data points as noise, scattered across the domain of the whole data cloud, while the major portion forms distinct clusters. Our simulations demonstrate the strength and feasibility of applying the proposed algorithm in practice, with very low computation time.
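
The abstract does not spell out the algorithm, so the following is only a rough sketch of one generic way to adapt a standard method (plain k-means) to data that is too large to cluster in one pass. It is not the authors' procedure. It splits the data into parts, clusters each part, then clusters the pooled part-level centroids. The function names (`kmeans`, `partition_then_cluster`), the parameter `n_parts`, and the choice of k-means itself are all assumptions made for illustration.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist2(p, centroids[j]))
            groups[nearest].append(p)
        # Keep the old centroid if a group happens to be empty.
        centroids = [mean(g) if g else centroids[j] for j, g in enumerate(groups)]
    return centroids

def partition_then_cluster(points, k, n_parts=4, seed=0):
    """Hypothetical two-stage scheme (an assumption, not the chapter's method):
    1. shuffle and split the data into n_parts parts,
    2. run k-means on each part separately,
    3. run k-means again on the pooled part-level centroids,
    4. label every point by its nearest final centroid.
    """
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    local = [c for part in parts for c in kmeans(part, k, seed=seed)]
    final = kmeans(local, k, seed=seed)
    labels = [min(range(k), key=lambda j: dist2(p, final[j])) for p in points]
    return final, labels
```

Each part is clustered independently, so in this sketch the per-part step could run in parallel or on data that does not fit in memory at once. The second stage works on only `k * n_parts` points. The sketch leaves no points unassigned, so it does not reflect the noise-tolerant method that the abstract also mentions.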

Correspondence to Indranil Mukhopadhayay.

Copyright information

© 2016 Springer India

Cite this chapter

Karmakar, B., Mukhopadhayay, I. (2016). An Efficient Partition-Repetition Approach in Clustering of Big Data. In: Pyne, S., Rao, B., Rao, S. (eds) Big Data Analytics. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3628-3_5
