An Efficient Partition-Repetition Approach in Clustering of Big Data

  • Bikram Karmakar
  • Indranil Mukhopadhyay


Clustering, i.e. splitting data into homogeneous groups in an unsupervised way, is one of the major challenges in big data analytics. The volume, variety and velocity associated with big data make the problem even more complex, and standard clustering techniques may fail because of this inherent complexity of the data cloud. Reaching a reasonable solution without compromising performance beyond an acceptable limit requires either adapting existing methods or developing novel ones. In this article we discuss the salient features, major challenges and prospective solution paths for clustering big data. A review of the current state of the art highlights the existing problems and some available solutions. We outline the current paradigm and the research specific to the complexities of this area, with special emphasis on the characteristics of big data. We develop an adaptation of a standard method that is better suited to big data clustering when the data cloud is relatively regular with respect to its inherent features. We also discuss a novel method for special types of data in which it is more plausible and realistic to leave some data points as noise, scattered over the domain of the whole data cloud, while the major portion forms distinct clusters. Simulations demonstrate the strength and feasibility of applying the proposed algorithm in practice, with very low computation time.


Cluster Algorithm, Localize Algorithm, Data Cloud, Rand Index, Tight Cluster



Copyright information

© Springer India 2016

Authors and Affiliations

  1. Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, USA
  2. Human Genetics Unit, Indian Statistical Institute, Kolkata, India
