Advertisement

Data Mining and Knowledge Discovery

, Volume 32, Issue 4, pp 1121–1176 | Cite as

Anytime parallel density-based clustering

  • Son T. Mai
  • Ira Assent
  • Jon Jacobsen
  • Martin Storgaard Dieu
Article
  • 288 Downloads

Abstract

The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, DBSCAN requires neighborhood queries for all objects and propagation of labels from object to object. This scheme is time consuming and thus limits its applicability for large datasets. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. Moreover, instead of passively performing range queries for all objects as in existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still satisfying the cluster definition of DBSCAN. Moreover, by processing queries in block and merging the results into the current cluster structure, AnyDBC can be efficiently parallelized on shared memory architectures to further accelerate the performance, uniquely making it a parallel and anytime technique at the same time. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants as well as a high parallel scalability on multicore processors for very large real and synthetic complex datasets.

Keywords

Density-based clustering Anytime clustering Active clustering Parallel processing Multicore processors 

Notes

Acknowledgements

We thank Prof. Yufei Tao for providing us with the binary file of DBSCANR and the authors of PDSDBSCAN for making their source code available for download. Our special thanks to the anonymous reviewers for their invaluable comments. We appreciate the support and discussion with Sihem Amer-Yahia, Diep Phan, Jon Jacobsen, Jesper Kristensen, Ky Nguyen, Kenneth Bøgh, Sean Chester, and Manuel Ciosici during the preparation of this paper. Part of this research was funded by a Villum postdoc fellowship.

References

  1. Aggarwal CC, Reddy CK (eds) (2014) Data clustering: algorithms and applications. CRC Press, Boca RatonzbMATHGoogle Scholar
  2. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data (SIGMOD), pp 49–60Google Scholar
  3. Arlia D, Coppola M (2001) Experiments in parallel clustering with DBSCAN. In: International Euro-par conference, pp 326–331Google Scholar
  4. Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: International conference on database systems for advanced applications (DASFAA) (1), pp 228–242Google Scholar
  5. Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517MathSciNetCrossRefzbMATHGoogle Scholar
  6. Böhm C, Noll R, Plant C, Wackersreuther B (2009) Density-based clustering using graphics processors. In: International conference on information and knowledge management (CIKM), pp 661–670Google Scholar
  7. Böhm C, Feng J, He X, Mai ST, Plant C, Shao J (2011) A novel similarity measure for fiber clustering using longest common subsequence. In: Proceedings of the 2011 workshop on data mining for medicine and healthcare, pp 1–9Google Scholar
  8. Borah B, Bhattacharyya DK (2004) An improved sampling-based DBSCAN for large spatial databases. In: International conference on intelligent sensing and information processing (ICISIP), pp 92–96Google Scholar
  9. Brecheisen S, Kriegel H, Pfeifle M (2004) Efficient density-based clustering of complex objects. In: IEEE international conference on data mining (ICDM), pp 43–50Google Scholar
  10. Brecheisen S, Kriegel H, Pfeifle M (2006a) Parallel density-based clustering of complex objects. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 179–188Google Scholar
  11. Brecheisen S, Kriegel HP, Pfeifle M (2006b) Multi-step density-based clustering. Knowl. Inf Syst 9(3):284–308Google Scholar
  12. Chen L, Ng RT (2004) On the marriage of Lp-norms and edit distance. In: Very large data bases (VLDB), pp 792–803Google Scholar
  13. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, CambridgezbMATHGoogle Scholar
  14. Dai BR, Lin IC (2012) Efficient map/reduce-based DBSCAN algorithm with optimized data partition. In: IEEE international conference on cloud computing (CLOUD), pp 59–66Google Scholar
  15. Dash M, Liu H, Xu X (2001) ‘\(1 + 1 > 2\)’: Merging distance and density based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 32–39Google Scholar
  16. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 226–231Google Scholar
  17. Fonollosa J, Sheik S, Huerta R, Marco S (2015) Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors Actuators B: Chem 215:618–629CrossRefGoogle Scholar
  18. Francis Z, Villagrasa C, Clairand I (2011) Simulation of DNA damage clustering after proton irradiation using an adapted DBSCAN algorithm. Comput Methods Programs Biomed 101(3):265–270CrossRefGoogle Scholar
  19. Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: International conference on management of data (SIGMOD), pp 519–530Google Scholar
  20. Gan J, Tao Y (2017) On the hardness and approximation of Euclidean DBSCAN. ACM Trans Database Syst 42(3):14:1–14:45MathSciNetCrossRefGoogle Scholar
  21. Götz M, Bodenstein C, Riedel M (2015) HPDBSCAN: highly parallel DBSCAN. In: Proceedings of the workshop on machine learning in high-performance computing environments, pp 2:1–2:10Google Scholar
  22. Greiner J (1994) A comparison of parallel algorithms for connected components. In: Proceedings of the 6th annual ACM symposium on parallel algorithms and architectures (SSPA), pp 16–25Google Scholar
  23. Gunawan A (2013) A faster algorithm for DBSCAN. Msc thesis, TU EindhovenGoogle Scholar
  24. He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: International conference on parallel and distributed systems (ICPADS), pp 473–480Google Scholar
  25. Januzaj E, Kriegel HP, Pfeifle M (2004) Scalable density-based distributed clustering. In: European conference on principles of data mining and knowledge discovery (PKDD), pp 231–244Google Scholar
  26. Kobayashi T, Iwamura M, Matsuda T, Kise K (2013) An anytime algorithm for camera-based character recognition. In: International conference on document analysis and recognition (ICDAR), pp 1140–1144Google Scholar
  27. Kriegel HP, Pfeifle M (2005) Density-based clustering of uncertain data. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 672–677Google Scholar
  28. Kriegel H, Schubert E, Zimek A (2017) The (black) art of runtime evaluation: are we comparing algorithms or implementations? Knowl Inf Syst 52(2):341–378CrossRefGoogle Scholar
  29. Kumar V (2002) Introduction to parallel computing, 2nd edn. Addison-Wesley, BostonGoogle Scholar
  30. Li T, Heinis T, Luk W (2016) Hashing-based approximate DBSCAN. In: Symposium on advances in databases and information systems (ADBIS), pp 31–45Google Scholar
  31. Li T, Heinis T, Luk W (2017) ADvaNCE—efficient and scalable approximate density-based clustering based on hashing. Informatica 28(1):105–130CrossRefGoogle Scholar
  32. Lulli A, Dell’Amico M, Michiardi P, Ricci L (2016) NG-DBSCAN: scalable density-based clustering for arbitrary data. Proc VLDB Endow (PVLDB) 10(3):157–168CrossRefGoogle Scholar
  33. Mahran S, Mahar K (2008) Using grid for accelerating density-based clustering. In: IEEE international conference on computer and information technology (CIT), pp 35–40Google Scholar
  34. Mai ST, Goebl S, Plant C (2012) A similarity model and segmentation algorithm for white matter fiber tracts. In: IEEE international conference on data mining (ICDM), pp 1014–1019Google Scholar
  35. Mai ST, He X, Feng J, Böhm C (2013a) Efficient anytime density-based clustering. In: SIAM international conference on data mining (SDM), pp 112–120Google Scholar
  36. Mai ST, He X, Hubig N, Plant C, Böhm C (2013b) Active density-based clustering. In: IEEE international conference on data mining (ICDM), pp 508–517Google Scholar
  37. Mai ST, He X, Feng J, Plant C, Böhm C (2015) Anytime density-based clustering of complex data. Knowl Inf Syst 45(2):319–355CrossRefGoogle Scholar
  38. Mai ST, Assent I, Le A (2016a) Anytime OPTICS: an efficient approach for hierarchical density-based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 164–179Google Scholar
  39. Mai ST, Assent I, Storgaard M (2016b) AnyDBC: an efficient anytime density-based clustering algorithm for very large complex datasets. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1025–1034Google Scholar
  40. Mai ST, Dieu MS, Assent I, Jacobsen J, Kristensen J, Birk M (2017) Scalable and interactive graph clustering algorithm on multicore CPUs. In: IEEE international conference on data engineering (ICDE), pp 349–360Google Scholar
  41. Patwary MMA, Palsetia D, Agrawal A, Liao W, Manne F, Choudhary AN (2012) A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), p 62Google Scholar
  42. Patwary MMA, Satish N, Sundaram N, Manne F, Habib S, Dubey P (2014) Pardicle: parallel approximate density-based clustering. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), pp 560–571Google Scholar
  43. Reiss A, Stricker D (2012) Introducing a new benchmarked dataset for activity monitoring. In: International symposium on wearable computers (ISWC), pp 108–109Google Scholar
  44. Sakai T, Tamura K, Kitakami H (2017) Cell-based DBSCAN algorithm using minimum bounding rectangle criteria. In: International conference on database systems for advanced applications (DASFAA), pp 133–144Google Scholar
  45. Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194CrossRefGoogle Scholar
  46. Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19:1–19:21MathSciNetCrossRefGoogle Scholar
  47. Tramacere A, Vecchio C (2013) \(\gamma \)-Ray DBSCAN: a clustering algorithm applied to fermi-LAT \(\gamma \)-ray data—I. Detection performances with real and simulated data. Astron Astrophys 549:A138CrossRefGoogle Scholar
  48. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: International conference on machine learning (ICML), pp 1073–1080Google Scholar
  49. Wang X, Hamilton HJ (2003) DBRS: a density-based spatial clustering method with random sampling. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 563–575Google Scholar
  50. Xu X, Jäger J, Kriegel HP (1999) A fast parallel clustering algorithm for large spatial databases. Data Min Knowl Discov 3(3):263–290CrossRefGoogle Scholar
  51. Zaki MJ, M W Jr (2014) Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, New YorkCrossRefGoogle Scholar
  52. Zhao W, Hopke PK, Prather KA (2008) Comparison of two cluster analysis methods using single particle mass spectra. Atmos Environ 42(5):881–892CrossRefGoogle Scholar
  53. Zhou S, Zhou A, Cao J, Jin W, Fan Y, Hu Y (2000) Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 169–172Google Scholar
  54. Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73–83Google Scholar

Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Son T. Mai
    • 1
  • Ira Assent
    • 1
  • Jon Jacobsen
    • 1
  • Martin Storgaard Dieu
    • 1
  1. 1.Aarhus UniversityAarhusDenmark

Personalised recommendations