Dboost: A Fast Algorithm for DBSCAN-based Clustering on High Dimensional Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

DBSCAN is a classic density-based clustering technique, well known for discovering clusters of arbitrary shape and for handling noise. However, its density calculation is very time-consuming on high dimensional data, which makes it inefficient in many areas, such as multi-document summarization and product recommendation. Efficiently calculating density on high dimensional data is therefore a key issue for DBSCAN-based clustering techniques. In this paper, we propose Dboost, a fast algorithm for DBSCAN-based clustering on high dimensional data. In our algorithm, \(WAND^\#\), an adaptation of a ranked retrieval technique, is applied to speed up the density calculations without loss of accuracy, and we further improve this acceleration by reducing the number of times \(WAND^\#\) is invoked. Experiments were conducted on wire voltage data, the Netflix dataset and microblog corpora. The results showed that a speedup of over 50 times was achieved on the wire voltage data and the Netflix dataset, and a speedup of more than 100 times can be expected on microblog data.
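To make the cost of the density calculation concrete, the following is a minimal Python sketch, not the Dboost or \(WAND^\#\) algorithm itself. It assumes sparse points stored as `{dimension: weight}` dictionaries, dot-product similarity, and a positive similarity threshold; the function names and the toy data are hypothetical. It compares a brute-force density count with one that uses an inverted index to score only points sharing a non-zero dimension. Ranked retrieval techniques in the WAND family go further by using score upper bounds to skip candidates, which this sketch does not attempt.

```python
from collections import defaultdict


def dot(a, b):
    """Dot product of two sparse vectors stored as {dimension: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b[d] for d, w in a.items() if d in b)


def density_brute_force(points, threshold):
    """For every point, count the points (itself included) whose similarity
    reaches the threshold. Compares all pairs, so the cost is quadratic."""
    return [sum(1 for q in points if dot(p, q) >= threshold) for p in points]


def density_inverted_index(points, threshold):
    """Same counts, but only points sharing a non-zero dimension are scored.
    Assumes threshold > 0, so points with no shared dimension (similarity 0)
    can never qualify and are safely skipped."""
    index = defaultdict(list)  # dimension -> ids of points that use it
    for i, p in enumerate(points):
        for d in p:
            index[d].append(i)
    counts = []
    for p in points:
        candidates = {i for d in p for i in index[d]}
        counts.append(sum(1 for i in candidates if dot(p, points[i]) >= threshold))
    return counts


# Hypothetical toy data: four sparse points over four dimensions.
points = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 2: 0.5},
    {3: 2.0},
    {1: 0.2, 3: 1.0},
]

print(density_brute_force(points, 0.9))
print(density_inverted_index(points, 0.9))
```

Both functions return identical counts, so the index changes only how many pairs are scored, not the clustering result. That is the same "no accuracy loss" property the abstract claims for its own, more sophisticated, acceleration.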

This work was supported by the Natural Science Foundation of China (Grant Nos. 61572043, 61300003, 61502115), the State Grid Basic Research Program (DZ71-15-004), and the Fundamental Research Funds for the Central Universities (Grant No. 3262014T75).

Corresponding author

Correspondence to Bingyang Li.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhang, Y., Wang, X., Li, B., Chen, W., Wang, T., Lei, K. (2016). Dboost: A Fast Algorithm for DBSCAN-based Clustering on High Dimensional Data. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_20

  • DOI: https://doi.org/10.1007/978-3-319-31750-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31749-6

  • Online ISBN: 978-3-319-31750-2

  • eBook Packages: Computer Science (R0)
