Dboost: A Fast Algorithm for DBSCAN-based Clustering on High Dimensional Data

Zhang, Yuxiao; Wang, Xiaorong; Li, Bingyang; Chen, Wei; Wang, Tengjiao; Lei, Kai

doi:10.1007/978-3-319-31750-2_20

Conference paper
First Online: 12 April 2016

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9652)

Abstract

DBSCAN is a classic density-based clustering technique, well known for discovering clusters of arbitrary shape and for handling noise. However, its density calculation is very time-consuming on high dimensional data, which makes it inefficient in many areas, such as multi-document summarization and product recommendation. Efficient density calculation on high dimensional data is therefore a key issue for DBSCAN-based clustering techniques. In this paper, we propose a fast algorithm for DBSCAN-based clustering on high dimensional data, named Dboost. In our algorithm, an adaptation of a ranked retrieval technique, named \(WAND^\#\), is applied in a novel way to speed up the density calculations without loss of accuracy, and we further improve this acceleration by reducing the number of times \(WAND^\#\) is invoked. Experiments were conducted on wire voltage data, the Netflix dataset, and microblog corpora. The results show that an acceleration of over 50 times was achieved on the wire voltage data and the Netflix dataset, and an acceleration of more than 100 times can be expected on microblog data.
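To make the cost of the density calculation concrete, the following is a minimal sketch of the brute-force baseline that DBSCAN-style methods rely on, not the Dboost algorithm or \(WAND^\#\) itself. The function names `region_query` and `core_points`, the Euclidean distance, and the toy data are assumptions made for illustration. Each query scans all n points across all d dimensions, so computing every point's density costs on the order of n² · d operations, which is the step the abstract describes as the bottleneck.

```python
from math import sqrt


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i].

    The query point itself is included, as in the usual DBSCAN definition.
    This is a full linear scan: O(n * d) per call.
    """
    p = points[i]
    neighbors = []
    for j, q in enumerate(points):
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        if dist <= eps:
            neighbors.append(j)
    return neighbors


def core_points(points, eps, min_pts):
    """Return indices of points whose eps-neighborhood holds at least min_pts points."""
    return [
        i
        for i in range(len(points))
        if len(region_query(points, i, eps)) >= min_pts
    ]


# Hypothetical toy data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
cores = core_points(points, eps=1.5, min_pts=3)
```

With these toy values, the three nearby points are dense enough to be core points and the isolated point is not. An accelerated method would need to return the same neighborhoods while examining far fewer candidate points per query.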

This work was supported by the Natural Science Foundation of China (Grant No. 61572043, 61300003, 61502115), the State Grid Basic Research Program (DZ71-15-004), and the Fundamental Research Funds for the Central Universities (Grant No. 3262014T75).

Author information

Authors and Affiliations

School of Electronics and Computer Engineering (ECE), Peking University, Shenzhen, 518055, China
Yuxiao Zhang, Tengjiao Wang & Kai Lei
Technology and Strategy Research Center, China Electric Power Research Institute, Beijing, 100192, China
Xiaorong Wang
School of Information Science and Technology, University of International Relations, Beijing, 100091, China
Bingyang Li
Key Laboratory of High Confidence Software Technologies, Peking University, Ministry of Education, Beijing, 100871, China
Yuxiao Zhang, Wei Chen & Tengjiao Wang
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Wei Chen & Tengjiao Wang

Corresponding author

Correspondence to Bingyang Li.

Editor information

Editors and Affiliations

The University of Melbourne, Melbourne, Victoria, Australia
James Bailey
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Osaka University, Osaka, Japan
Takashi Washio
University of Auckland, Auckland, New Zealand
Gill Dobbie
Shenzhen University, Shenzhen, China
Joshua Zhexue Huang
Massey University, Auckland, New Zealand
Ruili Wang

About this paper

Cite this paper

Zhang, Y., Wang, X., Li, B., Chen, W., Wang, T., Lei, K. (2016). Dboost: A Fast Algorithm for DBSCAN-based Clustering on High Dimensional Data. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_20

DOI: https://doi.org/10.1007/978-3-319-31750-2_20
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)
