Abstract
DBSCAN is a classic density-based clustering technique, well known for discovering clusters of arbitrary shape and for handling noise. However, its density calculation is very time-consuming on high-dimensional data, which makes it inefficient in many applications, such as multi-document summarization and product recommendation. Efficient density calculation on high-dimensional data is therefore a key issue for DBSCAN-based clustering. In this paper, we propose Dboost, a fast algorithm for DBSCAN-based clustering on high-dimensional data. In our algorithm, \(WAND^\#\), an adaptation of a ranked retrieval technique, is applied in a novel way to speed up the density calculations without loss of accuracy, and we further improve the speedup by reducing the number of \(WAND^\#\) invocations. Experiments were conducted on wire voltage data, the Netflix dataset, and microblog corpora. The results show a speedup of over 50 times on the wire voltage data and the Netflix dataset, and a speedup of more than 100 times can be expected on microblog data.
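For context, the density calculation that the abstract identifies as the bottleneck can be sketched as a brute-force neighborhood count. The following is a minimal illustration of standard DBSCAN core-point detection under Euclidean distance; it is not the Dboost algorithm and does not implement \(WAND^\#\). The names `region_query`, `core_points`, `eps`, and `min_pts` are illustrative assumptions, as is the toy data.

```python
import math


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i].

    Brute force: every call scans all n points and touches all d
    coordinates of each, so one call costs O(n * d).
    """
    p = points[i]
    return [j for j, q in enumerate(points) if math.dist(p, q) <= eps]


def core_points(points, eps, min_pts):
    """Flag each point as a core point if its eps-neighborhood
    (including the point itself) contains at least min_pts points.

    Calling region_query once per point gives O(n^2 * d) overall.
    """
    return [
        len(region_query(points, i, eps)) >= min_pts
        for i in range(len(points))
    ]


# Toy data (assumed for illustration): three nearby points and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
flags = core_points(points, eps=1.5, min_pts=3)
```

Each point is compared against every other point across all dimensions, which is why the cost grows quickly with both dataset size and dimensionality, and why reducing the work per neighborhood query is the natural target for acceleration.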
This work was supported by the Natural Science Foundation of China (Grant Nos. 61572043, 61300003, 61502115), the State Grid Basic Research Program (DZ71-15-004), and the Fundamental Research Funds for the Central Universities (Grant No. 3262014T75).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhang, Y., Wang, X., Li, B., Chen, W., Wang, T., Lei, K. (2016). Dboost: A Fast Algorithm for DBSCAN-based Clustering on High Dimensional Data. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_20
DOI: https://doi.org/10.1007/978-3-319-31750-2_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science, Computer Science (R0)