Abstract
The PreDeCon clustering algorithm finds arbitrarily shaped clusters in high-dimensional feature spaces, a problem that remains an active research topic with many potential applications. However, it suffers from poor runtime performance and offers no means of user interaction. Our new method, AnyPDC, addresses both problems by casting PreDeCon as an anytime algorithm: it quickly produces an approximate result and then iteratively refines it until it reaches the result of PreDeCon. AnyPDC not only significantly speeds up PreDeCon clustering but also allows users to interact with the algorithm during its execution. Moreover, by maintaining an underlying cluster structure consisting of so-called primitive clusters and by processing neighborhood queries in blocks, AnyPDC can be executed efficiently in parallel on shared-memory architectures such as multi-core processors. Experiments on large real-world datasets show that AnyPDC produces high-quality approximate results early on, yielding speedups of orders of magnitude over PreDeCon. Furthermore, although anytime techniques are usually slower than their batch counterparts, AnyPDC is actually faster than PreDeCon even when run to completion. AnyPDC also scales well with the number of threads on multi-core CPUs.
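The anytime idea described in the abstract can be illustrated with a minimal sketch. This is not AnyPDC: it assumes plain Euclidean eps-neighborhoods rather than PreDeCon's subspace preference weighting, it does not build primitive clusters, and every name in it (`anytime_density_clustering`, `block_size`, and so on) is hypothetical. It only shows the general pattern of issuing neighborhood queries in blocks and exposing a usable intermediate clustering after each block.

```python
from math import dist


def anytime_density_clustering(points, eps, min_pts, block_size):
    """Yield a label list after every block of neighborhood queries.

    Label -1 means "noise or not yet reached"; later snapshots refine earlier ones.
    Assumption: plain Euclidean distance, no subspace preferences.
    """
    n = len(points)
    parent = list(range(n))   # union-find forest over point indices
    neighbors = {}            # index -> eps-neighborhood, filled block by block

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def is_core(i):
        return i in neighbors and len(neighbors[i]) >= min_pts

    for start in range(0, n, block_size):
        block = range(start, min(start + block_size, n))
        # One block of range queries; the queries do not depend on each other.
        for i in block:
            neighbors[i] = [j for j in range(n)
                            if dist(points[i], points[j]) <= eps]
        # Merge known core points that lie in each other's neighborhood.
        for i in block:
            if is_core(i):
                for j in neighbors[i]:
                    if is_core(j):
                        parent[find(i)] = find(j)
        # Snapshot: cores take their component id, other points join a nearby core.
        labels = [-1] * n
        for i in neighbors:
            if is_core(i):
                root = find(i)
                for j in neighbors[i]:
                    if labels[j] == -1 or is_core(j):
                        labels[j] = root
        yield labels


# Usage: inspect each intermediate result, or stop early once it is good enough.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
snapshots = list(anytime_density_clustering(data, eps=1.5, min_pts=3, block_size=2))
```

Because the queries inside one block are independent, that inner loop is the natural place to parallelize in a sketch of this kind; a caller can also break out of the generator at any point and keep the last snapshot.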
Notes
Since Ideal ignores the cluster expansion process of PreDeCon, its runtime is obviously lower than that of PreDeCon itself.
Acknowledgments
We especially thank the anonymous reviewers for their helpful comments. Part of this research was funded by a Villum postdoc fellowship, the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2015.10, and the CDP Life Project.
Copyright information
© 2018 Springer-Verlag GmbH Germany, part of Springer Nature
About this chapter
Cite this chapter
Pham, T.H. et al. (2018). Interactive Exploration of Subspace Clusters on Multicore Processors. In: Hameurlain, A., Wagner, R., Benslimane, D., Damiani, E., Grosky, W. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXIX. Lecture Notes in Computer Science, vol 11310. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58415-6_6
DOI: https://doi.org/10.1007/978-3-662-58415-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58414-9
Online ISBN: 978-3-662-58415-6
eBook Packages: Computer Science, Computer Science (R0)