Abstract
Because computing resources are limited, it is often impractical to analyze massive datasets directly. Instead, a set of representatives can be extracted to approximate the spatial distribution of the data objects. Standard data mining algorithms are then run on these selected points only, which typically account for a small fraction of the original data, significantly reducing computation time. In practice, the boundary points of data clusters can be regarded as a compact and effective representation of the original data, with great potential in clustering, outlier or anomaly detection, and classification. Consequently, reliably identifying a set of effective boundary points in a complex dataset poses a new challenge in data mining. In this paper, we present a boundary extraction technique similar to the method in SCUBI (Scalable Clustering Using Boundary Information). The key difference is that our technique exploits the clustering information in a feedback loop to further refine the boundary. Experimental results show that our technique is more robust and produces more representative boundary points than SCUBI, especially on complex datasets whose clusters differ widely in density.
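To make the idea of boundary points as a compact representation concrete, the following is a minimal sketch of a generic neighbour-asymmetry heuristic. It is not the technique of this paper, nor the SCUBI method, neither of which is specified in the abstract. The function names `boundary_scores` and `extract_boundary`, and the parameters `k` and `fraction`, are assumptions introduced purely for illustration. The intuition is that an interior point has neighbours on all sides, whereas a boundary point has its neighbours mostly on one side.

```python
import numpy as np

def boundary_scores(points, k=10):
    """Score each point by how one-sided its k nearest neighbours are.

    Illustrative heuristic only (not the paper's algorithm). The score is
    the length of the mean unit vector towards the k nearest neighbours:
    close to 0 for interior points, close to 1 for boundary points.
    """
    points = np.asarray(points, dtype=float)
    # Brute-force pairwise distances; O(n^2) memory, fine for small demos.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    nn = np.argsort(dists, axis=1)[:, :k]     # indices of the k nearest neighbours
    vecs = points[nn] - points[:, None, :]    # vectors from each point to its neighbours
    norms = np.linalg.norm(vecs, axis=2, keepdims=True)
    units = vecs / np.maximum(norms, 1e-12)   # guard against duplicate points
    return np.linalg.norm(units.mean(axis=1), axis=1)

def extract_boundary(points, k=10, fraction=0.2):
    """Return indices of the highest-scoring points (assumed to be boundary points)."""
    scores = boundary_scores(points, k)
    n_keep = max(1, int(round(fraction * len(scores))))
    return np.argsort(scores)[::-1][:n_keep]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two clusters with very different densities, echoing the inhomogeneous setting.
    dense = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(400, 2))
    sparse = rng.normal(loc=(5.0, 5.0), scale=1.5, size=(100, 2))
    data = np.vstack([dense, sparse])
    idx = extract_boundary(data, k=10, fraction=0.2)
    print(f"kept {len(idx)} of {len(data)} points as boundary candidates")
```

Because the score uses unit vectors rather than raw distances, it does not depend on the local scale, which is one simple way to cope with clusters of different density. A downstream algorithm would then run on `data[idx]` instead of the full dataset.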
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhao, H., Chen, Z., Tong, Q., Bo, Y. (2018). Towards a Compact and Effective Representation for Datasets with Inhomogeneous Clusters. In: Cheng, L., Leung, A., Ozawa, S. (eds) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol 11304. Springer, Cham. https://doi.org/10.1007/978-3-030-04212-7_14
DOI: https://doi.org/10.1007/978-3-030-04212-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04211-0
Online ISBN: 978-3-030-04212-7
eBook Packages: Computer Science, Computer Science (R0)