Abstract
For high-dimensional data, distance calculations fail because points become nearly equidistant, and index trees become inefficient because of redundant attributes; both problems seriously degrade the performance of clustering algorithms. To solve these problems, this paper introduces a clustering algorithm for high-dimensional data based on a sequential Psim matrix and differential truncation. First, the similarity of high-dimensional data is calculated with the Psim function, which avoids the equidistance problem. Second, the data are organized in a sequential Psim matrix, which improves indexing performance. Third, the initial clusters are produced by differential truncation. Finally, the K-Medoids algorithm is used to refine the clusters. The algorithm was compared with the K-Medoids and spectral clustering algorithms on two types of datasets. The experimental results indicate that the proposed algorithm achieves high Macro-F1 and Micro-F1 values within a small number of iterations.
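The two evaluation measures named above can be illustrated with a minimal sketch. It uses the standard definitions of Macro-F1 (the unweighted mean of per-class F1) and Micro-F1 (F1 computed from counts pooled over all classes). The function name `macro_micro_f1` and the assumption that cluster labels have already been mapped onto class labels are illustrative choices, not details taken from the paper.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label assignments.

    Assumes each predicted cluster has already been mapped to a class
    label, so y_true and y_pred share one label space (an assumption
    of this sketch, not a step described in the abstract).
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct assignment for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is empty.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro-F1: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro-F1: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    pred = [0, 0, 1, 2, 2, 2]
    print(macro_micro_f1(truth, pred))
```

Macro-F1 is sensitive to how well small classes are recovered, while Micro-F1 is dominated by the large ones, which is one reason the two are usually reported together.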
Acknowledgments
This work was partly supported by the National Natural Science Foundation of China (Nos. 61502475, 61602285) and the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions (No. CIT & TCD201504039).
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, G., Li, W., Xu, W. (2018). A Clustering Algorithm of High-Dimensional Data Based on Sequential Psim Matrix and Differential Truncation. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11335. Springer, Cham. https://doi.org/10.1007/978-3-030-05054-2_23
DOI: https://doi.org/10.1007/978-3-030-05054-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05053-5
Online ISBN: 978-3-030-05054-2
eBook Packages: Computer Science (R0)