A Clustering Algorithm of High-Dimensional Data Based on Sequential Psim Matrix and Differential Truncation

Wang, Gongming; Li, Wenfa; Xu, Weizhi

doi:10.1007/978-3-030-05054-2_23

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11335)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

For high-dimensional data, two problems seriously degrade the performance of clustering algorithms: distance calculations fail because points become nearly equidistant, and index trees become inefficient because of redundant attributes. To address these problems, this paper introduces a clustering algorithm for high-dimensional data based on a sequential Psim matrix and differential truncation. First, the similarity of high-dimensional data is calculated with the Psim function, which avoids the equidistance problem. Second, the data are organized in a sequential Psim matrix, which improves indexing performance. Third, the initial clusters are produced by differential truncation. Finally, the K-Medoids algorithm is used to refine the clusters. The algorithm was compared with the K-Medoids and spectral clustering algorithms on two types of datasets. The experimental results indicate that the proposed algorithm reaches high Macro-F1 and Micro-F1 values within a small number of iterations.
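The abstract reports Macro-F1 and Micro-F1 but does not define them. As a rough illustration, the sketch below computes both under one common convention: Macro-F1 as the unweighted mean of per-class F1 scores, and Micro-F1 as the F1 of the pooled true-positive, false-positive, and false-negative counts. The function name `f1_scores` and the toy labels are hypothetical, and the sketch assumes that cluster labels have already been mapped to class labels; the paper may use a different variant.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label assignments.

    Assumes cluster labels in y_pred are already mapped to the class
    labels used in y_true (an assumption, not taken from the paper).
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct assignment for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class is empty
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, each class weighted equally
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy example: six points, two classes, one point assigned to the wrong class
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"Macro-F1 = {macro_f1:.4f}, Micro-F1 = {micro_f1:.4f}")
```

Macro-F1 weights every class equally, so it is sensitive to errors on small classes, whereas Micro-F1 weights every point equally; reporting both gives a fuller picture when class sizes are unbalanced.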

Acknowledgments

This work is partly supported by the National Natural Science Foundation of China (Nos. 61502475, 61602285) and the Importation and Development of High-Caliber Talents Project of the Beijing Municipal Institutions (No. CIT & TCD201504039).

Author information

Authors and Affiliations

Institute of Biophysics, Chinese Academy of Sciences, No. 15 Datun Road, Beijing, China
Gongming Wang
College of Information Technology, Beijing Union University, No. 97 Beisihuan East Road, Beijing, China
Wenfa Li
School of Information Science and Engineering, Shandong Normal University, No. 88 East Wenhua Road, Jinan, China
Weizhi Xu

Corresponding author

Correspondence to Gongming Wang.

Editor information

Editors and Affiliations

Rutgers University, Newark, NJ, USA
Jaideep Vaidya
Guangzhou University, Guangzhou, China
Jin Li

About this paper

Cite this paper

Wang, G., Li, W., Xu, W. (2018). A Clustering Algorithm of High-Dimensional Data Based on Sequential Psim Matrix and Differential Truncation. In: Vaidya, J., Li, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2018. Lecture Notes in Computer Science, vol 11335. Springer, Cham. https://doi.org/10.1007/978-3-030-05054-2_23

DOI: https://doi.org/10.1007/978-3-030-05054-2_23
Published: 07 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05053-5
Online ISBN: 978-3-030-05054-2
eBook Packages: Computer Science, Computer Science (R0)
