Clustering Categorical Sequences with Variable-Length Tuples Representation

Yuan, Liang; Hong, Zhiling; Chen, Lifei; Cai, Qiang

doi:10.1007/978-3-319-47650-6_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Clustering categorical sequences remains a difficult problem because efficient representation models for sequences are lacking. Existing models mainly represent a sequence by fixed-length tuples; this paper proposes a new representation model based on variable-length tuples. A suffix tree is first built for fixed-length tuples with a large memory length. Redundant tuples are then pruned from the tree according to an entropy-based measure of tuple redundancy, and the tuples that remain have variable lengths. A partitioning algorithm for clustering categorical sequences is then defined on the normalized representation built from the tuples collected from the pruned tree. Experimental studies on six real-world sequence sets show the effectiveness and suitability of the proposed method for subsequence-based clustering.
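
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. A flat dictionary stands in for the suffix tree, and the redundancy test is an assumption: a tuple is treated as redundant when the entropy of the symbols that follow it is close to the entropy of the symbols that follow its one-symbol-shorter suffix. The function names, the `threshold` parameter, and the sum-to-one normalization are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def count_tuples(sequences, max_len):
    """Count contiguous tuples of length 1..max_len and the symbols that follow them."""
    counts = Counter()
    followers = defaultdict(Counter)
    for seq in sequences:
        n = len(seq)
        for i in range(n):
            for k in range(1, min(max_len, n - i) + 1):
                t = tuple(seq[i:i + k])
                counts[t] += 1
                if i + k < n:
                    followers[t][seq[i + k]] += 1
    return counts, followers


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counter.values())


def prune(counts, followers, threshold):
    """Drop longer tuples judged redundant by the ASSUMED entropy criterion."""
    empty = Counter()
    kept = []
    for t in counts:
        if len(t) > 1:
            h_long = entropy(followers.get(t, empty))
            h_short = entropy(followers.get(t[1:], empty))
            # Assumption: the longer context is redundant if it barely
            # changes the uncertainty about the next symbol.
            if abs(h_short - h_long) < threshold:
                continue
        kept.append(t)
    return sorted(kept)


def represent(seq, kept, max_len):
    """Frequency vector of one sequence over the kept tuples, scaled to sum to 1."""
    counts, _ = count_tuples([seq], max_len)
    raw = [counts[t] for t in kept]
    total = sum(raw)
    return [c / total if total else 0.0 for c in raw]


if __name__ == "__main__":
    data = ["abc", "bbd", "bbc"]
    counts, followers = count_tuples(data, max_len=2)
    kept = prune(counts, followers, threshold=0.5)
    for s in data:
        print(s, represent(s, kept, max_len=2))
```

The resulting vectors could be passed to any partitioning method, such as a k-means-style algorithm; the abstract does not specify the clustering procedure, so that step is left out here.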

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61175123, and partially supported by the Natural Science Foundation of Fujian Province of China under Grant No. 2015J01238.

Author information

Authors and Affiliations

Network Operation Maintenance Center, University of Electronic Science and Technology of China, Chengdu, 611731, China
Liang Yuan
Software School, Xiamen University, Xiamen, 361005, China
Zhiling Hong
School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, 350117, Fujian, China
Lifei Chen
Technique Department, Xiamen Customs, Xiamen, 361000, China
Qiang Cai

Corresponding author

Correspondence to Zhiling Hong.

Editor information

Editors and Affiliations

University of Passau, Passau, Germany
Franz Lehner
University of Passau, Passau, Germany
Nora Fteimi

About this paper

Cite this paper

Yuan, L., Hong, Z., Chen, L., Cai, Q. (2016). Clustering Categorical Sequences with Variable-Length Tuples Representation. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_2

DOI: https://doi.org/10.1007/978-3-319-47650-6_2
Published: 05 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)
