Clustering Categorical Sequences with Variable-Length Tuples Representation

  • Conference paper
Knowledge Science, Engineering and Management (KSEM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9983)

Abstract

Clustering categorical sequences remains a difficult problem because efficient representation models for sequences are lacking. Existing models mainly rely on a fixed-length tuples representation; this paper proposes a new representation model based on variable-length tuples. A suffix tree is first built for fixed-length tuples using a large memory length, and a pruning method then deletes redundant tuples from the tree, where the redundancy of a tuple is evaluated by an entropy-based measure. The tuples collected from the pruned tree give a normalized representation of each sequence, on which a partitioning algorithm for clustering categorical sequences is defined. Experimental studies on six real-world sequence sets show the effectiveness and suitability of the proposed method for subsequence-based clustering.
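The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the redundancy measure, the normalization, or the partitioning procedure, so everything below is an assumption made for illustration. Redundancy is approximated by a count-weighted Kullback-Leibler divergence between the next-symbol distribution of a tuple and that of its longest proper suffix, a flat dictionary stands in for the suffix tree, normalization is taken to be relative frequency, and the partitioning step is a plain k-means. The names `select_tuples`, `represent`, `kmeans`, `max_len`, and `threshold` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_symbol_counts(sequences, max_len):
    """For every tuple of length 1..max_len, count the symbols that follow it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq)):
            for n in range(1, max_len + 1):
                if i + n < len(seq):
                    counts[seq[i:i + n]][seq[i + n]] += 1
    return counts


def kl_divergence(p_counts, q_counts):
    """KL(p || q) in bits between two next-symbol count tables.

    q is always the table of a suffix of p's tuple, so every symbol seen
    after p has also been seen after q and no zero division can occur.
    """
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    return sum(
        (c / p_total) * math.log2((c / p_total) / (q_counts[sym] / q_total))
        for sym, c in p_counts.items()
    )


def select_tuples(sequences, max_len, threshold):
    """Keep all single symbols plus the longer tuples judged non-redundant.

    Assumed criterion: a tuple is redundant when it predicts the next
    symbol about as well as its longest proper suffix does.
    """
    counts = next_symbol_counts(sequences, max_len)
    kept = {sym for seq in sequences for sym in seq}
    for tup, nxt in counts.items():
        if len(tup) > 1:
            gain = sum(nxt.values()) * kl_divergence(nxt, counts[tup[1:]])
            if gain >= threshold:
                kept.add(tup)
    return sorted(kept)


def represent(seq, tuples):
    """Relative frequencies of the kept tuples within one sequence."""
    freq = [
        sum(1 for i in range(len(seq) - len(t) + 1) if seq[i:i + len(t)] == t)
        for t in tuples
    ]
    total = sum(freq)
    return [f / total if total else 0.0 for f in freq]


def kmeans(vectors, k, iters=20):
    """Plain k-means with a naive initialization from the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy data, invented for this example only.
sequences = ["abababab", "ccdccdccd", "babababa", "cdccdccdc"]
tuples = select_tuples(sequences, max_len=3, threshold=1.0)
vectors = [represent(s, tuples) for s in sequences]
labels = kmeans(vectors, k=2)
```

In this toy run the alternating `a`/`b` sequences contribute no tuples longer than one symbol, because each longer context predicts the next symbol exactly as well as its suffix does, while the `c`/`d` sequences retain a few length-two tuples. That is the intended effect of a variable-length representation: longer tuples survive only where they carry extra information.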

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61175123, and partially supported by the Natural Science Foundation of Fujian Province of China under Grant No. 2015J01238.

Author information

Corresponding author

Correspondence to Zhiling Hong.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Yuan, L., Hong, Z., Chen, L., Cai, Q. (2016). Clustering Categorical Sequences with Variable-Length Tuples Representation. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_2

  • DOI: https://doi.org/10.1007/978-3-319-47650-6_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47649-0

  • Online ISBN: 978-3-319-47650-6

  • eBook Packages: Computer Science, Computer Science (R0)
