Data Mining and Knowledge Discovery, Volume 29, Issue 2, pp 466–502

A framework for dissimilarity-based partitioning clustering of categorical time series

  • Manuel García-Magariños
  • José A. Vilar

Abstract

A new framework for clustering categorical time series is proposed. Our approach uses a dissimilarity-based partitioning method. We suggest measuring the dissimilarity between two categorical time series by assessing both the closeness of raw categorical values and the proximity between dynamic behaviours. For the latter, a particular index computing the temporal correlation for categorical-valued sequences is introduced. The dissimilarity measure is then used to perform clustering with a modified version of the \(k\)-modes algorithm specifically designed to provide a better characterization of the clusters. Furthermore, the problem of determining the number of clusters in this framework is analyzed by comparing a range of procedures, including a prediction-based resampling method adapted to our dissimilarity. Several graphical devices to interpret and visualize the temporal pattern of each cluster are also provided. The performance of this clustering methodology is studied on different simulated scenarios, and its effectiveness is established by comparison with alternative approaches. Its use on real data is illustrated by analyzing navigation patterns of users visiting a specific news web site.
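To make the idea of combining two sources of dissimilarity concrete, the following is a minimal illustrative sketch in Python. It is not the index defined in the paper, whose formula is not given in this abstract. As assumptions made purely for illustration, closeness of raw values is taken to be the fraction of time points where the categories differ, proximity of dynamic behaviour is taken to be the fraction of consecutive steps where one series changes category and the other does not, and the two are blended with a hypothetical `weight` parameter.

```python
from typing import Hashable, Sequence


def value_mismatch(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Fraction of time points where the two series take different categories."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("series must be non-empty and of equal length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def behaviour_mismatch(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """Fraction of consecutive steps where exactly one series changes category.

    This is an illustrative stand-in for a temporal-correlation index,
    not the index proposed by the authors.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("series must be non-empty and of equal length")
    if len(x) < 2:
        return 0.0
    changes_x = [x[t] != x[t + 1] for t in range(len(x) - 1)]
    changes_y = [y[t] != y[t + 1] for t in range(len(y) - 1)]
    return sum(cx != cy for cx, cy in zip(changes_x, changes_y)) / len(changes_x)


def dissimilarity(
    x: Sequence[Hashable], y: Sequence[Hashable], weight: float = 0.5
) -> float:
    """Convex combination of the two components (weight is an assumed parameter)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * value_mismatch(x, y) + (1.0 - weight) * behaviour_mismatch(x, y)


if __name__ == "__main__":
    x = ["a", "a", "b", "b"]
    y = ["a", "b", "b", "b"]
    print(dissimilarity(x, y))
```

A pairwise matrix of such values could then be passed to any partitioning method that accepts precomputed dissimilarities. The choice of `weight` controls how much the raw values matter relative to the pattern of changes.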

Keywords

Categorical time series · Dissimilarity-based clustering · \(k\)-Modes algorithm · Estimating number of clusters · Data visualization

Notes

Acknowledgments

This work has been carried out as part of the research project “Digital Human Behaviour”, supported by Prisa Digital SL, a subsidiary company of Grupo Prisa SA, and the Centre for Industrial Technological Development (CDTI) under the Ministerio de Economía y Competitividad. Databases used in Section 4.1 were supplied by Prisa Digital SL. Special thanks go to Miguel Rodríguez and Susana Ladra of the LBD research group of the University of A Coruña for their assistance in data pre-processing, and to Samuel Piñeiro of Prisa Digital for his valuable help in interpreting the problem and objectives. The authors are grateful to the Prisa Digital staff and the different research groups involved in the Digital HUB Project for their helpful comments and discussions. Suggestions made by three reviewers substantially improved the paper.

Copyright information

© The Author(s) 2014

Authors and Affiliations

  1. Department of Mathematics, University of A Coruña, A Coruña, Spain
