A multi-breakpoints approach for symbolic discretization of time series

Abstract

Time series discretization is a technique commonly used to tackle time series classification problems. This manuscript presents an enhanced multi-objective approach for the symbolic discretization of time series called eMODiTS. The method proposed uses a different breakpoints vector, defined per each word segment, to increase the search space of the discretization schemes. eMODiTS’ search mechanism is the well-known evolutionary multi-objective algorithm NSGA-II, which finds a set of possible solutions according to entropy, complexity, and information loss estimations. Final solutions were appraised depending on the misclassification rate computed through the decision tree classifier. The trees obtained also produce graphical and significant information from the regions, relationships, or patterns in each database. Our proposal was compared against ten state-of-the-art time symbolic discretization algorithms. The results suggest that our proposal finds a suitable discretization scheme regarding classification, dimensionality, cardinality reduction, and information loss.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20

References

  1. 1.

    Chaudhari P, Rana DP, Mehta RG, Mistry NJ, Raghuwanshi MM (2014) Discretization of temporal data: a survey. CoRR abs/1402.4283. arXiv:1402.4283

  2. 2.

    Esling P, Agon C (2012) Time-series data mining. ACM Comput Surv 45(1):12:1–12:34. https://doi.org/10.1145/2379776.2379788

    Article  MATH  Google Scholar 

  3. 3.

    Azulay R, Moskovitch R, Stopel D, Verduijn M, De Jonge E, Shahar Y (2007) Temporal discretization of medical time series: a comparative study. In: Workshop on intelligent data analysis in biomedicine and pharmacology. Amsterdam, The Netherlands

  4. 4.

    Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms. In: Proceedings of the 8th ACM SIGMOD workshop on research issues in data mining and knowledge discovery, DMKD ’03. ACM, New York, NY, USA, pp 2–11

  5. 5.

    Acosta-Mesa HG, Rechy-Ramírez F, Mezura-Montes E, Cruz-Ramírez N, Jiménez-Hernández R (2014) Application of time series discretization using evolutionary programming for classification of precancerous cervical lesions. J Biomed Inform 49:73–83

    Article  Google Scholar 

  6. 6.

    Song W, Wang Z, Zhang F, Ye Y, Fan M (2017) Empirical study of symbolic aggregate approximation for time series classification. Intell Data Anal 21(1):135–150

    Article  Google Scholar 

  7. 7.

    Keogh E, Xi X, Wei L, Ratanamahatana C (2006) The UCR time series classification/clustering homepage. www.cs.ucr.edu/~eamonn/time_series_data/

  8. 8.

    Garcia S, Luengo J, Sáez JA, Lopez V, Herrera F (2012) A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750

    Article  Google Scholar 

  9. 9.

    Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdiscip Rev Data Min Knowl Discov 6(1):5–21

    Article  Google Scholar 

  10. 10.

    Fu TC (2011) A review on time series data mining. Eng Appl Artif Intell 24(1):164–181. https://doi.org/10.1016/j.engappai.2010.09.007

    Article  Google Scholar 

  11. 11.

    Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Machine learning proceedings 1995. Elsevier, pp 194–202

  12. 12.

    Mörchen F, Ultsch A (2005) Optimizing time series discretization for knowledge discovery. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, pp 660–665

  13. 13.

    Chaves R, Ramírez J, Gorriz J, Initiative ADN et al (2013) Integrating discretization and association rule-based classification for alzheimer’s disease diagnosis. Expert Syst Appl 40(5):1571–1578

    Article  Google Scholar 

  14. 14.

    Agrawal R, Faloutsos C, Swami A (1993) Efficient similarity search in sequence databases. In: International conference on foundations of data organization and algorithms. Springer, pp 69–84

  15. 15.

    Chan F, Fu A, Yu C (2003) Haar wavelets for efficient similarity search of time-series: with and without time warping. IEEE Trans Knowl Data Eng 15:686–705. https://doi.org/10.1109/TKDE.2003.1198399

    Article  Google Scholar 

  16. 16.

    Yang K, Shahabi C (2005) On the stationarity of multivariate time series for correlation-based data analysis. In: Fifth IEEE international conference on data mining (ICDM’05). IEEE, p 4

  17. 17.

    Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing sax: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144. https://doi.org/10.1007/s10618-007-0064-z

    MathSciNet  Article  Google Scholar 

  18. 18.

    Pham ND, Le QL, Dang TK (2010) Two novel adaptive symbolic representations for similarity search in time series databases. In: 2010 12th international asia-pacific web conference, pp 181–187. https://doi.org/10.1109/APWeb.2010.23

  19. 19.

    Bondu A, Boullé M, Cornuéjols A (2016) Symbolic representation of time series: a hierarchical coclustering formalization. Springer International Publishing, Cham, pp 3–16

    Google Scholar 

  20. 20.

    Bondu A, Boullé M, Grossin B (2013) Saxo: an optimized data-driven symbolic representation of time series. In: The 2013 international joint conference on neural networks (IJCNN), pp 1–9. https://doi.org/10.1109/IJCNN.2013.6706816

  21. 21.

    Ahmed A, Bakar A, Hamdan A (2011) Harmony search algorithm for optimal word size in symbolic time series representation. In: 2011 3rd conference on data mining and optimization (DMO), pp 57–62

  22. 22.

    Ahmed AM, Bakar AA, Hamdan AR (2014) A harmony search algorithm with multi-pitch adjustment rate for symbolic time series data representation. Int J Mod Educ Comput Sci 6(6):58

    Article  Google Scholar 

  23. 23.

    Muhammad Fuad MM (2012) Genetic algorithms-based symbolic aggregate approximation. Springer, Berlin, pp 105–116. https://doi.org/10.1007/978-3-642-32584-7_9

    Google Scholar 

  24. 24.

    Fuad MMM (2012) Differential evolution versus genetic algorithms: Towards symbolic aggregate approximation of non-normalized time series. In: Proceedings of the 16th international database engineering & applications symposium. ACM, pp 205–210

  25. 25.

    Acosta-Mesa H, Cruz-Ramírez N, García-López D (2008) Entropy based linear approximation algorithm for time series discretization. Adv Artif Intell Appl 32:214–224

    Google Scholar 

  26. 26.

    García-López D, Acosta-Mesa HG (2009) Discretization of time series dataset with a genetic search. In: Proceedings of the 8th Mexican international conference on artificial intelligence, MICAI ’09. Springer, Berlin, pp 201–212

  27. 27.

    Rechy-Ramírez F, Acosta-Mesa HG, Mezura-Montes E, Cruz-Ramírez N (2011) Times series discretization using evolutionary programming. In: Batyrshin IZ, Sidorov G (eds) MICAI (2), vol 7095. Lecture notes in computer science. Springer, Berlin, pp 225–234

    Google Scholar 

  28. 28.

    Marler R, Arora J (2004) Survey of multi-objective optimization methods for engineering. Struct Multidiscip Optim 26(6):369–395. https://doi.org/10.1007/s00158-003-0368-6

    MathSciNet  Article  MATH  Google Scholar 

  29. 29.

    Lkhagva B, Suzuki Y, Kawagoe K (2006) Extended SAX: Extension of symbolic aggregate approximation for financial time series data representation. DEWS2006 4A-i8 7

  30. 30.

    Sant’Anna A, Wickström N (2011) Symbolization of time-series: an evaluation of sax, persist, and aca. In: 2011 4th international congress on image and signal processing, vol 4, pp 2223–2228. https://doi.org/10.1109/CISP.2011.6100559

  31. 31.

    Malinowski S, Guyet T, Quiniou R, Tavenard R (2013) 1d-SAX: a novel symbolic representation for time series. Springer, Berlin, pp 273–284. https://doi.org/10.1007/978-3-642-41398-8_24

    Google Scholar 

  32. 32.

    Lkhagva B, Suzuki Y, Kawagoe K (2006) New time series data representation ESAX for financial applications. In: Proceedings of 22nd international conference on data engineering workshops, 2006. IEEE, pp x115–x115

  33. 33.

    Bai X, Xiong Y, Zhu Y, Zhu H (2013) Time series representation: a random shifting perspective. In: International conference on web-age information management. Springer, pp 37–50

  34. 34.

    dos Santos Passos H, Teodoro FGS, Duru BM, de Oliveira EL, Peres SM, Lima CAM (2017) Symbolic representations of time series applied to biometric recognition based on ecg signals. In: 2017 international joint conference on neural networks (IJCNN), pp 3199–3207. https://doi.org/10.1109/IJCNN.2017.7966255

  35. 35.

    Moskovitch R, Shahar Y (2015) Classification-driven temporal discretization of multivariate time series. Data Min Knowl Discov 29(4):871–913

    MathSciNet  Article  Google Scholar 

  36. 36.

    Márquez-Grajales A, Acosta-Mesa HG, Mezura-Montes E (2017) An adaptive symbolic discretization scheme for the classification of temporal datasets using nsga-ii. In: 2017 IEEE international autumn meeting on power, electronics and computing (ROPEC), pp 1–8. https://doi.org/10.1109/ROPEC.2017.8261674

  37. 37.

    Sammut C, Webb GI (eds) (2017) Encyclopedia of machine learning and data mining. Springer, Berlin. https://doi.org/10.1007/978-1-4899-7687-1

    Google Scholar 

  38. 38.

    Deb K, Kalyanmoy D (2001) Multi-objective optimization using evolutionary algorithms. Wiley, New York

    Google Scholar 

  39. 39.

    Rangaiah G (2009) Multi-objective optimization: techniques and applications in chemical engineering. Advances in process systems engineering. World Scientific, Singapore

    Google Scholar 

  40. 40.

    Edgeworth FY (1881) Mathematical psychics: an essay on the application of mathematics to the moral sciences. Kegan Paul, London

    Google Scholar 

  41. 41.

    Pareto V (1896) Cours d’Economie Politique. Droz, Genève

    Google Scholar 

  42. 42.

    Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: Nsga-ii. Trans Evol Comp 6(2):182–197. https://doi.org/10.1109/4235.996017

    Article  Google Scholar 

  43. 43.

    Dimitrova ES, Licona MPV, McGee J, Laubenbacher R (2010) Discretization of time series data. J Comput Biol 17(6):853–868

    MathSciNet  Article  Google Scholar 

  44. 44.

    Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajesy (ed) Proceedings of the 13th international joint conference on artificail intelligence. Morgan Kaufmann, Chambèry, France, pp 1022–1029

  45. 45.

    Kurgan LA, Cios KJ (2004) Caim discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145–153

    Article  Google Scholar 

  46. 46.

    Bagnall A, Lines J, Bostrom A, Large J, Keogh E (2016) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Min Knowl Discov 31:606–660

    MathSciNet  Article  Google Scholar 

  47. 47.

    Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining, 1st edn. Addison-Wesley Longman Publishing Co.,Inc, Boston

    Google Scholar 

  48. 48.

    Dau HA, Bagnall A, Kamgar K, Yeh CCM, Zhu Y, Gharghabi S, Ratanamahatana CA, Keogh E (2018) The UCR time series archive. arXiv preprint arXiv:1810.07758

  49. 49.

    Batista GE, Wang X, Keogh EJ (2011) A complexity-invariant distance measure for time series. In: Proceedings of the 2011 SIAM international conference on data mining. SIAM, pp 699–710

  50. 50.

    Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30

    MathSciNet  MATH  Google Scholar 

  51. 51.

    García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044–2064

    Article  Google Scholar 

  52. 52.

    Vlachos M, Kollios G, Gunopulos D (2002) Discovering similar multidimensional trajectories. In: Proceedings of 18th international conference on data engineering, 2002. IEEE, pp 673–684

  53. 53.

    Ding H, Trajcevski G, Wang X, Keogh E (2008) Querying and mining of time series data: experimental comparison of representations and distance measures. In: Proceedings of the 34 th VLDB, pp 1542–1552

Download references

Acknowledgements

The first author acknowledges the support from the Mexican National Council for Science and Technology (CONACyT) through scholarship number 389200 to pursue graduate studies at the University of Veracruz. Besides, the first author especially acknowledges to the INFOTEC Centro de Investigación e Innovación en Tecnologías de la Información y Comunicación for the borrowed computational resources and the received advice by the researchers during the research stay. The third author acknowledges support from CONACYT through project No. 220522. All the authors acknowledge to all the people who created, collected, and made available the temporal databases used in this work since without them this work would not have been successful.

Author information

Affiliations

Authors

Corresponding author

Correspondence to Aldo Márquez-Grajales.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Márquez-Grajales, A., Acosta-Mesa, H., Mezura-Montes, E. et al. A multi-breakpoints approach for symbolic discretization of time series. Knowl Inf Syst 62, 2795–2834 (2020). https://doi.org/10.1007/s10115-020-01437-4

Download citation

Keywords

  • Time series classification
  • Symbolic representation
  • Multi-breakpoints approach
  • Multi-objective optimization