Design issues in Time Series dataset balancing algorithms

Abstract

The Internet of Things and e-Health are producing huge collections of Time Series (TS) that are analyzed to classify the current status or to detect certain events, among other tasks. In two-class problems, when the positive events to detect are infrequent, the gathered data are imbalanced. Even in supervised learning, this imbalance reduces the generalization capability of the models. TS balancing algorithms have been proposed to solve this problem, but they have barely been studied: the existing approaches either draw some TS from a single bag of TS to generate a new synthetic one, or generate ghost points in the distance space. These solutions are suitable for univariate datasets that come from a single data source. However, in the context of the Internet of Things, where multiple data sources are available, these approaches may not perform coherently. Besides, to the best of our knowledge, the literature contains no balancing algorithm for multivariate TS from multiple data sources. In this research, we study two main concerns that should be considered when designing TS balancing algorithms: on the one hand, they should deal with multiple multivariate data sources; on the other hand, they should be shape preserving. As part of this work, a new algorithm is proposed for balancing multivariate TS datasets. The algorithm is evaluated on two real-world multivariate TS datasets from the e-Health domain: one on epileptic seizure identification and the other on fall detection. The performance is analyzed thoroughly, showing the advantages of considering the TS issues within the balancing algorithm.
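To make the single-bag, SMOTE-style idea mentioned in the abstract concrete, the following is a minimal sketch and not the algorithm proposed in the article. It assumes that all minority-class TS have the same length and the same variables, that Euclidean distance is an acceptable similarity measure, and that one interpolation factor is shared by every variable and time step so the synthetic TS keeps a shape between its two parents. The function names (`distance`, `synthesize`, `oversample`) and the parameter `k` are hypothetical choices made for illustration.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length multivariate TS.

    Each TS is a list of time steps; each time step is a list of variables.
    """
    return math.sqrt(sum((x - y) ** 2
                         for step_a, step_b in zip(a, b)
                         for x, y in zip(step_a, step_b)))


def synthesize(series, neighbor, gap):
    """Interpolate between two TS using one shared gap in [0, 1].

    Sharing the gap across variables and time steps is an assumption of
    this sketch; it keeps the variables of the synthetic TS aligned.
    """
    return [[x + gap * (y - x) for x, y in zip(step_s, step_n)]
            for step_s, step_n in zip(series, neighbor)]


def oversample(minority, n_new, k=2, rng=None):
    """Generate n_new synthetic TS from a bag of minority-class TS.

    Needs at least two TS in the bag.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [ts for ts in minority if ts is not base]
        # Keep the k nearest neighbours of the chosen TS.
        neighbors = sorted(others, key=lambda ts: distance(base, ts))[:k]
        neighbor = rng.choice(neighbors)
        synthetic.append(synthesize(base, neighbor, rng.random()))
    return synthetic


if __name__ == "__main__":
    # Three toy minority TS, each with 2 time steps and 2 variables.
    bag = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[2.0, 2.0], [3.0, 3.0]],
        [[10.0, 10.0], [11.0, 11.0]],
    ]
    for ts in oversample(bag, n_new=3, k=1):
        print(ts)
```

A sketch like this treats all TS as one bag, which is exactly the limitation the abstract points out: it does not distinguish which data source a TS came from, and plain Euclidean distance does not account for time shifts between TS.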

Notes

  1. In medical record databases regarding a rare disease, where a large number of patients do not have that disease, the counterpart class is the one corresponding to patients without the disease.

  2. Remember that only the FALL TSs are TS_SMOTEd.

  3. R Caret package.

Author information

Correspondence to Enrique A. de la Cal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research has been funded by the Spanish Ministry of Economy, Industry and Competitiveness (MINECO), under Grants TIN2014-56967-R and TIN2017-84804-R.

About this article

Cite this article

de la Cal, E.A., Villar, J.R., Vergara, P.M. et al. Design issues in Time Series dataset balancing algorithms. Neural Comput & Applic 32, 1287–1304 (2020). https://doi.org/10.1007/s00521-019-04011-4

Keywords

  • Imbalanced Time Series
  • Correlation measures
  • Human activity recognition
  • Epilepsy onset recognition
  • Fall detection