Accelerating pattern-based time series classification: a linear time and space string mining approach

Abstract

Subsequence-based time series classification algorithms provide interpretable and generally more accurate classification models than the nearest neighbor approach, albeit at a considerably higher computational cost. A number of algorithms based on discretized time series have been proposed to reduce this computational complexity; however, their asymptotic time complexity is also cubic or higher-order polynomial. We present a remarkably fast and resource-efficient time series classification approach which employs a linear time and space string mining algorithm for extracting frequent patterns from discretized time series data. Compared to other subsequence- or pattern-based classification algorithms, the proposed approach requires only a few parameters, which can be chosen arbitrarily and do not require fine-tuning for different datasets. The time series data are discretized using symbolic aggregate approximation (SAX), and frequent patterns are extracted using a string mining algorithm. An independence test is used to select the most discriminative frequent patterns, which are subsequently used to create a transformed version of the time series data. Finally, a classification model can be trained using any off-the-shelf algorithm. Extensive empirical evaluations demonstrate the competitive classification accuracy of our approach compared to other state-of-the-art approaches. The experiments also show that our approach is at least one to two orders of magnitude faster than existing pattern-based methods, owing to the extremely fast frequent pattern extraction, which is the most computationally intensive step in pattern-based time series classification approaches.



Notes

  1.

    Throughout the text, we refer to real-valued time series segments as “subsequences” and the discretized/symbolic segments as “patterns.”

  2.

    The presented mathematical notation is for the simple case of integer values of p; later SAX refinements enable handling non-integer window sizes as well.

  3.

    Research suggests that a large number of time series datasets follow the Gaussian distribution. For the minority of datasets which do not follow this assumption, selecting the breakpoints using the Gaussian curve can deteriorate the efficiency of SAX; however, the “correctness of the algorithm is unaffected” [10].

  4.

    Usually, multi-view learning refers to learning with different sets of features of vectorial data; however, here we use the term for multiple representations of time series data originating from different parameterizations.

  5.

    1. This criterion and procedure are not to be confused with closed or open/free patterns [16]. 2. Note that there can be two patterns p and q, with one pattern being more general than the other, \(p \prec q\), both having the same value of \(\chi ^2\) (\(\chi ^2(p) = \chi ^2(q)\)), yet occurring in different sets of positive and negative examples. However, this should be a rather infrequent case. The overall filtering procedure simply ensures that the selected patterns are frequent enough in the positives, infrequent enough in the negatives, highly discriminative and, given the same discriminative power, as general as possible.

  6.

    A discussion about calculation of the \(\chi ^2\) test statistic and the information gain is provided in “Appendix.”

  7.

    http://www.timeseriesclassification.com/.

  8.

    Our implementation is available from https://github.com/atifraza/MiSTiCl.

  9.

    http://www.cs.ucr.edu/~eamonn/time_series_data/.

  10.

    Following the parameter settings provided by the UEA Time Series Repository.

  11.

    The results for ST have been taken from the UEA Time Series Repository.

References

  1.

    Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/A:1010933404324

  2.

    Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  3.

    Dhaliwal J, Puglisi SJ, Turpin A (2012) Practical efficient string mining. IEEE Trans Knowl Data Eng 24(4):735–744. https://doi.org/10.1109/TKDE.2010.242

  4.

    Ding H, Trajcevski G, Scheuermann P, Wang X, Keogh E (2008) Querying and mining of time series data. Proc VLDB Endow 1(2):1542–1552

  5.

    Fischer J, Heun V, Kramer S (2005) Fast frequent string mining using suffix arrays. In: 5th International conference on data mining, IEEE, ICDM ’05, pp 609–612. https://doi.org/10.1109/ICDM.2005.62

  6.

    Fischer J, Heun V, Kramer S (2006) Optimal string mining under frequency constraints. In: Knowledge discovery in databases, PKDD 2006, lecture notes in computer science, vol 4213. Springer, Berlin, pp 139–150. https://doi.org/10.1007/11871637_17

  7.

    Freund Y (1995) Boosting a weak learning algorithm by majority. Inf Comput 121(2):256–285. https://doi.org/10.1006/inco.1995.1136

  8.

    Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63(1):3–42. https://doi.org/10.1007/s10994-006-6226-1

  9.

    Hills J, Lines J, Baranauskas E, Mapp J, Bagnall A (2014) Classification of time series by shapelet transformation. Data Min Knowl Discov 28(4):851–881. https://doi.org/10.1007/s10618-013-0322-1

  10.

    Lin J, Keogh E, Wei L, Lonardi S (2007) Experiencing SAX: a novel symbolic representation of time series. Data Min Knowl Discov 15(2):107–144. https://doi.org/10.1007/s10618-007-0064-z

  11.

    Lin J, Khade R, Li Y (2012) Rotation-invariant similarity in time series using bag-of-patterns representation. J Intell Inf Syst 39(2):287–315. https://doi.org/10.1007/s10844-012-0196-5

  12.

    Rakthanmanon T, Keogh E (2013) Fast shapelets: a scalable algorithm for discovering time series shapelets. In: Proceedings of the 2013 SIAM international conference on data mining, SDM, Society for Industrial and Applied Mathematics, pp 668–676. https://doi.org/10.1137/1.9781611972832.74

  13.

    Schäfer P (2015) The BOSS is concerned with time series classification in the presence of noise. Data Min Knowl Discov 29(6):1505–1530. https://doi.org/10.1007/s10618-014-0377-7

  14.

    Schäfer P (2016) Scalable time series classification. Data Min Knowl Discov 30(5):1273–1298. https://doi.org/10.1007/s10618-015-0441-y

  15.

    Senin P, Malinchik S (2013) SAX-VSM: interpretable time series classification using SAX and vector space model. In: 13th International conference on data mining, IEEE, ICDM ’13, pp 1175–1180. https://doi.org/10.1109/ICDM.2013.52

  16.

    Toivonen H (2017) Frequent pattern. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, pp 524–529. https://doi.org/10.1007/978-1-4899-7687-1_318

  17.

    Ye L, Keogh E (2011) Time series shapelets: a novel technique that allows accurate, interpretable and fast classification. Data Min Knowl Discov 22(1):149–182. https://doi.org/10.1007/s10618-010-0179-5


Acknowledgements

We are grateful to the reviewers for their comments and suggestions which helped in improving the quality of this paper. The first author was supported by a scholarship from the Higher Education Commission (HEC), Pakistan, and the German Academic Exchange Service (DAAD), Germany.

Author information


Corresponding author

Correspondence to Atif Raza.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Calculating independence test statistics

Section 3.3 provides the algorithmic details for selecting frequent patterns based on their discriminative power. The \(\chi ^2\) independence test or the information gain can be used to quantify the discriminative power of a given pattern, i.e., how effectively it identifies the instances of a given class. This section explains how these statistics are calculated from the occurrence frequency of a pattern in the positive and negative class dataset splits. The notation used in the following discussion is given below.

| Symbol | Representation |
| \(\widehat{P}\) | Positive class dataset |
| \(\widehat{N}\) | Negative class dataset |
| \(N_{\widehat{P}}\) | Number of instances in \(\widehat{P}\) |
| \(N_{\widehat{N}}\) | Number of instances in \(\widehat{N}\) |
| \(p\) | Frequent pattern |
| \(f_{\widehat{P}}\) | Occurrence frequency of \(p\) in \(\widehat{P}\) |
| \(f_{\widehat{N}}\) | Occurrence frequency of \(p\) in \(\widehat{N}\) |

Calculating the \(\chi ^2\) test statistic

The \(\chi ^2\) test statistic is calculated based on observed (\(O_{ij}\)) and expected (\(E_{ij}\)) values for the given categorical variables. The formula for calculating the \(\chi ^2\) statistic is given below.

$$\begin{aligned} \chi ^2 = \sum _{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \end{aligned}$$

The observed values (\(O_{ij}\)) are the numbers of instances falling into each cell of the table: in our case, the number of instances of the positive or negative class that do or do not contain a particular frequent pattern. These counts can be determined from the instance counts of the positive and negative class datasets and the occurrence frequencies of the given pattern in the respective dataset splits. Based on these values, a contingency table can be created as follows.

| | Positive, \(\widehat{P}\) | Negative, \(\widehat{N}\) | Row sum |
| With \(p\) | \(O_{11}=\lfloor f_{\widehat{P}} \times N_{\widehat{P}} \rceil \) | \(O_{12}=\lfloor f_{\widehat{N}} \times N_{\widehat{N}} \rceil \) | \(RSum_{1.} = O_{11}+O_{12}\) |
| Without \(p\) | \(O_{21}=N_{\widehat{P}}-O_{11}\) | \(O_{22}=N_{\widehat{N}}-O_{12}\) | \(RSum_{2.} = O_{21}+O_{22}\) |
| Column sum | \(CSum_{.1}=O_{11}+O_{21}\) | \(CSum_{.2}=O_{12}+O_{22}\) | \(n=\sum _{i,j} O_{ij}\) |

Here, \(\lfloor \cdot \rceil \) denotes rounding to the nearest integer.

The rows of this contingency table correspond to the numbers of instances containing or not containing the given pattern p, while the columns correspond to the positive and negative dataset splits, respectively. Both the row sums and the column sums add up to n, the total number of instances in the two dataset splits. Finally, the expected values (\(E_{ij}\)) are calculated using the following formula.

$$\begin{aligned} E_{ij} = \frac{RSum_{i.} \times CSum_{.j}}{n} \end{aligned}$$

The \(\chi ^2\) test statistic determines whether the occurrence of the frequent pattern is independent of the dataset split. If the pattern occurs with similar frequency in both splits, the \(\chi ^2\) value will be close to zero, signifying that its occurrence carries no information about the class. We can therefore order the frequent patterns by their \(\chi ^2\) statistic and select those with the highest values, i.e., the patterns whose occurrence is most strongly associated with one of the two dataset splits.
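
To make the computation concrete, the construction of the contingency table and the \(\chi ^2\) statistic can be sketched in a few lines of Python. This is a minimal illustration; the function name is ours, and the nearest-integer rounding of the observed counts mirrors the contingency table above, but the code is not the paper's implementation.

```python
def chi_squared(f_pos, f_neg, n_pos, n_neg):
    """Chi-squared statistic for a pattern with occurrence frequencies
    f_pos / f_neg in class splits of n_pos / n_neg instances."""
    # Observed counts: round frequency * split size to the nearest integer.
    o11 = round(f_pos * n_pos)   # positives containing the pattern
    o12 = round(f_neg * n_neg)   # negatives containing the pattern
    o21 = n_pos - o11            # positives without the pattern
    o22 = n_neg - o12            # negatives without the pattern
    observed = [[o11, o12], [o21, o22]]
    row_sums = [o11 + o12, o21 + o22]
    col_sums = [o11 + o21, o12 + o22]
    n = n_pos + n_neg
    # Sum (O - E)^2 / E over the four cells (degenerate tables with an
    # empty row or column are not handled in this sketch).
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# A pattern occurring in 90% of 50 positives but only 10% of 50 negatives
# scores high; one occurring equally often in both splits scores zero.
print(chi_squared(0.9, 0.1, 50, 50))  # 64.0
print(chi_squared(0.5, 0.5, 50, 50))  # 0.0
```

As the example shows, a large \(\chi ^2\) value singles out exactly the patterns whose occurrence depends on the class label.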

Calculating the information gain value

Entropy (H) measures whether the instances of a dataset are distributed uniformly or unevenly over the different classes. Given a dataset with positive and negative class instances, its entropy can be calculated using the following formula.

$$\begin{aligned} H = -\Bigg (\frac{N_{\widehat{P}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times \mathrm{log}_2 \frac{N_{\widehat{P}}}{N_{\widehat{P}}+N_{\widehat{N}}}\Bigg ) -\Bigg (\frac{N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times \mathrm{log}_2 \frac{N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}}\Bigg ) \end{aligned}$$

Given a frequent pattern p, we can split the dataset into two subsets based on the presence or absence of the pattern in each instance. The entropies of these subsets, \(H_{\widehat{P}}\) for the subset of instances containing p and \(H_{\widehat{N}}\) for the subset without it, can then be calculated using the following equations.

$$\begin{aligned} H_{\widehat{P}}= & {} -\Bigg ( \frac{ f_{\widehat{P}} \times N_{\widehat{P}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ f_{\widehat{P}} \times N_{\widehat{P}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \Bigg )\\&-\Bigg ( \frac{ f_{\widehat{N}} \times N_{\widehat{N}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ f_{\widehat{N}} \times N_{\widehat{N}} }{ f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}} } \Bigg ) \\ H_{\widehat{N}}= & {} -\Bigg ( \frac{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} }\Bigg )\\&-\Bigg ( \frac{ (1-f_{\widehat{N}}) \times N_{\widehat{N}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \times \mathrm{log}_2 \frac{ (1-f_{\widehat{N}}) \times N_{\widehat{N}} }{ (1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}} } \Bigg ) \end{aligned}$$

Using the entropy of the full dataset together with the entropies of the two subsets, the information gain is calculated using the following formula.

$$\begin{aligned} IG = H - \Bigg ( \frac{f_{\widehat{P}} \times N_{\widehat{P}} + f_{\widehat{N}} \times N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times H_{\widehat{P}} + \frac{(1-f_{\widehat{P}}) \times N_{\widehat{P}} + (1-f_{\widehat{N}}) \times N_{\widehat{N}}}{N_{\widehat{P}}+N_{\widehat{N}}} \times H_{\widehat{N}} \Bigg ) \end{aligned}$$

If the frequent pattern effectively distinguishes between the two classes, each subset will contain very few or no instances of the other class, resulting in low entropy values for both subsets. This in turn yields a higher information gain, indicating that the pattern is a good candidate for distinguishing between the two classes of instances. If the converse is true, the pattern is a poor candidate. In this way, candidates can be selected on the basis of their discriminative power.
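
The entropy and information gain formulas above can likewise be sketched compactly in Python. This is a minimal illustration with hypothetical helper names, not the paper's implementation; it mirrors the formulas term by term, with \(H_{\widehat{P}}\) and \(H_{\widehat{N}}\) computed as the entropies of the with-pattern and without-pattern subsets.

```python
import math

def entropy(a, b):
    """Binary entropy of a split containing a and b instances of two kinds."""
    total = a + b
    h = 0.0
    for x in (a, b):
        if x > 0:  # 0 * log2(0) is taken as 0
            p = x / total
            h -= p * math.log2(p)
    return h

def information_gain(f_pos, f_neg, n_pos, n_neg):
    """Information gain of splitting the dataset on a pattern's presence."""
    n = n_pos + n_neg
    h = entropy(n_pos, n_neg)  # entropy of the full dataset
    # (Possibly fractional) instance counts in the with-/without-pattern subsets.
    with_pos, with_neg = f_pos * n_pos, f_neg * n_neg
    wo_pos, wo_neg = (1 - f_pos) * n_pos, (1 - f_neg) * n_neg
    h_with = entropy(with_pos, with_neg)   # H_P in the notation above
    h_without = entropy(wo_pos, wo_neg)    # H_N in the notation above
    return h - ((with_pos + with_neg) / n * h_with
                + (wo_pos + wo_neg) / n * h_without)

# A perfectly discriminative pattern recovers the full dataset entropy,
# while a pattern occurring equally often in both classes gains nothing.
print(information_gain(1.0, 0.0, 50, 50))  # 1.0
print(information_gain(0.5, 0.5, 50, 50))  # 0.0
```

The two boundary cases make the selection criterion tangible: information gain ranges from zero (pattern occurrence independent of the class) up to the entropy of the full dataset (pattern perfectly separates the classes).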



Cite this article

Raza, A., Kramer, S. Accelerating pattern-based time series classification: a linear time and space string mining approach. Knowl Inf Syst 62, 1113–1141 (2020). https://doi.org/10.1007/s10115-019-01378-7


Keywords

  • Time series
  • Classification
  • String mining
  • Linear time and space