Data Mining and Knowledge Discovery, Volume 7, Issue 4, pp 349–371

On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration

  • Eamonn Keogh
  • Shruti Kasetty

Abstract

In the last decade there has been an explosion of interest in mining time series data. Literally hundreds of papers have introduced new algorithms to index, classify, cluster and segment time series. In this work we make the following claim: much of this work has very little utility because the contribution made (speed in the case of indexing, accuracy in the case of classification and clustering, model accuracy in the case of segmentation) offers an amount of “improvement” that would have been completely dwarfed by the variance that would have been observed by testing on many real-world datasets, or by changing minor (unstated) implementation details.

To illustrate our point, we have undertaken the most exhaustive set of time series experiments ever attempted, re-implementing the contribution of more than two dozen papers, and testing them on 50 real world, highly diverse datasets. Our empirical results strongly support our assertion, and suggest the need for a set of time series benchmarks and more careful empirical evaluation in the data mining community.
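The claim that a reported improvement can be dwarfed by variance across datasets can be made concrete with a small sketch. The function name `improvement_vs_spread` and the error rates below are hypothetical, invented purely for illustration; they are not results from the paper, and this is not the authors' evaluation procedure.

```python
from statistics import mean, stdev


def improvement_vs_spread(baseline_errors, proposed_errors):
    """Return (mean improvement, standard deviation of per-dataset improvement).

    Each argument holds one error rate per dataset, in the same dataset order.
    A positive improvement means the proposed method has the lower error.
    """
    if len(baseline_errors) != len(proposed_errors):
        raise ValueError("both methods must be scored on the same datasets")
    if len(baseline_errors) < 2:
        raise ValueError("at least two datasets are needed to estimate spread")
    diffs = [b - p for b, p in zip(baseline_errors, proposed_errors)]
    return mean(diffs), stdev(diffs)


# Hypothetical error rates on five datasets (illustrative numbers only).
baseline = [0.10, 0.30, 0.05, 0.22, 0.15]
proposed = [0.08, 0.33, 0.04, 0.24, 0.11]

gain, spread = improvement_vs_spread(baseline, proposed)

# The average gain is "dwarfed" when it is smaller than the spread
# of the per-dataset differences.
dwarfed = abs(gain) < spread
print(f"mean improvement: {gain:.4f}, spread across datasets: {spread:.4f}")
print("improvement dwarfed by variance:", dwarfed)
```

With these made-up numbers the proposed method wins on three datasets and loses on two, so a claim based on any single dataset could point in either direction.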

time series, data mining, experimental evaluation

Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Eamonn Keogh (1)
  • Shruti Kasetty (1)
  1. University of California, Riverside
