# Experiencing SAX: a novel symbolic representation of time series

## Abstract

Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentiality allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series.

In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.

## Keywords

Time series Data mining Symbolic representation Discretize## Preview

Unable to display preview. Download preview PDF.

## References

- Agrawal R, Psaila G, Wimmers EL, Zait M (1995) Querying shapes of histories. In: Proceedings of the 21st International conference on very large databases, Zurich, Switzerland, Sept 11–15, pp 502–514Google Scholar
- Andre-Jonsson H, Badal D (1997) Using signature files for querying time-series data (1997) In: Proceedings of principles of data mining and knowledge discovery, 1st European symposium, Trondheim, Norway. June 24–27, pp 211–220Google Scholar
- Androulakis IP (2005) New approaches for representing, analyzing and visualizing complex kinetic mechanisms. In: Proceedings of the 15th European symposium on computer aided process engineering, Barcelona, Spain. May 29–June 1Google Scholar
- Apostolico A, Bock ME, Lonardi S (2002) Monotony of surprise in large-scale quest for unusual words. In: Proceedings of the 6th International conference on research in computational molecular biology, Washington, DC, April 18–21, pp 22–31Google Scholar
- Bagnall AJ, Janakec G (2004) Clustering time series from arma models with clipped data. In: Proceedings of the 10th ACM SIGKDD International conference on knowledge discovery and data mining, Seattle, WA. August 22–25, pp 49–58Google Scholar
- Bakalov P, Hadjieleftherious M, Tsotras VJ (2005) Time relaxed spatiotemporal trajectory. In: Proceedings of the ACM international symposium on advances in geographic information systems, Bremen, Germany. November 4–5Google Scholar
- Bastogne T, Noura H, Richard A, Hittinger JM (2002) Application of subspace methods to the identification of a winding process. In: Proceedings of the 4th European control conference, Brussels, BelgiumGoogle Scholar
- Berndt D, Clifford J (1994) Using dynamic time warping to find patterns in time series, The workshop on knowledge discovery in databases, the 12th International conference on artificial intelligence, Seattle, WA, pp 229–248Google Scholar
- Celly B, Zordan VB (2004) Animated people textures. In: Proceedings of the 17th international conference on computer animation and social agents, Geneva, Switzerland, July 7–9Google Scholar
- Chan K, Fu AW (1999) Efficient time series matching by wavelets. In: Proceedings of the 15th IEEE International conference on data engineering, Sydney, Australia, March 23–26, pp 126–133Google Scholar
- Chen JS, Moon YS, Yeung HW (2005) Palmprint authentication using time series. In: Proceedings of the 5th international conference on audio- and video-based biometric person authentication, Hilton Rye Town, NY, July 20–22Google Scholar
- Chiu B, Keogh E, Lonardi S (2003) Probabilistic discovery of time series motifs. In: Proceedings of the 9th ACM SIGKDD International conference on knowledge discovery and data mining, Washington DC, USA. August 24–27, pp 493–498Google Scholar
- Dasgupta D, Forrest S (1999) Novelty detection in time series data using ideas from immunology. In: Proceedings of the 8th International conference on intelligent systems, Denver, CO, June 24–26Google Scholar
- Daw CS, Finney CEA, Tracy ER (2001) Symbolic analysis of experimental data. Review of Scientific Instruments 74:915–930CrossRefGoogle Scholar
- Ding C, He X, Zha H, Simon H (2002) Adaptive dimension reduction for clustering high dimensional data. In: Proceedings of the 2nd IEEE International conference on data mining. Maebashi, Japan, December 9–12, pp 147–154Google Scholar
- Duchene F, Garbay C (2005) Apprentissage de Motifs Temporels, Multidimensionnels et Heterogenes: Application a la Telesurveillance Medicale. In: Proceedings of conference francophone sur l’apprentissage Automatique. Nice, France. May 31–June 3Google Scholar
- Duchene F, Garbay C, Rialle V (2004) Mining heterogeneous multivariate time-series for learning meaningful patterns: application to home health telecare. Research Report 1070-I, Institue de Informatique et Mathematiques Appliquees de Grenoble, Grenoble, FranceGoogle Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University PressGoogle Scholar
- Faloutsos C, Ranganathan M, Manolopulos Y (1994) Fast subsequence matching in time-series databases. SIGMOD Record, vol 23, pp 419–429Google Scholar
- Fayyad U, Reina C, Bradley P (1998) Initialization of iterative refinement clustering algorithms. In: Proceedings of the 4th International conference on knowledge discovery and data mining, New York, NY, August 27–31, pp 194–198Google Scholar
- Ferreira PG, Azevedo P, Silva C, Brito R (2006) Mining approximate motifs in time series. In: Proceedings of the 9th international conference on discovery science, Barcelona, Spain, October 7–10Google Scholar
- Gavrilov M, Anguelov D, Indyk P, Motwani R (2000) Mining the stock market: which measure is best? In: Proceedings of the 6th ACM International conference on knowledge discovery and data mining, Boston, MA, August 20–23, pp 487–496Google Scholar
- Geurts P (2001) Pattern extraction for time series classification. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery, Freiburg, Germany, pp 115–127Google Scholar
- Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the 7th International conference on research in computational molecular biology. Berlin, Germany, pp 123–130Google Scholar
- Hellerstein JM, Papadimitriou CH, Koutsoupias E (1997) Towards an analysis of indexing schemes. In: Proceedings of the 16th ACM symposium on principles of database systems, Tucson, AZ, May 12–14, pp 249–256Google Scholar
- Huang YW, Yu PS (1999) Adaptive query processing for time-series data. In: Proceedings of the 5th ACM SIGKDD International conference on knowledge discovery and data mining, San Diego, CA, Aug 15–18, pp 282–286Google Scholar
- Hugueney B (2006) Adaptive segmentation-based symbolic representation of time series for better modeling and lower bounding distance measures. In: Proceedings of the 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, September 18–22, pp 545–552Google Scholar
- Kalpakis K, Gada D, Puttagunta V (2001) Distance measures for effective clustering of arima time-series. In: Proceedings of the 2001 IEEE International conference on data mining, San Jose, CA, November 29-December 2, pp 273–280Google Scholar
- Keogh E, Chakrabarti K, Pazzani M (2001a) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of ACM SIGMOD conference on management of data, Santa Barbara, May 21–24, pp 151–162Google Scholar
- Keogh E, Chakrabarti K, Pazzani M, Mehrotra S (2001b) Dimensionality reduction for fast similarity search in large time series databases. J Knowledge Inform Syst. 3:263–286MATHCrossRefGoogle Scholar
- Keogh E, Kasetty S (2002) On the need for time series data mining benchmarks: a survey and empirical demonstration. In: Proceedings of the 8th ACM SIGKDD International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, July 23–26, pp 102–111Google Scholar
- Keogh E, Lin J, Fu AW (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In: Proceedings of the 5th IEEE international conference on data mining, Houston, TX, November 27–30, pp 226–233Google Scholar
- Keogh E, Lonardi S, Chiu B (2002) Finding surprising patterns in a time series database in linear time and space. In: Proceedings of the 8th ACM SIGKDD International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, July 23–26, pp 550–556Google Scholar
- Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle. August 22–25, pp 206–215Google Scholar
- Keogh E, Pazzani M (1998) An enhanced representation of time series which allows fast and accurate classification, clustering and relevance feedback. In: Proceedings of the 4th International conference on knowledge discovery and data mining, New York, NY, August 27–31, pp 239–241Google Scholar
- Kumar N, Lolla N, Keogh E, Lonardi S, Ratanamahatana CA, Wei L (2005) Time series bitmaps: a practical visualization tool for working with large time series databases. In: Proceedings of the 2005 SIAM international conference on data mining, Newport Beach, CA, April 21–23, pp 531–535Google Scholar
- Larsen RJ, Marx ML (1986) An introduction to mathematical statistics and its applications, 2nd edn. Prentice Hall, Englewood, Cliffs, NJMATHGoogle Scholar
- Lin J, Keogh E (2006) Group SAX: extending the notion of contrast sets to time series and multimedia data. In: Proceedings of the 10th european conference on principles and practice of knowledge discovery in databases. Berlin, Germany, September 18–22, pp 284–296Google Scholar
- Lin J, Keogh E, Lonardi S (2005) Visualizing and discovering non-trivial patterns in large time series databases. Inform Visual 4:61–82CrossRefGoogle Scholar
- Lin J, Keogh E, Lonardi S, Chiu B (2003) A symbolic representation of time series, with implications for streaming algorithms, Workshop on Research Issues in Data Mining and Knowledge Discovery, the 8th ACM SIGMOD, San Diego, CAGoogle Scholar
- Lin J, Keogh E, Lonardi S, Lankford JP, Nystrom DM (2004) Visually mining and monitoring massive time series. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, WA, August 22–25, pp 460–469Google Scholar
- Lin J, Keogh E, Patel P, Lonardi S (2002) Finding motifs in time series, the 2nd Workshop on Temporal Data Mining, the 8th ACM International conference on knowledge discovery and data mining, Edmonton, Alberta, Canada, pp 53–68Google Scholar
- Lkhagva B, Suzuki Y, Kawagoe K (2006) New time series data representation ESAX for financial applications. In: Proceedings of the 22nd international conference on data engineering workshops, Atlanta, GA, April 3–8, pp 115Google Scholar
- Lonardi S (2001) Global detectors of unusual words: design implementation and applications to pattern discovery in biosequences. Department of Computer Sciences, Purdue UniversityGoogle Scholar
- McGovern A, Kruger A, Rosendahl D, Droegemeier K (2006) Open problem: dynamic relational models for improved hazardous weather prediction. In: Proceedings of ICML workshop on open problems in statistical relational learning, Pittsburgh, PA, June 29Google Scholar
- Mörchen F, Ultsch A (2005) Optimizing time series discretization for knowledge discovery. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, Chicago, IL, August 21–24, pp 660–665Google Scholar
- Murakami K, Yano Y, Doki S, Okuma S (2004) Behavior extraction from a series of observed robot motion. In: Proceedings of JSME conference on robotics and mechatronics. Nagoya, Japan, JuneGoogle Scholar
- NIST/SEMATECH e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook/Google Scholar
- Ohsaki M, Sato Y, Yokoi H, Yamaguchi T (2003) A rule discovery support system for sequential medical data, in the case study of a chronic hepatitis dataset. Discovery challenge workshop, the 14th European conference on machine learning/the 7th european conference on principles and practice of knowledge discovery in databases, Cavtat-Dubrovnik, CroatiaGoogle Scholar
- Pouget F, Urvoy-Keller G, Dacier M (2006) Time signature to detect multi-headed stealthy attack tools. In: Proceedings of the 18th annual first conference, Baltimore, MD, June 25–30Google Scholar
- Ratanamahatana CA, Keogh E, Bagnall AJ, Lonardi S (2005) A novel bit level time series representation with implications for similarity search and clustering. In: Proceedings of advances in knowledge discovery and data mining, 9th Pacific-Asia conference, Hanoi Vietnam, May 18–20, pp 771–777Google Scholar
- Reinert G, Schbath S, Waterman MS (2000) Probabilistic and statistical properties of words: an overview. J Comput Biol 7:1–46CrossRefGoogle Scholar
- Roddick, JF, Hornsby K, Spiliopoulou M (2001) An updated bibliography of temporal, spatial and spatio-temporal data mining research. Post-workshop proceedings of the international workshop on temporal, spatial and spatio-temporal data mining, Springer, Berlin, pp 147–163Google Scholar
- Shahabi C, Tian X, Zhao W (2000) TSA-tree: a wavelet-based approach to improve the efficiency of multi-level surprise and trend queries. In: Proceedings of the 12th international conference on scientific and statistical database management, Berlin, Germany, July 26–28, pp 55–68Google Scholar
- Silvent A, Carbay C, Carry, PY, Dojat M (2003) Data information and knowledge for medical scenario construction. In: Proceedings of the intelligent data analysis in medicine and pharmacology workshop, Protaras, Cyprus, October 19–22Google Scholar
- Silvent A, Dojat M, Garbay C (2004) Multi-level temporal abstraction for medical scenario construction. Int J Adapt Control Signal ProcessGoogle Scholar
- Staden R (1989) Methods for discovering novel motifs in nucleic acid sequences. Comput Appl Biosci 5:293–298Google Scholar
- Tanaka Y, Uehara K (2003) Discover motifs in multi dimensional time-series using the principal component analysis and the MDL principle. In: Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition, Leipzig, Germany, July 5–7, pp 252–265Google Scholar
- Tanaka Y, Uehara K (2004) Motif discovery algorithm from motion data. In: Proceedings of the 18th annual conference of the japanese society for artificial intelligence, Kanazawa, Japan, June 2–4Google Scholar
- Tompa M, Buhler J (2001) Finding motifs using random projections. In: Proceedings of the 5th International conference on computational molecular biology, Montreal, Canada, April 22–25, pp 67–74Google Scholar
- Vlachos M, Kollios G, Gunopulos G (2002) Discovering similar multidimensional trajectories. In: Proceedings of the 18th international conference on data engineering, San Jose, CA, February 26-Mar 1, pp 673–684Google Scholar
- Wei L, Keogh E, Xi X (2006) SAXually explicit images: finding unusual shapes. In: Proceedings of the 2006 IEEE international conference on data mining, Hong Kong, December 18–22Google Scholar
- Yi BK, Faloutsos C (2000) Fast time sequence indexing for arbitrary lp norms. In: Proceedings of the 26th international conference on very large databases, Cairo, Egypt, September 10–14, pp 385–394Google Scholar