Finding Segmentations of Sequences

  • Ella Bingham


We describe a collection of approaches to inductive querying systems for data that contain segmental structure. The main focus of this chapter is on work done in the Helsinki area in 2004-2008. Segmentation is a general data mining technique for summarizing and analyzing sequential data. We first introduce the basic problem setting and notation. We then briefly present an optimal way to accomplish the segmentation when no constraints are added. The challenge, however, lies in adding constraints that relate the segments to each other and make the end result easier for a human to interpret, make the computational task simpler, or both. We describe various approaches to segmentation, ranging from efficient algorithms to added constraints and modifications of the problem. We also discuss topics beyond the basic task of segmentation, such as whether the output of a segmentation algorithm is meaningful, and touch upon some applications.
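
The abstract does not spell out the optimal method. As a minimal sketch, assuming a one-dimensional numeric sequence, a piecewise-constant model (each segment represented by its mean) and squared error as the cost, the unconstrained problem can be solved exactly by dynamic programming over segment boundaries. The function name `optimal_segmentation` and its return format are illustrative choices, not taken from the chapter.

```python
from itertools import accumulate


def optimal_segmentation(seq, k):
    """Split seq into k contiguous segments minimizing total squared error.

    Assumption: each segment is summarized by its mean (piecewise-constant
    model). Returns (total_error, starts), where starts lists the start
    index of each segment.
    """
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(seq)")

    # Prefix sums let us compute the squared error of any segment in O(1).
    s1 = [0.0] + list(accumulate(float(x) for x in seq))
    s2 = [0.0] + list(accumulate(float(x) ** 2 for x in seq))

    def seg_error(i, j):
        # Squared error of seq[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[m][j]: smallest error for covering seq[:j] with exactly m segments.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0

    for m in range(1, k + 1):
        for j in range(m, n + 1):
            # The last segment is seq[i:j]; the first i points use m-1 segments.
            for i in range(m - 1, j):
                c = cost[m - 1][i] + seg_error(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    back[m][j] = i

    # Follow the back-pointers to recover the segment start indices.
    starts, j = [], n
    for m in range(k, 0, -1):
        j = back[m][j]
        starts.append(j)
    starts.reverse()
    return cost[k][n], starts


if __name__ == "__main__":
    error, starts = optimal_segmentation([1, 1, 1, 5, 5, 5, 9, 9], k=3)
    print(error, starts)
```

This sketch runs in time proportional to n² · k for a sequence of length n, which is why faster heuristics and approximation algorithms become attractive for long sequences.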


Keywords: Bayesian Information Criterion, Segmentation Algorithm, Segmentation Problem, Optimal Segmentation, Fast Heuristic
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  1. Helsinki Institute for Information Technology, University of Helsinki and Aalto University School of Science and Technology, Helsinki, Finland
