Efficient Algorithms for Segmentation of Item-Set Time Series

Abstract

We propose a special type of time series, which we call an item-set time series, to facilitate the temporal analysis of software version histories, email logs, stock market data, etc. In an item-set time series, each observed data value is a set of discrete items. We formalize the concept of an item-set time series and present efficient algorithms for segmenting a given item-set time series. Segmentation of a time series partitions the time series into a sequence of segments where each segment is constructed by combining consecutive time points of the time series. Each segment is associated with an item set that is computed from the item sets of the time points in that segment, using a function which we call a measure function. We then define a concept called the segment difference, which measures the difference between the item set of a segment and the item sets of the time points in that segment. The segment difference values are required to construct an optimal segmentation of the time series. We describe novel and efficient algorithms to compute segment difference values for each of the measure functions described in the paper. We outline a dynamic programming based scheme to construct an optimal segmentation of the given item-set time series. We use the item-set time series segmentation techniques to analyze the temporal content of three different data sets—Enron email, stock market data, and a synthetic data set. The experimental results show that an optimal segmentation of item-set time series data captures much more temporal content than a segmentation constructed based on the number of time points in each segment, without examining the item set data at the time points, and can be used to analyze different types of temporal data.

Author information

Corresponding author

Correspondence to Parvathi Chundi.

Copyright information

© 2009 Springer Science + Business Media B.V.

About this chapter

Cite this chapter

Chundi, P., Rosenkrantz, D.J. (2009). Efficient Algorithms for Segmentation of Item-Set Time Series. In: Ravi, S.S., Shukla, S.K. (eds) Fundamental Problems in Computing. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-9688-4_10

  • DOI: https://doi.org/10.1007/978-1-4020-9688-4_10

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-9687-7

  • Online ISBN: 978-1-4020-9688-4

  • eBook Packages: Computer Science, Computer Science (R0)
