Skip to main content

Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams

  • Conference paper
Advances in Intelligent Data Analysis VIII (IDA 2009)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5772))

Included in the following conference series:

Abstract

Data streams are dynamic, with frequent distributional changes. In this paper, we propose a statistical approach to detecting distributional shifts in multi-dimensional data streams. We use relative entropy, also known as the Kullback-Leibler distance, to measure the statistical distance between two distributions. In the context of a multi-dimensional data stream, the distributions are generated by data from two sliding windows. We maintain a sample of the data from the stream inside the windows to build the distributions.

Our algorithm is streaming, nonparametric, and requires no distributional or model assumptions. It employs the statistical theory of hypothesis testing and bootstrapping to determine whether the distributions are statistically different. We provide a full suite of experiments on synthetic data to validate the method and demonstrate its effectiveness on data from real-life applications.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Aggarwal, C.C.: A framework for diagnosing changes in evolving data streams. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 575–586 (2003)

    Google Scholar 

  2. Babcock, B., Datar, M., Motwani, R.: Sampling from a moving window over streaming data. In: SODA, pp. 633–634 (2002)

    Google Scholar 

  3. Chawathe, S.S., Abiteboul, S., Widom, J.: Representing and querying changes in semistructured data. In: ICDE 1998, pp. 4–13 (1998)

    Google Scholar 

  4. Cover, T.M., Thomas, J.A.: Elements of Information Theory. John Wiley and Sons, Inc., Chichester (1991)

    Book  MATH  Google Scholar 

  5. Csiszár, I., Körner, J.: Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, London (1981)

    MATH  Google Scholar 

  6. Dasu, T., Krishnan, S., Venkatasubramanian, S., Yi, K.: An information-theoretic approach to detecting changes in multi-dimensional data streams. In: Interface 2006 (2006)

    Google Scholar 

  7. Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. Chapman and Hall, Boca Raton (1993)

    Book  MATH  Google Scholar 

  8. Fleiss, J.L., Levin, B., Paik, M.: Statistical Methods for Rates and Proportions, 3rd edn. John Wiley and Sons, New York (2003)

    Book  MATH  Google Scholar 

  9. Ganti, V., Gehrke, J., Ramakrishnan, R., Loh, W.-Y.: A framework for measuring differences in data characteristics. pp. 126–137 (1999)

    Google Scholar 

  10. Gutman, M.: Asymptotically optimal classification for multiple tests with empirically observed statistics. IEEE Trans. Inf. Theory 35, 401–408 (1989)

    Article  MathSciNet  MATH  Google Scholar 

  11. Hulten, G., Spencer, L., Domingos, P.: Mining time-changing data streams. In: KDD, pp. 97–106 (2001)

    Google Scholar 

  12. Johnson, D., Gruner, C.: Information-theoretic analysis of neural coding. Journal of Computational Neuroscience 10, 47–69 (2001)

    Article  Google Scholar 

  13. Keogh, E., Lonardi, S., Chiu, B.Y.: Finding surprising patterns in a time series database in linear time and space. In: KDD, pp. 550–556 (2002)

    Google Scholar 

  14. Kifer, D., Ben-David, S., Gehrke, J.: Detecting changes in data streams. In: Proceedings of the 30th International Conference on Very Large Databases, pp. 180–191 (2004)

    Google Scholar 

  15. Kleinberg, J.: Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7(4), 373–397 (2003)

    Article  MathSciNet  Google Scholar 

  16. Krichevsky, R.E., Trofimov, V.K.: The performance of universal encoding. IEEE Trans. Inf. Theory 27, 199–207 (1981)

    Article  MathSciNet  MATH  Google Scholar 

  17. Pereira, F., Tishby, N., Lee, L.: Distributional clustering of English words. In: 31st Annual Meeting of the ACL, pp. 183–190 (1993)

    Google Scholar 

  18. Pietra, S.D., Pietra, V.D., Lafferty, J.: Inducing features of random fields. IEEE Trans. Pattern Analysis and Machine Intelligence 19, 380–393 (1995)

    Article  Google Scholar 

  19. Shibata, R.: Bootstrap estimate of Kullback-Liebler information for model selection. Statistica Sinica 7, 375–394 (1997)

    MathSciNet  MATH  Google Scholar 

  20. Song, X., Wu, M., Jermaine, C., Ranka, S.: Statistical change detection for multi-dimensional data. In: ACM SIGKDD 2007, pp. 667–676 (2007)

    Google Scholar 

  21. Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., Gunopulos, D.: Online outlier detection in sensor data using non-parametric models. In: VLDB 2006, pp. 187–198 (2006)

    Google Scholar 

  22. Vitter, J.S.: Random sampling with a reservoir. ACM Transactions on Mathematical Software 11, 37–57 (1985)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dasu, T., Krishnan, S., Lin, D., Venkatasubramanian, S., Yi, K. (2009). Change (Detection) You Can Believe in: Finding Distributional Shifts in Data Streams . In: Adams, N.M., Robardet, C., Siebes, A., Boulicaut, JF. (eds) Advances in Intelligent Data Analysis VIII. IDA 2009. Lecture Notes in Computer Science, vol 5772. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03915-7_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-03915-7_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-03914-0

  • Online ISBN: 978-3-642-03915-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics