Concept Drift Based Multi-dimensional Data Streams Sampling Method

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11439))

Abstract

A summary can greatly reduce the time and space complexity of an algorithm, and data summarization is an active research topic in data stream mining. Data streams arrive continuously and rapidly, are large in scale, and cannot be stored completely in memory. A summary is therefore often maintained in memory to answer database queries or data mining tasks approximately. Sampling is a commonly used method for constructing data stream summaries. Traditional simple random sampling algorithms do not consider concept drift, that is, changes in the data distribution over time. Sampling a summary that follows the data distribution of a multi-dimensional data stream with concept drift is therefore a challenging task. This study proposes a sampling algorithm that keeps the data distribution of the sample consistent with that of a data stream with concept drift. First, probability statistics are computed over the data stream cells in the reference window to obtain the data distribution, and probability sampling is performed on the basis of this distribution. Second, a sliding window is used to continuously detect whether the data distribution has changed. If the data distribution has not changed, the original sampling data are maintained. Otherwise, the data distribution in the statistical window is recomputed to form a new sampling probability. The proposed algorithm ensures that the data distribution in the data summary remains consistent with the population distribution. We compare our algorithm with state-of-the-art algorithms on synthetic and real data sets. Experimental results demonstrate the effectiveness of our algorithm.
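The two steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how cells are defined, which test detects a distribution change, or how the sampling probability is applied. The sketch assumes fixed-width grid cells, a total variation distance compared against a user-chosen threshold, and sampling a cell in proportion to its estimated probability. All function and parameter names (`cell_of`, `cell_distribution`, `total_variation`, `sample_by_distribution`, `summarize`, `width`, `threshold`) are hypothetical.

```python
import random
from collections import Counter, defaultdict, deque


def cell_of(point, width):
    """Map a point to the index tuple of the grid cell that contains it (assumed grid)."""
    return tuple(int(x // width) for x in point)


def cell_distribution(window, width):
    """Estimate the probability of each grid cell from the points in a window."""
    counts = Counter(cell_of(p, width) for p in window)
    total = len(window)
    return {cell: n / total for cell, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two cell distributions (assumed drift measure)."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


def sample_by_distribution(window, width, k, rng):
    """Draw k points: pick a cell with its estimated probability, then a point inside it."""
    by_cell = defaultdict(list)
    for p in window:
        by_cell[cell_of(p, width)].append(p)
    dist = cell_distribution(window, width)
    cells = list(dist)
    weights = [dist[c] for c in cells]
    chosen = rng.choices(cells, weights=weights, k=k)
    return [rng.choice(by_cell[c]) for c in chosen]


def summarize(stream, window_size, width, k, threshold, seed=0):
    """Keep a k-point sample and resample only when the sliding window drifts."""
    rng = random.Random(seed)
    reference = []                       # points of the reference window
    current = deque(maxlen=window_size)  # sliding window over recent points
    ref_dist, sample, drifts = None, [], 0
    for point in stream:
        if ref_dist is None:
            # Step 1: fill the reference window, estimate its distribution, sample.
            reference.append(point)
            if len(reference) == window_size:
                ref_dist = cell_distribution(reference, width)
                sample = sample_by_distribution(reference, width, k, rng)
            continue
        # Step 2: slide the window and compare it with the reference distribution.
        current.append(point)
        if len(current) == window_size:
            cur_dist = cell_distribution(current, width)
            if total_variation(ref_dist, cur_dist) > threshold:
                # Drift: the current window becomes the new reference.
                reference = list(current)
                ref_dist = cur_dist
                sample = sample_by_distribution(reference, width, k, rng)
                current.clear()
                drifts += 1
    return sample, drifts
```

A call such as `summarize(points, window_size=50, width=1.0, k=10, threshold=0.25)` returns the current sample and the number of detected changes. While the sliding window stays within the threshold, the sample is left untouched, which mirrors the "original sampling data are maintained" case in the abstract.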

Notes

  1. http://www.cis.fordham.edu/wisdm/dataset.php.

  2. http://www.cis.fordham.edu/wisdm/dataset.php.

  3. http://kdd.ics.uci.edu/databases/el_nino/el_nino.html.

  4. http://kdd.ics.uci.edu/databases/covertype/covertype.html.

  5. http://www.pamap.org/demo.html.

Acknowledgments

This work is supported by the National Key R&D Program of China (2017YFB0702600, 2017YFB0702601), the National Natural Science Foundation of China (61432008, U1435214, 61503178) and Yili Normal University Project (No. 2016WXDZD001).

Corresponding author

Correspondence to Yang Gao.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Lin, L., Qi, X., Zhu, Z., Gao, Y. (2019). Concept Drift Based Multi-dimensional Data Streams Sampling Method. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11439. Springer, Cham. https://doi.org/10.1007/978-3-030-16148-4_26

  • DOI: https://doi.org/10.1007/978-3-030-16148-4_26

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16147-7

  • Online ISBN: 978-3-030-16148-4

  • eBook Packages: Computer Science, Computer Science (R0)
