Concept Drift Based Multi-dimensional Data Streams Sampling Method
A data summary can greatly reduce the time and space complexity of an algorithm, and summary construction is a research hotspot in data stream mining. Data streams are characterized by continuous arrival, high speed, and large scale, and they cannot be stored completely in memory. A summary is therefore often maintained in memory to answer database queries or data mining tasks approximately. Sampling is a commonly used technique for constructing data stream summaries. Traditional simple random sampling algorithms do not account for concept drift, that is, data distributions that change over time. Sampling a summary that preserves the data distribution of multi-dimensional data streams with concept drift is therefore a challenging task. This study proposes a sampling algorithm that keeps the distribution of the sample consistent with that of a data stream under concept drift. First, the data stream cells in the reference window are counted to estimate the data distribution, and probability sampling is performed on the basis of this distribution. Second, a sliding window is used to continuously detect whether the data distribution has changed. If the distribution has not changed, the original sample is maintained. Otherwise, the data distribution in the statistical window is recomputed to form new sampling probabilities. The proposed algorithm thus keeps the data distribution of the summary consistent with the population distribution over time. We compare our algorithm with state-of-the-art algorithms on synthetic and real data sets. Experimental results demonstrate the effectiveness of our algorithm.
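The two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how cells are formed, how a distribution change is measured, or which thresholds are used. The equal-width grid (`width`), the total variation distance, the `threshold` value, and all function names below are assumptions introduced only for illustration.

```python
import random
from collections import Counter, deque

def cell_of(point, width):
    """Map a point to a grid cell (assumption: equal-width grid)."""
    return tuple(int(x // width) for x in point)

def cell_distribution(points, width):
    """Relative frequency of each occupied cell in a window."""
    counts = Counter(cell_of(p, width) for p in points)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def total_variation(p, q):
    """Total variation distance (assumed drift measure, not from the paper)."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)

def sample_by_distribution(points, dist, width, k, rng):
    """Draw roughly k points so each cell's share of the sample follows dist."""
    by_cell = {}
    for p in points:
        by_cell.setdefault(cell_of(p, width), []).append(p)
    sample = []
    for cell, prob in dist.items():
        quota = min(len(by_cell[cell]), round(prob * k))
        sample.extend(rng.sample(by_cell[cell], quota))
    return sample

def drift_aware_sample(stream, window=200, width=1.0, k=50,
                       threshold=0.3, seed=0):
    """Return (sample, rebuilds) after consuming the whole stream."""
    rng = random.Random(seed)
    reference = []                  # points of the reference window
    current = deque(maxlen=window)  # sliding window over recent points
    ref_dist, sample, rebuilds = None, [], 0
    for point in stream:
        if ref_dist is None:
            # Step 1: fill the reference window, then sample from it.
            reference.append(point)
            if len(reference) == window:
                ref_dist = cell_distribution(reference, width)
                sample = sample_by_distribution(
                    reference, ref_dist, width, k, rng)
            continue
        # Step 2: compare the sliding window with the reference window.
        current.append(point)
        if len(current) == window:
            cur_dist = cell_distribution(current, width)
            if total_variation(ref_dist, cur_dist) > threshold:
                # Drift detected: recompute the distribution and resample.
                reference, ref_dist = list(current), cur_dist
                sample = sample_by_distribution(
                    reference, ref_dist, width, k, rng)
                current.clear()
                rebuilds += 1
    return sample, rebuilds
```

For example, `drift_aware_sample(points)` on a list of coordinate tuples returns the current sample together with the number of times drift triggered a resample. As long as the sliding window stays close to the reference distribution, the sample is left unchanged, which mirrors the "maintain the original sample" branch of the description.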
Keywords: Data stream clustering, Sampling, Summary
This work is supported by the National Key R&D Program of China (2017YFB0702600, 2017YFB0702601), the National Natural Science Foundation of China (61432008, U1435214, 61503178) and Yili Normal University Project (No. 2016WXDZD001).