Abstract
Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still incur very high communication overhead when handled by naive, centralized algorithms.
We present a novel geometric approach by which an arbitrary global monitoring task can be split into a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner.
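As a rough illustration of what such a local constraint might look like, the sketch below has each node test a ball whose diameter runs from a shared estimate vector to the node's own current vector. The abstract does not spell out the constraint, so this form is an assumption, as are all function names; the Euclidean norm is used as a stand-in monitored function only because its range over a ball has a closed form. A node would stay silent while the check holds and communicate only when it fails.

```python
import math

def ball_from_diameter(estimate, local_vector):
    """Ball whose diameter is the segment from the shared estimate to the
    node's current local vector (assumed form of the local region)."""
    center = [(a + b) / 2 for a, b in zip(estimate, local_vector)]
    radius = math.dist(estimate, local_vector) / 2
    return center, radius

def norm_range_on_ball(center, radius):
    """Min and max of the stand-in function f(x) = ||x|| over the ball."""
    c = math.hypot(*center)
    return max(0.0, c - radius), c + radius

def local_constraint_holds(estimate, local_vector, threshold):
    """True if f stays on one side of the threshold everywhere in the ball,
    so the node's update cannot by itself change the monitoring outcome."""
    center, radius = ball_from_diameter(estimate, local_vector)
    lo, hi = norm_range_on_ball(center, radius)
    return hi <= threshold or lo > threshold

# Hypothetical values: a small drift is filtered locally, a large one is not.
estimate = [1.0, 0.0]
print(local_constraint_holds(estimate, [1.2, 0.1], threshold=2.0))
print(local_constraint_holds(estimate, [3.0, 0.0], threshold=2.0))
```

For a general threshold function the range over the ball has no closed form, so the check would need a bound or a numerical search in place of `norm_range_on_ball`.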
We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms.
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Sharfman, I., Schuster, A., Keren, D. (2010). A Geometric Approach to Monitoring Threshold Functions over Distributed Data Streams. In: May, M., Saitta, L. (eds) Ubiquitous Knowledge Discovery. Lecture Notes in Computer Science, vol 6202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16392-0_10
DOI: https://doi.org/10.1007/978-3-642-16392-0_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16391-3
Online ISBN: 978-3-642-16392-0
eBook Packages: Computer Science (R0)