An Hybrid Data Stream Summarizing Approach by Sampling and Clustering

Gabsi, Nesrine; Clérot, Fabrice; Hébrail, Georges

doi:10.1007/978-3-642-00580-0_11

Part of the book series: Studies in Computational Intelligence (SCI, volume 292)

Abstract

Computer systems generate large amounts of data that are very expensive, or even impossible, to store in terms of space and time. At the same time, many applications need to keep a historical view of such data in order to provide historical aggregated information, perform data mining tasks, or detect anomalous behavior in computer systems. One solution is to treat the data as streams that are processed on the fly to build historical summaries. Many data summarizing techniques have already been developed, such as sampling, clustering, and histograms, and some of them have been extended to apply directly to data streams. This chapter presents a new approach to building such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantage of the strengths of each algorithm and avoids their drawbacks. Experiments on both real and synthetic data show that the new approach gives better results than either of the two algorithms used alone.
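To make the idea of summarizing a stream by sampling concrete, the following is a minimal sketch of classic reservoir sampling, which keeps a fixed-size uniform sample of a stream in a single pass. It is a generic illustration only: it is not StreamSamp, CluStream, or the hybrid approach of this chapter, and the function name `reservoir_sample` and its parameters are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Keep a uniform random sample of at most k items from a stream.

    Hypothetical helper for illustration; each item is seen exactly once
    and memory use is bounded by k, whatever the length of the stream.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill phase: the first k items are kept unconditionally.
            reservoir.append(item)
        else:
            # Replacement phase: item i is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # randint bounds are inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(100_000), k=10, rng=random.Random(42))
    print(len(sample), "items kept from a stream of 100000")
```

A sample of this kind has a fixed memory footprint but gives every past item the same chance of being kept, so it does not favor recent data on its own.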

Author information

Authors and Affiliations

Institut TELECOM, TELECOM ParisTech, 46 Rue Barrault, 75013, Paris
Nesrine Gabsi
France Télécom RD, 2, avenue P.Marzin, 22307, Lannion
Nesrine Gabsi & Fabrice Clérot
Institut TELECOM, TELECOM ParisTech, Partially Supported by ANR (MIDAS Project ANR-07-MDO-008), 46 Rue Barrault, 75013, Paris
Georges Hébrail

Editor information

Editors and Affiliations

Polytechnic School of Nantes University, Nantes, France
Fabrice Guillet & Henri Briand
Université de Genève, Genève, Switzerland
Gilbert Ritschard
Université Lumière Lyon 2, Bron, France
Djamel Abdelkader Zighed

About this chapter

Cite this chapter

Gabsi, N., Clérot, F., Hébrail, G. (2010). An Hybrid Data Stream Summarizing Approach by Sampling and Clustering. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_11

DOI: https://doi.org/10.1007/978-3-642-00580-0_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00579-4
Online ISBN: 978-3-642-00580-0
eBook Packages: Engineering, Engineering (R0)