An Hybrid Data Stream Summarizing Approach by Sampling and Clustering

  • Chapter

Part of the book series: Studies in Computational Intelligence ((SCI,volume 292))

Abstract

Computer systems generate large amounts of data that are very expensive, sometimes impossible, to store in terms of space and time. Moreover, many applications need to keep a historical view of such data in order to provide historical aggregated information, perform data mining tasks, or detect anomalous behavior in computer systems. One solution is to treat the data as streams that are processed on the fly in order to build historical summaries. Many data summarizing techniques have already been developed, such as sampling, clustering, and histograms, and some of them have been extended to apply directly to data streams. This chapter presents a new approach to building such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantage of the strengths of each algorithm and avoids their drawbacks. Experiments on both real and synthetic data are presented. They show that the new approach gives better results than either of the two algorithms used alone.
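To make the idea of summarizing a stream by sampling concrete, here is a minimal sketch of classic reservoir sampling, which keeps a fixed-size uniform sample of a stream whose length is not known in advance. It is a generic illustration only: it is not StreamSamp, CluStream, or the hybrid approach of this chapter, and the function name `reservoir_sample` and its parameters are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of at most k items from an iterable.

    Illustrative sketch of classic reservoir sampling; the name and
    signature are assumptions, not taken from the chapter.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: summarize a stream of 1000 readings with a 10-item sample.
sample = reservoir_sample(range(1000), 10, random.Random(0))
```

The memory used stays bounded by `k` however long the stream runs, which is the property that makes sampling attractive as a stream summary. A plain reservoir treats old and recent items alike, so it does not by itself give the historical, time-aware view that the abstract asks for.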

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gabsi, N., Clérot, F., Hébrail, G. (2010). An Hybrid Data Stream Summarizing Approach by Sampling and Clustering. In: Guillet, F., Ritschard, G., Zighed, D.A., Briand, H. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00580-0_11

  • DOI: https://doi.org/10.1007/978-3-642-00580-0_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00579-4

  • Online ISBN: 978-3-642-00580-0

  • eBook Packages: Engineering, Engineering (R0)
