
Summarizing Two-Dimensional Data with Skyline-Based Statistical Descriptors

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5069)

Abstract

Much real data consists of more than one dimension, such as financial transactions (e.g., price × volume) and IP network flows (e.g., duration × numBytes), and captures relationships between the variables. For a single dimension, quantiles are intuitive and robust descriptors. Processing and analyzing such data, particularly in data warehouse or data streaming settings, requires similarly robust and informative statistical descriptors that go beyond one dimension. Applying quantile methods to summarize a multidimensional distribution along only singleton attributes ignores the rich dependence amongst the variables.

In this paper, we present new skyline-based statistical descriptors for capturing the distributions over pairs of dimensions. They generalize the notion of quantiles in the individual dimensions, and also incorporate properties of the joint distribution. We introduce φ-quantours and α-radials, which are skyline points over subsets of the data, and propose (φ, α)-quantiles, found from the union of these skylines, as statistical descriptors of two-dimensional distributions. We present efficient online algorithms for tracking (φ, α)-quantiles on two-dimensional streams using guaranteed small space. We identify the principal properties of the proposed descriptors and perform extensive experiments with synthetic and real IP traffic data to study the efficiency of our proposed algorithms.
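
For readers unfamiliar with the skyline operator that these descriptors build on, the following is a minimal sketch of a two-dimensional skyline computation. It is not the paper's algorithm: the abstract does not define how φ-quantours or α-radials select their subsets, so the sketch only shows the generic building block. The function name `skyline` and the dominance convention (larger is better in both coordinates) are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def skyline(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    Assumed convention (not taken from the paper): p dominates q when
    p is at least as large as q in both coordinates and p != q.
    """
    # Scan by x descending, breaking ties by y descending; duplicates are dropped.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    result: List[Point] = []
    best_y = float("-inf")
    for x, y in ordered:
        # Every point scanned so far has x >= this x, so this point is
        # undominated only if its y exceeds every y seen so far.
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result


if __name__ == "__main__":
    data = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
    print(skyline(data))
```

Applying the same function to a subset of the data yields the skyline of that subset; how the subsets are chosen for each descriptor is specified in the paper itself, not here.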



Editor information

Bertram Ludäscher, Nikos Mamoulis

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D. (2008). Summarizing Two-Dimensional Data with Skyline-Based Statistical Descriptors. In: Ludäscher, B., Mamoulis, N. (eds) Scientific and Statistical Database Management. SSDBM 2008. Lecture Notes in Computer Science, vol 5069. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69497-7_6

  • DOI: https://doi.org/10.1007/978-3-540-69497-7_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-69476-2

  • Online ISBN: 978-3-540-69497-7

  • eBook Packages: Computer Science, Computer Science (R0)
