Abstract
Much real data consists of more than one dimension, such as financial transactions (e.g., price × volume) and IP network flows (e.g., duration × numBytes), and captures relationships between the variables. For a single dimension, quantiles are intuitive and robust descriptors. Processing and analyzing multidimensional data, particularly in data warehouse or data streaming settings, requires similarly robust and informative statistical descriptors that go beyond one dimension. Applying quantile methods to summarize a multidimensional distribution along only singleton attributes ignores the rich dependence amongst the variables.
In this paper, we present new skyline-based statistical descriptors for capturing the distributions over pairs of dimensions. They generalize the notion of quantiles in the individual dimensions, and also incorporate properties of the joint distribution. We introduce φ-quantours and α-radials, which are skyline points over subsets of the data, and propose (φ, α)-quantiles, found from the union of these skylines, as statistical descriptors of two-dimensional distributions. We present efficient online algorithms for tracking (φ, α)-quantiles on two-dimensional streams using guaranteed small space. We identify the principal properties of the proposed descriptors and perform extensive experiments with synthetic and real IP traffic data to study the efficiency of our proposed algorithms.
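As background for the skyline terminology used above, the following is a minimal sketch of the plain two-dimensional skyline operator only. It is not the paper's φ-quantour, α-radial, or streaming algorithm, none of which are defined in this abstract. The function name `skyline` and the convention that larger values are preferred in both dimensions are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def skyline(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point.

    Assumption: p dominates q when p is at least as large as q in both
    dimensions and the two points differ. Other orientations (e.g.
    minimizing both dimensions) work the same way with the signs flipped.
    """
    # Visit points by decreasing x, breaking ties by decreasing y, so every
    # potential dominator of a point is seen before the point itself.
    ordered = sorted(set(points), key=lambda p: (-p[0], -p[1]))
    result: List[Point] = []
    best_y = float("-inf")
    for x, y in ordered:
        # A point survives only if its y strictly exceeds every y seen so
        # far; otherwise an earlier point with x at least as large dominates it.
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result


if __name__ == "__main__":
    data = [(1, 9), (3, 7), (2, 8), (3, 3), (5, 1), (4, 4), (2, 2)]
    print(skyline(data))
```

The sort makes this an O(n log n) offline computation over a stored point set; it does not address the small-space streaming setting that the paper targets.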
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cormode, G., Korn, F., Muthukrishnan, S., Srivastava, D. (2008). Summarizing Two-Dimensional Data with Skyline-Based Statistical Descriptors. In: Ludäscher, B., Mamoulis, N. (eds) Scientific and Statistical Database Management. SSDBM 2008. Lecture Notes in Computer Science, vol 5069. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69497-7_6
DOI: https://doi.org/10.1007/978-3-540-69497-7_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69476-2
Online ISBN: 978-3-540-69497-7
eBook Packages: Computer Science, Computer Science (R0)