Encyclopedia of Database Systems

Living Edition
Editors: Ling Liu, M. Tamer Özsu

AMS Sketch

  • Alin Dobra
Living reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7993-3_16-2

Synonyms

Definition

AMS sketches are randomized summaries of the data that can be used to compute aggregates such as the second frequency moment (the self-join size) and the sizes of joins. AMS sketches can be viewed as random projections of the data in the frequency domain onto ±1 pseudo-random vectors. The key property of AMS sketches is that the product of the projections, on the same random vector, of the join-attribute frequencies of two relations is an unbiased estimate of the size of the join of the relations. While a single AMS sketch is inaccurate, multiple such sketches can be computed and combined using averages and medians to obtain an estimate of any desired precision.

Historical Background

AMS sketches were introduced in 1996 by Noga Alon, Yossi Matias, and Mario Szegedy as part of a suite of randomized algorithms for approximate computation of frequency moments. The same authors, together with Phillip Gibbons, extended the second frequency moment application of AMS sketches to the computation of the size of the join of two relations, a more relevant database application. The initial work on AMS sketches fostered a large amount of subsequent work on data streaming algorithms, including generalizations and extensions of AMS sketches. Alon, Matias, and Szegedy received the Gödel Prize in 2005 for their work on AMS sketches.

Foundations

AMS sketches were initially introduced to compute the second frequency moment. Since the reader might be more familiar with database terminology, the problem of estimating the size of the join of two relations is considered here instead. Notice that the self-join size of a relation coincides with its second frequency moment; thus the treatment here is slightly more general but not more complicated.

Problem Setup

To set up the problem, assume access is provided to two relations F and G, each with a single attribute a. For convenience, denote by \( f_i \) and \( g_i \) the frequency of value i of attribute a in relations F and G, respectively. Elements of F and G are assumed to arrive one by one (i.e., are streamed), and the result of the query \( \mathrm{COUNT}(F \bowtie_a G) \) needs to be computed or estimated. If the frequency vectors \( \mathbf{f} \) and \( \mathbf{g} \) can be maintained, then the result of the query can be computed exactly, as in the following example:
$$ \begin{aligned}\mathrm{COUNT}\left(F \bowtie_a G\right) &= \mathbf{f}\mathbf{g}^T \\ &= \left[\begin{array}{ccc} 3 & 1 & 2 \end{array}\right] \left[\begin{array}{ccc} 3 & 0 & 2 \end{array}\right]^T \\ &= 3\cdot 3 + 1\cdot 0 + 2\cdot 2 \\ &= 13 \end{aligned} $$

Observe that the size of join can be written as the dot product of the frequency vectors of the two relations. Expressing the size of the join in terms of frequencies of the join attribute is key for AMS sketch based approximation.
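The dot-product view can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `join_size` is hypothetical, and the tuple streams are the ones used in the running example of this entry (attribute values 1, 1, 2, 3, 1, 3 for F and 3, 1, 3, 1, 1 for G).

```python
from collections import Counter

def join_size(stream_f, stream_g):
    """Exact COUNT(F join G) on a single attribute via frequency vectors."""
    f = Counter(stream_f)  # f[i] = number of tuples in F with a = i
    g = Counter(stream_g)  # g[i] = number of tuples in G with a = i
    # Dot product over the values present in F; values missing from G count as 0.
    return sum(count * g[value] for value, count in f.items())

# Attribute values of the tuples in the running example.
F = [1, 1, 2, 3, 1, 3]   # f = [3, 1, 2]
G = [3, 1, 3, 1, 1]      # g = [3, 0, 2]

print(join_size(F, G))   # 3*3 + 1*0 + 2*2 = 13
print(join_size(F, F))   # self-join size of F: 9 + 1 + 4 = 14
```

Note that this exact computation stores one counter per distinct attribute value, which is precisely the linear space that sketching tries to avoid.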

Main Idea

Assume now that \( \mathrm{COUNT}(F \bowtie_a G) \) needs to be computed, but only space that is less than linear in the size of the frequency vectors is available. As it turns out, exact computation is not possible with less space (in an asymptotic sense), but approximate computation is possible. AMS sketches allow the size of the join to be approximated using sublinear space.

The main idea behind AMS sketches is to summarize the entire frequency table by projecting it on a random vector. The value thus obtained will be referred to as an elementary sketch. The two elementary sketches, one for each relation, are then used to recover the result of the query approximately. Interestingly, a random vector \( \xi = [\xi_1 \ldots \xi_n] \) of ±1 values suffices to obtain projections with the desired properties. For simplicity, random vectors for which \( \forall i, E[\xi_i] = 0 \) are preferred. With this:
  • Sketch of F: \( X_F = \mathbf{f}\xi^T \)

  • Sketch of G: \( X_G = \mathbf{g}\xi^T \)

  • \( X = X_F X_G \) estimates \( \mathrm{COUNT}(F \bowtie_a G) \) since
$$ E[X] = E\left[\mathbf{f}\xi^T\xi\mathbf{g}^T\right] = \mathbf{f}E\left[\xi^T\xi\right]\mathbf{g}^T = \mathbf{f}I\mathbf{g}^T = \mathbf{f}\mathbf{g}^T $$
    if \( E[\xi^T\xi] = I \). To ensure this property, distinct elements of ξ must be pairwise independent, i.e., \( \forall i, \xi_i^2 = 1 \) and \( \forall i \ne i', E[\xi_i \xi_{i'}] = 0 \).
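The unbiasedness argument can also be unpacked component by component. The following expansion is only a restatement of the matrix identity in index notation, using \( \xi_i^2 = 1 \) and \( E[\xi_i \xi_j] = 0 \) for \( i \ne j \):

$$ \begin{aligned} E[X] &= E\left[\left(\sum_i f_i \xi_i\right)\left(\sum_j g_j \xi_j\right)\right] \\ &= \sum_i \sum_j f_i g_j E\left[\xi_i \xi_j\right] \\ &= \sum_i f_i g_i E\left[\xi_i^2\right] + \sum_{i \ne j} f_i g_j E\left[\xi_i \xi_j\right] \\ &= \sum_i f_i g_i = \mathbf{f}\mathbf{g}^T \end{aligned} $$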

For the particular random vector \( \xi {=}[{\xi}_1 \; {\xi}_2 \; {\xi}_3] =\left[\begin{array}{ccc}\hfill -1\hfill & \hfill +1\hfill & \hfill -1\hfill \end{array}\right] \), the value of the elementary sketches and the overall estimate will be:
$$ \begin{array}{l}{X}_F=\mathbf{f}{\xi}^T=-4\\ {}{X}_G=\mathbf{g}{\xi}^T=-5\\ {}X={X}_F{X}_G=\left(-4\right)\left(-5\right)=20\approx 13\end{array} $$
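The worked example can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the helper names `elementary_sketch` and `estimate` are hypothetical, and the expectation is computed by brute force over all eight equally likely sign vectors, which is feasible only because the domain has three values.

```python
from itertools import product

f = [3, 1, 2]   # frequency vector of F from the example
g = [3, 0, 2]   # frequency vector of G from the example

def elementary_sketch(freq, xi):
    """Project a frequency vector onto a +/-1 vector xi."""
    return sum(c * s for c, s in zip(freq, xi))

def estimate(xi):
    """Join-size estimate X = X_F * X_G for one vector xi."""
    return elementary_sketch(f, xi) * elementary_sketch(g, xi)

xi = (-1, +1, -1)
print(elementary_sketch(f, xi), elementary_sketch(g, xi), estimate(xi))  # -4 -5 20

# Averaging X over all 2^3 equally likely sign vectors gives E[X] exactly.
all_xi = list(product((-1, +1), repeat=len(f)))
expected = sum(estimate(x) for x in all_xi) / len(all_xi)
print(expected)  # 13.0
```

A single draw of ξ can be far from the true join size (20 versus 13 here), while the average over all draws recovers it exactly.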
The error of the estimate X is governed by its variance, which can be shown to satisfy:
$$ \mathrm{Var}\left(X\right)\le 2\,\mathbf{f}\mathbf{f}^T\,\mathbf{g}\mathbf{g}^T=2\;\mathrm{SJ}\left(F\right)\;\mathrm{SJ}\left(G\right) $$

as long as the random vector ξ is 4-wise independent, i.e., for all distinct indices \( i_1, i_2, i_3, i_4 \), \( E\left[\xi_{i_1}\xi_{i_2}\right]=0 \) and \( E\left[\xi_{i_1}\xi_{i_2}\xi_{i_3}\xi_{i_4}\right]=0 \). Here SJ(F) and SJ(G) denote the self-join sizes of F and G.
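For the small running example, the variance bound can be checked numerically. The sketch below assumes fully independent, uniformly random ±1 entries (which are in particular 4-wise independent) and enumerates all sign vectors; the helper name `dot` is hypothetical.

```python
from itertools import product

f = [3, 1, 2]   # frequency vector of F from the example
g = [3, 0, 2]   # frequency vector of G from the example

def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# All 2^3 sign vectors, each taken with probability 1/8 (fully independent entries).
samples = [dot(f, xi) * dot(g, xi) for xi in product((-1, +1), repeat=len(f))]

mean = sum(samples) / len(samples)
variance = sum(x * x for x in samples) / len(samples) - mean ** 2
bound = 2 * dot(f, f) * dot(g, g)   # 2 * SJ(F) * SJ(G)

print(mean, variance, bound)  # 13.0 157.0 364
```

The variance of 157 corresponds to a standard deviation of roughly 12.5, about as large as the true join size of 13, which is consistent with a single elementary sketch being imprecise.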

Since a higher degree of independence would not make the sketch more precise, 4-wise independence suffices. This is important since 4-wise independent ±1 random vectors can be generated on the fly by combining a small seed s and the index of the entry using \( \xi_i(s) = h(s,i) \), with h a special hash function that guarantees the 4-wise independence of the components of ξ. Generating the elements of ξ on the fly saves space and, more importantly, allows the sketches \( X_F \) and \( X_G \) to be computed using constant storage. For the previous example, this works as follows:
$$ \begin{aligned} X_F &= \mathbf{f}\xi^T = \sum_i f_i \xi_i \\ &= \sum_{t\in F} \xi_{t.a} = \xi_1+\xi_1+\xi_2+\xi_3+\xi_1+\xi_3 \\ &= h(s,1)+h(s,1)+h(s,2)+h(s,3)+h(s,1)+h(s,3) \end{aligned} $$
$$ \begin{aligned} X_G &= \mathbf{g}\xi^T = \sum_i g_i \xi_i \\ &= \sum_{t\in G} \xi_{t.a} = \xi_3+\xi_1+\xi_3+\xi_1+\xi_1 \\ &= h(s,3)+h(s,1)+h(s,3)+h(s,1)+h(s,1) \end{aligned} $$

From this example, it can be observed that, to maintain the elementary sketches over the streams F and G, the only operation needed is to increment \( X_F \) or \( X_G \) by the value of \( \xi_{t.a} \), obtained from the function h(⋅) and the seed s, where t.a is the value of attribute a of the current tuple t arriving on the data stream. The fact that the elementary sketches can be computed so easily, by considering one element at a time in an arbitrary order, is what makes AMS sketches appealing as an approximation technique.
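The streaming update can be sketched in Python as follows. The hash construction is an assumption rather than the one prescribed by this entry: it evaluates a random degree-3 polynomial modulo a prime and keeps the low-order bit, a common way to obtain (approximately) 4-wise independent ±1 values; because the prime is odd, the bit carries a bias of order 1/P. The names `P`, `make_seed`, and `ElementarySketch` are hypothetical, and attribute values are assumed to be integers in [0, P).

```python
import random

P = (1 << 61) - 1  # Mersenne prime 2^61 - 1, used as the field size (an assumption)

def make_seed(rng):
    """Seed s: four random coefficients of a degree-3 polynomial modulo P."""
    return tuple(rng.randrange(P) for _ in range(4))

def h(seed, i):
    """xi_i(s): map index i to +1 or -1 using the polynomial's low-order bit."""
    a0, a1, a2, a3 = seed
    value = (((a3 * i + a2) * i + a1) * i + a0) % P   # Horner evaluation
    return 1 if value & 1 else -1

class ElementarySketch:
    """A single counter X maintained over a stream of attribute values."""
    def __init__(self, seed):
        self.seed = seed
        self.x = 0

    def update(self, a):
        # One tuple with attribute value a arrives: add xi_a to the counter.
        self.x += h(self.seed, a)

seed = make_seed(random.Random(42))   # the same seed must be shared by F and G
sketch_f, sketch_g = ElementarySketch(seed), ElementarySketch(seed)
for a in [1, 1, 2, 3, 1, 3]:          # stream F of the running example
    sketch_f.update(a)
for a in [3, 1, 3, 1, 1]:             # stream G of the running example
    sketch_g.update(a)

estimate = sketch_f.x * sketch_g.x    # one (noisy) estimate of the join size
```

Each sketch stores only the seed and one counter, and the final counter does not depend on the order in which tuples arrive.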

Improving the Basic Scheme

Since the streams F and G are each summarized by a single number, \( X_F \) and \( X_G \), respectively, the estimate is not expected to be very precise (this is suggested as well by the above example). In order to improve the accuracy of X, a standard technique in randomized algorithms can be used, namely generating multiple independent copies of the random variable X. Copies of X are averaged in order to decrease the variance (and thus the error). The median of such averaged values of X is then used to estimate \( \mathrm{COUNT}(F \bowtie_a G) \), since medians improve confidence. Multiple copies of X can be obtained using multiple seeds, as depicted in Fig. 1.
Fig. 1

Combining elementary sketches to estimate \( \mathrm{COUNT}(F \bowtie_a G) \) with relative error at most ε with probability at least 1 − δ

It can be shown that:
  • Averaging \( \frac{8\,\mathrm{Var}\left(X\right)}{\epsilon^2 E^2\left[X\right]} \) independent copies of X reduces the relative error to ε

  • Taking the median of 2 log(1∕δ) such averages increases the confidence to 1 − δ
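The combination step can be sketched in Python as follows. The function names are hypothetical, and the base of the logarithm is not stated above, so base 2 is assumed here. In practice Var(X) and E[X] are unknown, and the variance bound given earlier would be substituted for them.

```python
import math
from statistics import mean, median

def copies_to_average(variance, expectation, eps):
    """Number of copies per average: 8 Var(X) / (eps^2 E[X]^2), rounded up."""
    return math.ceil(8 * variance / (eps ** 2 * expectation ** 2))

def groups_for_median(delta):
    """Number of averages: 2 log(1/delta), rounded up (base-2 log assumed)."""
    return math.ceil(2 * math.log2(1 / delta))

def median_of_means(copies, group_size):
    """Average consecutive groups of copies of X, then take the median.

    A trailing group with fewer than group_size copies is ignored.
    """
    groups = [copies[k:k + group_size]
              for k in range(0, len(copies) - group_size + 1, group_size)]
    return median(mean(group) for group in groups)

# Three groups of two copies each: means 1.5, 3.5, 5.5, median 3.5.
print(median_of_means([1, 2, 3, 4, 5, 6], 2))  # 3.5
```

For a self-join, where E[X] = SJ(F) and Var(X) ≤ 2 SJ(F)², the first formula gives at most 16∕ε² copies per average, independent of the data.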

Key Applications

AMS sketches are particularly well suited for computing aggregates when data is either streamed (or a single pass over the data is allowed/desirable) or distributed across multiple sites. Thus, AMS sketches are relevant for processing large amounts of data, as is the case in data warehousing, or for processing distributed/streaming data, as is the case when computing networking statistics.

Experimental Results

To get an understanding of how AMS sketches perform on the problem of estimating the self-join size of a relation, consider the following setup. The size of the domain of the attribute on which the self-join size is computed is set to 16,384. The size of the relation is fixed at 100,000 tuples. The frequencies of the join attribute values are generated according to a Zipf distribution with a varying Zipf coefficient. The number of medians is set to 1 (no median computation), and the number of elementary sketches averaged is set to 1,024.

The relative error, both theoretical and empirical, of AMS sketches is depicted in Fig. 2. The following observations confirm the intuition based on theory for the behavior of AMS sketches: (i) on the self-join size problem, the error is acceptable for sketches of size on the order of 2,000 words, (ii) the error decreases somewhat as the skew increases, and (iii) the theoretical prediction matches the empirical behavior.
Fig. 2

Relative error of AMS sketches as a function of Zipf coefficient

URL to Code

Cross-References

Recommended Reading

  1. Alon N, Gibbons PB, Matias Y, Szegedy M. Tracking join and self-join sizes in limited storage. J Comput Syst Sci. 2002;64(3):719–47.
  2. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM Symposium on Theory of Computing; 1996. p. 20–29.
  3. Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on Automata, Languages and Programming; 2002. p. 693–703.
  4. Cormode G, Garofalakis M. Sketching streams through the net: distributed approximate query tracking. In: Proceedings of the 31st International Conference on Very Large Data Bases; 2005. p. 13–24.
  5. Das A, Gehrke J, Riedewald M. Approximation techniques for spatial data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 695–706.
  6. Dobra A, Garofalakis M, Gehrke J, Rastogi R. Processing complex aggregate queries over data streams. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2002. p. 61–72.
  7. Rusu F, Dobra A. Pseudo-random number generation for sketch-based estimations. ACM Trans Database Syst. 2007;32(2):11.
  8. Rusu F, Dobra A. Statistical analysis of sketch estimators. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2007. p. 187–198.

Copyright information

© Springer Science+Business Media LLC 2016

Authors and Affiliations

  1. University of Florida, Gainesville, USA

Section editors and affiliations

  • Divesh Srivastava (1)
  1. AT&T Labs - Research, AT&T, Bedminster, USA