AMS Sketch
Definition
AMS sketches are randomized summaries of the data that can be used to compute aggregates such as the second frequency moment (the self-join size) and the sizes of joins. AMS sketches can be viewed as random projections of the data in the frequency domain onto ±1 pseudorandom vectors. The key property of AMS sketches is that the product of the projections, on the same random vector, of the join-attribute frequencies of two relations is an unbiased estimate of the size of the join of the relations. While a single AMS sketch is inaccurate, multiple such sketches can be computed and combined using averages and medians to obtain an estimate of any desired precision.
Historical Background
AMS sketches were introduced in 1996 by Noga Alon, Yossi Matias, and Mario Szegedy as part of a suite of randomized algorithms for approximate computation of frequency moments. The same authors, together with Phillip Gibbons, extended the second frequency moment application of AMS sketches to the computation of the size of the join of two relations, a more relevant database application. The initial work on AMS sketches fostered a large amount of subsequent work on data streaming algorithms, including generalizations and extensions of AMS sketches. Alon, Matias, and Szegedy received the Gödel Prize in 2005 for their work on AMS sketches.
Foundations
Although AMS sketches were initially introduced to compute the second frequency moment, the problem of estimating the size of the join of two relations is considered here instead, since the reader might be more familiar with database terminology. Notice that the self-join size of a relation coincides with its second frequency moment; thus the treatment here is slightly more general but not more complicated.
Problem Setup
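Let f and g denote the frequency vectors of the join attribute a in relations F and G, i.e., f_i (respectively g_i) is the number of tuples of F (respectively G) with a = i. Observe that the size of the join, COUNT(F ⋈_{a} G), can be written as the dot product of the two frequency vectors, \( \sum_i f_i g_i = f{g}^T \). Expressing the size of the join in terms of the frequencies of the join attribute is key for AMS-sketch-based approximation.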
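As a small illustration of the dot-product view, the following Python sketch computes the exact join size from sparse frequency vectors. The function name `join_size` and the toy relations are assumptions made for this example; each relation is represented only by its list of join-attribute values.

```python
from collections import Counter


def join_size(F, G):
    """Exact size of the equi-join of F and G on one attribute.

    F and G are iterables of join-attribute values, one per tuple.
    """
    f = Counter(F)  # frequency vector of F, stored sparsely
    g = Counter(G)  # frequency vector of G, stored sparsely
    # Dot product of the frequency vectors: a value that occurs f_i times
    # in F and g_i times in G contributes f_i * g_i joined pairs.
    return sum(count * g[value] for value, count in f.items())


F = [1, 1, 2, 3]
G = [1, 2, 2, 4]
print(join_size(F, G))  # value 1 gives 2*1, value 2 gives 1*2 -> 4
print(join_size(F, F))  # self-join size (second frequency moment): 4 + 1 + 1 = 6
```

This exact computation needs space proportional to the number of distinct attribute values, which is the cost the sketches are meant to avoid.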
Main Idea
Assume now that COUNT(F ⋈_{ a } G) needs to be estimated but only sublinear space, in terms of the size of the frequency vectors, is available. As it turns out, exact computation is not possible in sublinear space (in an asymptotic sense), but approximate computation is possible. AMS sketches allow the size of the join to be approximated using sublinear space.

With ξ a random vector of ±1 values, one per value in the domain of a, and f and g the frequency vectors of F and G:

Sketch of F, X_{ F } = fξ^{ T }

Sketch of G, X_{ G } = gξ^{ T }

X = X_{ F }X_{ G } estimates COUNT(F ⋈_{ a } G) since

\( E\left[X\right]=E\left[f{\xi}^T\xi {g}^T\right]=f\,E\left[{\xi}^T\xi \right]{g}^T=f{g}^T \)

if \( E\left[{\xi}^T\xi \right]=I \). To ensure this property, distinct elements of ξ must be pairwise independent, i.e., \( \forall i, {\xi}_i^2=1 \) and \( \forall i\ne {i}^{\prime }, E\left[{\xi}_i{\xi}_{i^{\prime }}\right]=0 \).

The variance of X is also bounded, \( \mathrm{Var}(X)\le 2\left(f{f}^T\right)\left(g{g}^T\right) \), as long as the random vector ξ is 4-wise independent, i.e., for all pairwise distinct \( {i}_1,{i}_2,{i}_3,{i}_4 \): \( E\left[{\xi}_{i_1}{\xi}_{i_2}\right]=0 \) and \( E\left[{\xi}_{i_1}{\xi}_{i_2}{\xi}_{i_3}{\xi}_{i_4}\right]=0 \).
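To see why the product of the two sketches is unbiased, the expectation can be expanded in index form. The component notation f_i, g_i, ξ_i (the entries of f, g, and ξ for attribute value i) is assumed here for illustration:

```latex
\begin{align*}
E[X] &= E\Big[\Big(\sum_i f_i \xi_i\Big)\Big(\sum_{i'} g_{i'} \xi_{i'}\Big)\Big]
      = \sum_i \sum_{i'} f_i\, g_{i'}\, E[\xi_i \xi_{i'}] \\
     &= \sum_i f_i\, g_i \underbrace{E[\xi_i^2]}_{=1}
      + \sum_{i \ne i'} f_i\, g_{i'} \underbrace{E[\xi_i \xi_{i'}]}_{=0}
      = \sum_i f_i\, g_i .
\end{align*}
```

Only the diagonal terms survive, and they add up to the dot product of the frequency vectors. Taking g = f gives E[X_F^2] = Σ_i f_i^2, the self-join size.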
From these definitions, it can be observed that, to maintain the elementary sketches over the streams F and G, the only operation needed is to increment X_{ F } and X_{ G } by the value of ξ_{t.a}, computed using the function h(⋅) and the seed s, where t.a is the value of attribute a of the current tuple t arriving on the data stream. The fact that the elementary sketches can be computed so easily, by considering one element at a time in an arbitrary order, is what makes AMS sketches appealing as an approximation technique.
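The streaming update can be sketched in Python as follows. The generator `xi` is a hypothetical stand-in for h(⋅): it evaluates a random cubic polynomial modulo a prime and maps the lowest bit to ±1, a common way to obtain approximately 4-wise independent values from a short seed. The names `make_seed`, `xi`, and `ElementarySketch` are assumptions for this example, not the construction used in the original work.

```python
import random

P = 2**31 - 1  # Mersenne prime; attribute values are assumed to lie in [0, P)


def make_seed(rng):
    """Draw a seed s: four random coefficients of a cubic polynomial mod P."""
    return tuple(rng.randrange(P) for _ in range(4))


def xi(seed, i):
    """Return the +1/-1 value for attribute value i under the given seed.

    Hypothetical stand-in for h(s, i): evaluate a random cubic polynomial
    mod P with Horner's rule and map the lowest bit to +1 or -1.
    """
    a, b, c, d = seed
    v = (((a * i + b) * i + c) * i + d) % P
    return 1 if v & 1 else -1


class ElementarySketch:
    """One elementary AMS sketch: a single counter plus the seed."""

    def __init__(self, seed):
        self.seed = seed
        self.x = 0

    def update(self, value):
        # A tuple t with t.a == value arrives: add xi_{t.a} to the counter.
        self.x += xi(self.seed, value)


rng = random.Random(0)
seed = make_seed(rng)          # F and G must share the same seed
x_f = ElementarySketch(seed)
x_g = ElementarySketch(seed)

for t_a in [1, 1, 2, 3]:       # stream of join-attribute values of F
    x_f.update(t_a)
for t_a in [4, 2, 1, 2]:       # stream of G, arriving in arbitrary order
    x_g.update(t_a)

estimate = x_f.x * x_g.x       # one (noisy) estimate of the join size
print(estimate)
```

Each sketch stores only the seed and one counter, and the final counter does not depend on the order in which tuples arrive.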
Improving the Basic Schema

Average \( \frac{8\,\mathrm{Var}(X)}{\epsilon^2{E}^2\left[X\right]} \) independent copies of X to reduce the relative error to ε

Median of 2 log(1∕δ) such averages increases the confidence to 1 − δ
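The averaging and median steps can be combined as in the following Python sketch. The parameters `n_avg` and `n_med` are hypothetical names for the number of copies averaged and the number of averages whose median is taken; in practice they would be chosen from the two expressions above. The ±1 generator `xi` is the same illustrative cubic-polynomial construction as before, an assumption rather than the generator of the original work, and for brevity the relations are passed as lists instead of being streamed.

```python
import random
from statistics import mean, median

P = 2**31 - 1  # prime modulus for the illustrative +/-1 generator


def xi(seed, i):
    """Illustrative +/-1 value for attribute value i (cubic polynomial mod P)."""
    a, b, c, d = seed
    return 1 if ((((a * i + b) * i + c) * i + d) % P) & 1 else -1


def ams_join_estimate(F, G, n_avg, n_med, rng):
    """Median of n_med averages, each over n_avg independent copies of X."""
    averages = []
    for _ in range(n_med):
        products = []
        for _ in range(n_avg):
            # A fresh seed gives an independent copy of the random vector.
            seed = tuple(rng.randrange(P) for _ in range(4))
            x_f = sum(xi(seed, v) for v in F)  # elementary sketch of F
            x_g = sum(xi(seed, v) for v in G)  # elementary sketch of G
            products.append(x_f * x_g)
        averages.append(mean(products))        # averaging reduces the variance
    return median(averages)                    # the median boosts the confidence


rng = random.Random(7)
F = [1, 1, 2, 3] * 25
G = [1, 2, 2, 4] * 25
exact = sum(F.count(v) * G.count(v) for v in set(F))
print("exact:", exact)
print("estimate:", ams_join_estimate(F, G, n_avg=64, n_med=5, rng=rng))
```

The total space is n_avg × n_med counters and seeds, independent of the size of the attribute domain.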
Key Applications
AMS sketches are particularly well suited for computing aggregates when data is either streamed (or a single pass over the data is allowed or desirable) or distributed across multiple sites. Thus, AMS sketches are relevant for processing large amounts of data, as is the case in data warehousing, or for processing distributed or streaming data, as is the case when computing networking statistics.
Experimental Results
To get an understanding of how AMS sketches perform on the problem of estimating the self-join size of a relation, consider the following setup. The domain of the attribute on which the self-join size is computed is set to 16,384. The size of the relation is fixed at 100,000 tuples. The frequencies of the join attribute values are generated according to a Zipf distribution with a varying Zipf coefficient. The number of medians is set to 1 (no median computation) and the number of elementary sketches averaged is set to 1,024.
Recommended Reading
 1. Alon N, Gibbons PB, Matias Y, Szegedy M. Tracking join and self-join sizes in limited storage. J Comput Syst Sci. 2002;64(3):719–47.
 2. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Proceedings of the 28th Annual ACM Symposium on Theory of Computing; 1996. p. 20–29.
 3. Charikar M, Chen K, Farach-Colton M. Finding frequent items in data streams. In: Proceedings of the 29th International Colloquium on Automata, Languages and Programming; 2002. p. 693–703.
 4. Cormode G, Garofalakis M. Sketching streams through the net: distributed approximate query tracking. In: Proceedings of the 31st International Conference on Very Large Data Bases; 2005. p. 13–24.
 5. Das A, Gehrke J, Riedewald M. Approximation techniques for spatial data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2004. p. 695–706.
 6. Dobra A, Garofalakis M, Gehrke J, Rastogi R. Processing complex aggregate queries over data streams. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2002. p. 61–72.
 7. Rusu F, Dobra A. Pseudo-random number generation for sketch-based estimations. ACM Trans Database Syst. 2007;32(2):11.
 8. Rusu F, Dobra A. Statistical analysis of sketch estimators. In: Proceedings of the ACM SIGMOD International Conference on Management of Data; 2007. p. 187–198.