Stream Similarity Mining

Vee, Erik

doi:10.1007/978-1-4614-8265-9_373

Synonyms

Distance between streams; Datastream distance

Definition

In many applications, it is useful to think of a datastream as representing a vector or a point in space. Given two datastreams, along with a distance or similarity measure, the distance (or similarity) between the two streams is simply the distance (respectively, similarity) between the two points that the datastreams represent. Due to the enormous amount of data being processed, datastream algorithms are allowed just a single, sequential pass over the data; in some settings, the algorithm may take a few passes. The algorithm itself must use very little memory, typically polylogarithmic in the amount of data, but is allowed to return approximate answers.
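As a minimal sketch of the single-pass constraint, suppose the two streams deliver their coordinates in the same index order, so that the ith items of both streams can be read together. Under that assumption, an L_p distance (one common choice of distance measure, picked here only for illustration) can be accumulated with a constant amount of memory. The function name `stream_lp_distance` and the zero-padding of the shorter stream are hypothetical choices, not taken from the entry.

```python
from itertools import zip_longest
from typing import Iterable


def stream_lp_distance(xs: Iterable[float], ys: Iterable[float], p: float = 2.0) -> float:
    """Exact L_p distance between two streams read once, in lockstep.

    Assumption: both streams emit coordinates in the same index order.
    A stream that ends early is treated as padded with zeros.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    total = 0.0  # the only state kept between items
    for x, y in zip_longest(xs, ys, fillvalue=0.0):
        total += abs(x - y) ** p
    return total ** (1.0 / p)


if __name__ == "__main__":
    # Each stream is consumed exactly once; iterators cannot be rewound.
    a = iter([1.0, 2.0, 3.0])
    b = iter([4.0, 6.0, 3.0])
    print(stream_lp_distance(a, b, p=2.0))
```

This computes the exact value, and it gets away with constant memory only because of the alignment assumption. It is meant to illustrate the one-pass, small-memory setting, not an approximation technique.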

There are two frequently used datastream models. In the time series model, a vector \( \overrightarrow{x} \) is represented as data items arriving in order of their indices: \( x_1, x_2, x_3, \ldots \). That is, the value of the \( i \)th item of the stream is precisely the...

Author information

Authors and Affiliations

Yahoo! Research, Silicon Valley, CA, USA
Erik Vee

Corresponding author

Correspondence to Erik Vee.

Editor information

Editors and Affiliations

Georgia Institute of Technology College of Computing, Atlanta, GA, USA
Ling Liu
University of Waterloo School of Computer Science, Waterloo, ON, Canada
M. Tamer Özsu

Section Editor information

AT&T Labs - Research, AT&T, Bedminster, NJ, USA
Divesh Srivastava

About this entry

Cite this entry

Vee, E. (2018). Stream Similarity Mining. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_373

DOI: https://doi.org/10.1007/978-1-4614-8265-9_373
Published: 07 December 2018
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-8266-6
Online ISBN: 978-1-4614-8265-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
