Stream Similarity Mining

Encyclopedia of Database Systems

Synonyms

Distance between streams; Datastream distance

Definition

In many applications, it is useful to think of a datastream as representing a vector or a point in space. Given two datastreams, along with a distance or similarity measure, the distance (or similarity) between the two streams is simply the distance (respectively, similarity) between the two points that the datastreams represent. Due to the enormous amount of data being processed, datastream algorithms are allowed just a single, sequential pass over the data; in some settings, the algorithm may take a few passes. The algorithm itself must use very little memory, typically polylogarithmic in the amount of data, but is allowed to return approximate answers.
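To make the single-pass, small-memory, approximate-answer setting concrete, the following is a minimal sketch of one well-known approach: a random-sign linear sketch for estimating the Euclidean distance between two streams. It is an illustration under stated assumptions, not necessarily the method this entry goes on to describe. The names `L2Sketch`, `sign`, and `estimate_l2_distance` are hypothetical, the choice of 256 counters is arbitrary, and a cryptographic hash stands in for the limited-independence hash family that a formal analysis would use.

```python
import hashlib
import math


def sign(row: int, index: int) -> int:
    """Deterministic pseudo-random +1/-1 for a (row, index) pair.

    Assumption: a hash of the pair stands in for a limited-independence
    hash family, so no table of random signs has to be stored.
    """
    digest = hashlib.blake2b(f"{row}:{index}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class L2Sketch:
    """Fixed-size summary of a stream viewed as a vector x."""

    def __init__(self, rows: int = 256):
        # Memory depends on `rows`, not on the length of the stream.
        self.counters = [0.0] * rows

    def update(self, index: int, value: float) -> None:
        """Process one stream item (coordinate `index` has value `value`)."""
        for r in range(len(self.counters)):
            self.counters[r] += sign(r, index) * value


def estimate_l2_distance(a: L2Sketch, b: L2Sketch) -> float:
    """Approximate ||x - y||_2 from the two sketches alone.

    The sketch is linear, so subtracting counters gives a sketch of x - y;
    the mean of the squared counters estimates the squared distance.
    """
    squares = [(p - q) ** 2 for p, q in zip(a.counters, b.counters)]
    return math.sqrt(sum(squares) / len(squares))


if __name__ == "__main__":
    x = [float(i % 7) for i in range(200)]
    y = [float((3 * i) % 5) for i in range(200)]
    sx, sy = L2Sketch(), L2Sketch()
    for i, v in enumerate(x):  # one sequential pass over each stream
        sx.update(i, v)
    for i, v in enumerate(y):
        sy.update(i, v)
    exact = math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))
    print("exact:", exact, "estimate:", estimate_l2_distance(sx, sy))
```

Each stream is read once, and only the counters are retained. The estimate is approximate: using more counters tightens it at the cost of more memory and more work per item.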

There are two frequently used datastream models. In the time series model, a vector, \( \overrightarrow{x} \), is simply represented as data items arriving in order of their indices: \( x_1, x_2, x_3, \ldots \). That is, the value of the \(i\)th item of the stream is precisely the...

Recommended Reading

  1. Alon N, Gibbons P, Matias Y, Szegedy M. Tracking join and self-join sizes in limited storage. In: Proceedings of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems; 1999. p. 10–20.

  2. Alon N, Matias Y, Szegedy M. The space complexity of approximating the frequency moments. In: Proceedings of the 28th ACM Symposium on Theory of Computing; 1996. p. 20–9.

  3. Broder A, Charikar M, Frieze A, Mitzenmacher M. Min-wise independent permutations. In: Proceedings of the 30th ACM Symposium on Theory of Computing; 1998. p. 327–36.

  4. Chambers JM, Mallows CL, Stuck BW. A method for simulating stable random variables. J Am Stat Assoc. 1976;71(354):340–4.

  5. Cohen E. Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci. 1997;55(3):441–53.

  6. Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman J. Finding interesting associations without support pruning. In: Proceedings of the 16th International Conference on Data Engineering; 2000.

  7. Cormode G, Datar M, Indyk P, Muthukrishnan S. Comparing data streams using Hamming norms. In: Proceedings of the 28th International Conference on Very Large Data Bases; 2002. p. 335–45.

  8. Datar M, Gionis A, Indyk P, Motwani R. Maintaining stream statistics over sliding windows. In: Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms; 2002. p. 635–44.

  9. Datar M, Muthukrishnan S. Estimating rarity and similarity on data stream windows. In: Proceedings of the 10th European Symposium on Algorithms; 2002.

  10. Feigenbaum J, Kannan S, Strauss M, Viswanathan M. An approximate l1-difference algorithm for massive data streams. In: Proceedings of the 40th Annual Symposium on Foundations of Computer Science; 1999.

  11. Flajolet P, Martin G. Probabilistic counting. In: Proceedings of the 24th Annual Symposium on Foundations of Computer Science; 1983. p. 76–82.

  12. Indyk P. Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science; 2000. p. 189–97.

  13. Indyk P. A small approximately min-wise independent family of hash functions. J Algorithm. 2001;38(1):84–90.

  14. On the distributional complexity of disjointness. J Comput Sci Syst. 1984;2.

  15. Saks M, Sun X. The space complexity of approximating the frequency moments. In: Proceedings of the 34th ACM Symposium on Theory of Computing; 2002.

Correspondence to Erik Vee.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Vee, E. (2018). Stream Similarity Mining. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_373
