Three Big Data Tools for a Data Scientist’s Toolbox

Calders, Toon

doi:10.1007/978-3-319-96655-7_5

Toon Calders ORCID: orcid.org/0000-0002-4943-6978

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 324)

Included in the following conference series:

European Business Intelligence and Big Data Summer School

Abstract

Sometimes data is generated unboundedly and at such a fast pace that it is no longer possible to store the complete data in a database. Developing techniques for handling and processing such data streams is very challenging because the streaming context imposes severe constraints on the computation: we are often unable to store the whole stream, and making multiple passes over the data is no longer possible. As the stream never ends, we need to be able to continuously provide, upon request, up-to-date answers to analysis queries. Even problems that are trivial in an off-line context, such as “How many different items are there in my database?”, become very hard in a streaming context. Nevertheless, in the past decades several clever algorithms have been developed to deal with streaming data. This paper covers several of these indispensable tools that should be present in every big data scientist’s toolbox, including approximate counting of frequent items, cardinality estimation of very large sets, and fast nearest neighbor search in huge data collections.
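
As a rough illustration of the first kind of tool named in the abstract, the Python sketch below reads a stream once while keeping at most k - 1 counters, in the style of the frequent-elements algorithm of Karp, Shenker and Papadimitriou. The function name `frequent_candidates` and the parameter `k` are illustrative assumptions and are not taken from the chapter; the chapter's own presentation may differ.

```python
def frequent_candidates(stream, k):
    """One-pass summary using at most k - 1 counters.

    Every item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The returned counts are
    lower bounds, and some returned keys may be false positives.
    """
    if k < 2:
        raise ValueError("k must be at least 2")

    counters = {}
    for item in stream:
        if item in counters:
            # Item is already tracked: count this occurrence.
            counters[item] += 1
        elif len(counters) < k - 1:
            # There is still a free counter: start tracking the item.
            counters[item] = 1
        else:
            # No free counter: decrement every tracked item and
            # drop those whose counter reaches zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # Hypothetical toy stream: "a" makes up more than half of the items.
    toy_stream = ["a"] * 6 + ["b", "c", "d", "e"]
    print(frequent_candidates(toy_stream, k=2))
```

The summary never holds more than k - 1 entries, so memory use is independent of the stream length. If exact frequencies are needed, a second pass over stored data would be required to verify the candidates, which is often not available in a true streaming setting.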

Author information

Authors and Affiliations

Université Libre de Bruxelles, Brussels, Belgium
Toon Calders
Universiteit Antwerpen, Antwerp, Belgium
Toon Calders

Corresponding author

Correspondence to Toon Calders.

Editor information

Editors and Affiliations

Université Libre de Bruxelles, Brussels, Belgium
Esteban Zimányi

About this paper

Cite this paper

Calders, T. (2018). Three Big Data Tools for a Data Scientist’s Toolbox. In: Zimányi, E. (ed.) Business Intelligence and Big Data. eBISS 2017. Lecture Notes in Business Information Processing, vol 324. Springer, Cham. https://doi.org/10.1007/978-3-319-96655-7_5

DOI: https://doi.org/10.1007/978-3-319-96655-7_5
Published: 15 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-96654-0
Online ISBN: 978-3-319-96655-7
eBook Packages: Computer Science, Computer Science (R0)