Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies

Riondato, Matteo

doi:10.1007/978-3-662-44845-8_48

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8726)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Sampling a dataset for faster analysis and viewing it as a sample from an unknown distribution are two sides of the same coin. We discuss the use of modern techniques involving the Vapnik-Chervonenkis (VC) dimension to study the trade-off between sample size and the accuracy of the data mining results that can be obtained from a sample. We report two case studies in which we and our collaborators employed these techniques to develop efficient sampling-based algorithms for two problems: computing betweenness centrality in large graphs and extracting statistically significant frequent itemsets from transactional datasets.
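
As a rough illustration of the trade-off between sample size and accuracy, the sketch below evaluates a standard VC-dimension bound for ε-approximations: a uniform random sample of size m ≥ (c/ε²)(d + ln(1/δ)) suffices, with probability at least 1 − δ, to estimate every frequency of interest within an additive error ε, where d is an upper bound on the VC-dimension of the problem. The function name `vc_sample_size` and the default constant c = 0.5 (an experimentally estimated value reported in the literature) are assumptions made for this sketch and are not taken from the chapter.

```python
import math


def vc_sample_size(vc_dim: int, eps: float, delta: float, c: float = 0.5) -> int:
    """Sample size sufficient for an eps-approximation with probability >= 1 - delta.

    Evaluates m = ceil((c / eps^2) * (vc_dim + ln(1 / delta))).
    The constant c is an assumption (0.5 is a commonly quoted empirical estimate).
    """
    if vc_dim < 0:
        raise ValueError("vc_dim must be non-negative")
    if not (0.0 < eps < 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("eps and delta must lie strictly between 0 and 1")
    return math.ceil((c / eps ** 2) * (vc_dim + math.log(1.0 / delta)))


if __name__ == "__main__":
    # The bound depends on the VC-dimension, eps, and delta, not on the dataset size.
    for eps in (0.05, 0.01, 0.005):
        print(eps, vc_sample_size(vc_dim=10, eps=eps, delta=0.1))
```

Note that the bound grows quadratically in 1/ε but only linearly in d and logarithmically in 1/δ, so halving the error roughly quadruples the required sample, while demanding higher confidence is comparatively cheap.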



Author information

Authors and Affiliations

Brown University, Providence, RI, 02912, USA
Matteo Riondato

Editor information

Editors and Affiliations

Faculty of Applied Sciences, Department of Computer and Decision Engineering, Université Libre de Bruxelles, Av. F. Roosevelt, CP 165/15, 1050, Brussels, Belgium
Toon Calders
Dipartimento di Informatica, Università degli Studi “Aldo Moro”, via Orabona 4, 70125, Bari, Italy
Floriana Esposito
Department of Computer Science, Universität Paderborn, Warburger Str. 100, 33098, Paderborn, Germany
Eyke Hüllermeier
Dipartimento di Informatica, Università degli Studi di Torino, Corso Svizzera 185, 10149, Torino, Italy
Rosa Meo

About this paper

Cite this paper

Riondato, M. (2014). Sampling-Based Data Mining Algorithms: Modern Techniques and Case Studies. In: Calders, T., Esposito, F., Hüllermeier, E., Meo, R. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2014. Lecture Notes in Computer Science, vol 8726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44845-8_48

DOI: https://doi.org/10.1007/978-3-662-44845-8_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44844-1
Online ISBN: 978-3-662-44845-8
eBook Packages: Computer Science, Computer Science (R0)
