A Two-Sample Kolmogorov-Smirnov-Like Test for Big Data

  • Hien D. Nguyen
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 845)

Abstract

Exploratory data analysis (EDA) is an important component of modern data analysis and data mining. The Big Data setting has made many traditional and useful EDA tools impractical and ineffective. Among such tools is the two-sample Kolmogorov-Smirnov (TS-KS) goodness-of-fit (GoF) test for assessing whether or not two samples arose from the same population. A TS-KS-like testing procedure is constructed using the chunked-and-averaged (CA) estimation paradigm. The procedure is named the TS-CAKS GoF test. Distributed and streamed implementations of the TS-CAKS procedure are discussed. The consistency of the TS-CAKS test is proved. A numerical study is provided to demonstrate the effectiveness and computational efficiency of the procedure.
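As a rough illustration of the chunked-and-averaged idea described in the abstract, the following Python sketch computes the ordinary two-sample KS statistic on paired chunks of the two samples and averages the results. It is only a sketch under stated assumptions: the function names, the rule for pairing chunks, and the choice to ignore leftover observations are hypothetical, and the paper's exact TS-CAKS statistic and its null distribution are not reproduced here.

```python
import bisect
import random


def ks_statistic(x, y):
    """Two-sample KS statistic: max over t of |F_x(t) - F_y(t)| for empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The supremum is attained at one of the observed data points.
    for t in xs + ys:
        fx = bisect.bisect_right(xs, t) / n
        fy = bisect.bisect_right(ys, t) / m
        d = max(d, abs(fx - fy))
    return d


def chunked_averaged_ks(x, y, chunk_size):
    """Average the per-chunk KS statistics over paired, equal-sized chunks.

    Assumption (not from the paper): both samples are cut into consecutive
    chunks of `chunk_size`, and leftover observations are ignored.
    """
    n_chunks = min(len(x), len(y)) // chunk_size
    if n_chunks == 0:
        raise ValueError("chunk_size exceeds the shorter sample")
    total = 0.0
    for k in range(n_chunks):
        lo, hi = k * chunk_size, (k + 1) * chunk_size
        total += ks_statistic(x[lo:hi], y[lo:hi])
    return total / n_chunks


# Illustrative synthetic data: two samples from N(0, 1) and one from N(1, 1).
rng = random.Random(0)
same_a = [rng.gauss(0, 1) for _ in range(4000)]
same_b = [rng.gauss(0, 1) for _ in range(4000)]
shifted = [rng.gauss(1, 1) for _ in range(4000)]

print(chunked_averaged_ks(same_a, same_b, 200))   # same population
print(chunked_averaged_ks(same_a, shifted, 200))  # shifted population
```

Because each chunk is processed independently and only a running sum is kept, a computation of this shape could in principle be distributed across workers or updated as new chunks arrive in a stream. Turning the averaged statistic into a test would require a reference distribution, which this sketch does not attempt to supply.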

Keywords

Big Data, Chunked-and-average estimator, Hypothesis testing, Kolmogorov-Smirnov test

Notes

Acknowledgements

The author is personally supported by Australian Research Council grant number DE170101134.

Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Department of Mathematics and Statistics, La Trobe University, Bundoora, Australia