The minimum ratio of preserving the dataset similarity in resampling: (1 − 1/e)

  • Faruk Bulut
Original Research

Abstract

Pattern recognition, data mining, and machine learning work with a predefined dataset to build a hypothesis for an artificial decision support system. A dataset may be damaged for various reasons: it may be subdivided for cross-validation to test the performance of an expert system, some samples may be deleted because they have lost their importance, and noisy or outlying data may need to be removed because they distort the general layout. In such cases, it is important to know what percentage of the samples must remain original in order to avoid corruption and preserve the overall originality. The ratio of missing, deleted, and removed samples is therefore crucial for maintaining the integrity of the whole dataset. This study proposes a theoretical approach in which the integrity and originality of a dataset are preserved at a certain ratio derived from the non-selection probability. The minimum ratio of original samples that must remain is (1 − 1/e), approximately 63.21%, where e is the base of the natural logarithm. In other words, at most a fraction 1/e (approximately 36.79%) of the data may be removed if originality is to be preserved. The remaining data points are used for resampling. A variety of parametric and nonparametric statistical criteria and tests, such as the Kolmogorov–Smirnov test, t-tests, Kruskal–Wallis ANOVA, and the Ansari–Bradley test, were used to verify the proposed theory. In the experiments, a synthetic dataset was damaged many times and compared with its original form to observe whether its originality and homogeneity changed. The experiments indicate that (1 − 1/e) is the fundamental lower bound for the authenticity and actuality of a dataset.
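The abstract does not spell out how the ratio is derived. One common way to arrive at (1 − 1/e), assumed here because the keywords mention bagging, is sampling with replacement: when n samples are drawn from a set of size n, a given sample is never selected with probability (1 − 1/n)^n, which tends to 1/e as n grows. The Python sketch below illustrates that reading only; the function names are hypothetical and it does not reproduce the experiments of the study.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Probability that a given sample is drawn at least once in n draws
    with replacement from n samples: 1 - (1 - 1/n)^n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, trials: int, seed: int = 0) -> float:
    """Average fraction of distinct original samples that appear in a
    resample of size n drawn with replacement (assumed bootstrap setting)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Indices of the original samples that were selected at least once.
        drawn = {rng.randrange(n) for _ in range(n)}
        total += len(drawn) / n
    return total / trials


# Limiting value of the selection probability as n grows.
LIMIT = 1.0 - 1.0 / math.e

if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        print(n, round(expected_unique_fraction(n), 4))
    print("simulated:", round(simulated_unique_fraction(1000, 200), 4))
    print("limit:", round(LIMIT, 4))
```

For small n the exact value lies slightly above the limit and decreases toward it, so under this assumption (1 − 1/e) acts as a lower bound on the expected share of original samples retained.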

Keywords

Resampling, Dataset, Bagging method, Data science

Supplementary material

41870_2019_316_MOESM1_ESM.docx (36 kb)
Supplementary material 1 (DOCX 36 kb)
41870_2019_316_MOESM2_ESM.xlsx (33 kb)
Supplementary material 2 (XLSX 32 kb)

Copyright information

© Bharati Vidyapeeth's Institute of Computer Applications and Management 2019

Authors and Affiliations

  1. Department of Computer Engineering, Haliç University, Istanbul, Turkey
