Data Sets and Proper Statistical Analysis of Data Mining Techniques

García, Salvador; Luengo, Julián; Herrera, Francisco

doi:10.1007/978-3-319-10247-4_2

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 72)

Abstract

Presenting and analyzing a Data Mining technique usually involves applying it to a data set from the relevant domain. Fortunately, many well-known data sets are available to researchers and are widely used to check the performance of the technique under consideration. Many of the subsequent sections of this book include a practical experimental comparison of the techniques they describe, as an illustration of this process. Such comparisons require a clearly defined test bed so that the reader can replicate and understand the analysis and the conclusions drawn from it. Section 2.1 first gives an overview of the data sets used to study the representative algorithms of each section. There we describe the data sets used in the rest of the book, indicating their characteristics, sources and availability. We also examine the partitioning procedure and how it is expected to alleviate the problems associated with validating any supervised method, and we detail the performance measures used in the rest of the book. Section 2.2 surveys the statistical techniques most commonly required in the literature to reach meaningful and correct conclusions. It also gives the steps needed to use the statistical tests correctly and to interpret their outcomes.

Notes

1. http://keel.es/datasets.php

Author information

Authors and Affiliations

Department of Computer Science, University of Jaén, Jaén, Spain
Salvador García
Department of Civil Engineering, University of Burgos, Burgos, Spain
Julián Luengo
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
Francisco Herrera

Corresponding author

Correspondence to Salvador García.

About this chapter

Cite this chapter

García, S., Luengo, J., Herrera, F. (2015). Data Sets and Proper Statistical Analysis of Data Mining Techniques. In: Data Preprocessing in Data Mining. Intelligent Systems Reference Library, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-319-10247-4_2

DOI: https://doi.org/10.1007/978-3-319-10247-4_2
Published: 31 August 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10246-7
Online ISBN: 978-3-319-10247-4
eBook Packages: Engineering, Engineering (R0)
