Abstract
This paper presents a novel procedure that sequentially applies two data preparation techniques of a different nature: data cleansing and feature selection. For the former we experiment with a partial removal of outliers via the inter-quartile range, whereas for the latter we select relevant attributes with two widespread feature subset selectors, CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection), which are founded on correlation and consistency measures, respectively. We report empirical results on seven difficult binary and multi-class data sets, that is, data sets on which C4.5 or the 1-nearest-neighbour classifier reaches a test error rate of at least 10% (in terms of accuracy) without any prior data pre-processing. Non-parametric statistical tests show that combining the two aforementioned data preparation strategies, using a correlation measure for feature selection together with the C4.5 algorithm, is significantly better in terms of the ROC measure than applying the data cleansing approach alone. Last but not least, PART, a weak and not very powerful learner, achieves promising results with the new proposal based on a consistency measure and is able to compete with the best configuration of C4.5. To sum up, under the new approach and in terms of ROC, the PART classifier with a consistency measure behaves slightly better than C4.5 with a correlation measure.
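The sketch below illustrates the general shape of the two-step preparation described above: a partial, IQR-based outlier removal followed by a simple correlation-driven attribute selection. It is a minimal illustration under stated assumptions, not the authors' exact procedure; the whisker factor, the tolerated fraction of outlying attributes, and the plain Pearson ranking (a crude stand-in for CFS, which also penalises inter-feature redundancy) are all illustrative choices.

```python
# Minimal sketch (not the authors' exact method) of the two-step preparation:
# 1) partial outlier removal via the inter-quartile range (IQR),
# 2) correlation-based attribute selection (simplified stand-in for CFS).
import numpy as np
import pandas as pd

def iqr_outlier_mask(X: pd.DataFrame, whisker: float = 1.5) -> pd.DataFrame:
    """Boolean frame marking values outside [Q1 - w*IQR, Q3 + w*IQR] per column."""
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    return (X < q1 - whisker * iqr) | (X > q3 + whisker * iqr)

def remove_outliers(X: pd.DataFrame, y: pd.Series, max_outlier_ratio: float = 0.2):
    """Drop only instances whose share of outlying attributes exceeds the ratio,
    i.e. a 'partial' removal: mildly atypical instances are kept."""
    ratio = iqr_outlier_mask(X).mean(axis=1)
    keep = ratio <= max_outlier_ratio
    return X[keep], y[keep]

def select_by_correlation(X: pd.DataFrame, y: pd.Series, k: int = 10) -> list:
    """Rank attributes by absolute Pearson correlation with the class label."""
    scores = X.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    return scores.nlargest(min(k, X.shape[1])).index.tolist()

# Usage on a toy data set (hypothetical data, for illustration only):
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 12)), columns=[f"a{i}" for i in range(12)])
y = pd.Series((X["a0"] + X["a1"] > 0).astype(int))
X_clean, y_clean = remove_outliers(X, y)
print(select_by_correlation(X_clean, y_clean, k=4))
```

The prepared data would then be handed to a classifier such as C4.5, PART, or 1-nearest neighbour and evaluated with accuracy and ROC, as done in the paper.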
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Tallón-Ballesteros, A.J., Riquelme, J.C. (2015). Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach. In: Ferrández Vicente, J., Álvarez-Sánchez, J., de la Paz López, F., Toledo-Moreo, F., Adeli, H. (eds) Bioinspired Computation in Artificial Systems. IWINAC 2015. Lecture Notes in Computer Science, vol 9108. Springer, Cham. https://doi.org/10.1007/978-3-319-18833-1_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18832-4
Online ISBN: 978-3-319-18833-1