Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach

  • Conference paper
Bioinspired Computation in Artificial Systems (IWINAC 2015)

Abstract

This paper presents a novel procedure that applies, in sequence, two data preparation techniques of a different nature: data cleansing and feature selection. For the former, we experimented with a partial removal of outliers via the inter-quartile range; for the latter, we selected relevant attributes with two widespread feature subset selectors, CFS (Correlation-based Feature Selection) and CNS (Consistency-based Feature Selection), which are founded on correlation and consistency measures, respectively. We report empirical results on seven difficult binary and multi-class data sets, that is, data sets on which C4.5 or 1-nearest neighbour classifiers, without any prior data pre-processing, reach a test error rate of at least 10% in terms of accuracy. Non-parametric statistical tests show that combining the two data preparation strategies, using a correlation measure for feature selection with the C4.5 algorithm, is significantly better in terms of the ROC measure than applying the data cleansing approach alone. Last but not least, a weak learner such as PART achieved promising results with the new proposal based on a consistency measure and is able to compete with the best configuration of C4.5. To sum up, under the new approach and for the ROC measure, the PART classifier with a consistency metric behaves slightly better than C4.5 with a correlation measure.
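As an illustration of the data cleansing step only, the following is a minimal sketch of inter-quartile range outlier flagging in plain Python. It is not the authors' implementation: the abstract does not say which fence multiplier was used or how the removal was made partial, so the multiplier `k=1.5` (the common Tukey convention), the rule that a row is flagged when any attribute falls outside its fences, and the function names `iqr_bounds`, `flag_outlier_rows` and `remove_outliers` are all assumptions made for this example.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (low, high) fences Q1 - k*IQR and Q3 + k*IQR for one attribute.

    k=1.5 is an assumed default; the abstract does not state the multiplier.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outlier_rows(rows, k=1.5):
    """Flag a row when any of its numeric attributes lies outside that column's fences."""
    columns = list(zip(*rows))  # transpose rows into per-attribute columns
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        any(not (low <= value <= high) for value, (low, high) in zip(row, bounds))
        for row in rows
    ]


def remove_outliers(rows, labels, k=1.5):
    """Drop every flagged row together with its class label.

    Assumption: all flagged rows are dropped. The partial removal described
    in the abstract would keep some of them, by a rule not given there.
    """
    flags = flag_outlier_rows(rows, k)
    kept = [(row, label) for row, label, flag in zip(rows, labels, flags) if not flag]
    return [row for row, _ in kept], [label for _, label in kept]


if __name__ == "__main__":
    rows = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [4.0, 13.0], [100.0, 14.0]]
    labels = ["a", "a", "b", "b", "a"]
    clean_rows, clean_labels = remove_outliers(rows, labels)
    print(len(clean_rows), clean_labels)
```

In the sequential procedure the abstract describes, a feature subset selector such as CFS or CNS would then be run on the cleansed data before training the classifier.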

Author information

Corresponding author

Correspondence to Antonio J. Tallón-Ballesteros.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Tallón-Ballesteros, A.J., Riquelme, J.C. (2015). Data Cleansing Meets Feature Selection: A Supervised Machine Learning Approach. In: Ferrández Vicente, J., Álvarez-Sánchez, J., de la Paz López, F., Toledo-Moreo, F., Adeli, H. (eds) Bioinspired Computation in Artificial Systems. IWINAC 2015. Lecture Notes in Computer Science, vol 9108. Springer, Cham. https://doi.org/10.1007/978-3-319-18833-1_39

  • DOI: https://doi.org/10.1007/978-3-319-18833-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-18832-4

  • Online ISBN: 978-3-319-18833-1

  • eBook Packages: Computer Science (R0)
