Preprocessing in High Dimensional Datasets

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 137)

Abstract

In the last few years, we have witnessed the advent of Big Data and, more specifically, Big Dimensionality, which refers to the unprecedented number of features that is rendering existing machine learning methods inadequate. A common way to deal with these high-dimensional spaces is to apply data preprocessing techniques that can help to reduce the dimensionality of the problem. Feature selection is one of the most popular dimensionality reduction techniques; it can be defined as the process of detecting the relevant features and discarding the irrelevant and redundant ones. Discretization can also help to reduce the size and complexity of a problem in Big Data settings, by reducing data from a large domain of numeric values to a small set of categorical values. This chapter describes these preprocessing techniques in detail and provides examples of new implementations developed to deal with Big Data.
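
To make the two ideas above concrete, the snippet below is a minimal sketch (using scikit-learn on a synthetic dataset, which is an assumption of this illustration rather than one of the Spark or GPU implementations discussed in the chapter): it first applies a mutual-information filter to keep only the most relevant features, then discretizes them into a small set of ordinal bins.

```python
# Illustrative sketch only: filter-based feature selection followed by
# discretization, on a synthetic high-dimensional dataset.
# Assumes NumPy and scikit-learn are installed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic "wide" dataset: many features, only a few of them informative.
X, y = make_classification(n_samples=300, n_features=1000,
                           n_informative=10, random_state=0)

# Feature selection: keep the 10 features with the highest mutual
# information with the class label, discard the irrelevant rest.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)    # (300, 1000) -> (300, 10)

# Discretization: map each remaining numeric feature to 5 equal-width
# ordinal bins, shrinking a large numeric domain to a few categories.
discretizer = KBinsDiscretizer(n_bins=5, encode="ordinal",
                               strategy="uniform")
X_discrete = discretizer.fit_transform(X_reduced)
print(np.unique(X_discrete))             # bin indices, e.g. [0. 1. 2. 3. 4.]
```

A mutual-information filter is used here because filter methods are classifier-independent and comparatively cheap, which is what makes this family of methods attractive for high-dimensional data.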

Author information

Corresponding author: Amparo Alonso-Betanzos

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Alonso-Betanzos, A., Bolón-Canedo, V., Eiras-Franco, C., Morán-Fernández, L., Seijo-Pardo, B. (2018). Preprocessing in High Dimensional Datasets. In: Holmes, D., Jain, L. (eds) Advances in Biomedical Informatics. Intelligent Systems Reference Library, vol 137. Springer, Cham. https://doi.org/10.1007/978-3-319-67513-8_11

  • DOI: https://doi.org/10.1007/978-3-319-67513-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67512-1

  • Online ISBN: 978-3-319-67513-8

  • eBook Packages: Engineering, Engineering (R0)
