Preprocessing in High Dimensional Datasets

  • Amparo Alonso-Betanzos
  • Verónica Bolón-Canedo
  • Carlos Eiras-Franco
  • Laura Morán-Fernández
  • Borja Seijo-Pardo
Chapter
Part of the Intelligent Systems Reference Library book series (ISRL, volume 137)

Abstract

In the last few years, we have witnessed the advent of Big Data and, more specifically, Big Dimensionality, which refers to the unprecedented number of features that is rendering existing machine learning methods inadequate. A common way to deal with these high-dimensional spaces is to apply data preprocessing techniques that reduce the dimensionality of the problem. Feature selection is one of the most popular dimensionality reduction techniques: it can be defined as the process of detecting the relevant features and discarding the irrelevant and redundant ones. Discretization can also reduce the size and complexity of a problem in Big Data settings, by mapping data from a large domain of numeric values to a much smaller set of categorical values. This chapter describes these preprocessing techniques in detail and provides examples of new implementations developed to deal with Big Data.
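To make the two techniques concrete, here is a minimal, illustrative Python sketch (not taken from the chapter itself): an equal-width discretizer turns numeric values into categorical bins, and a simple filter ranks features by their empirical mutual information with the class labels, discarding the least relevant ones. All function names and parameters are hypothetical choices for illustration.

```python
from collections import Counter
import math

def equal_width_discretize(values, bins=3):
    """Map numeric values into `bins` equal-width categorical intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(columns, labels):
    """Rank feature columns (most relevant first) by mutual information
    with the class labels, discretizing each column first."""
    scores = [(mutual_information(equal_width_discretize(col), labels), i)
              for i, col in enumerate(columns)]
    return [i for _, i in sorted(scores, reverse=True)]
```

For example, a feature whose values separate the classes cleanly is ranked ahead of a noisy one: `rank_features([[0.1, 0.2, 0.9, 1.0], [0.5, 0.9, 0.1, 0.6]], [0, 0, 1, 1])` returns `[0, 1]`. Filter methods like mRMR (discussed in the chapter) refine this idea by also penalizing redundancy between the selected features.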

Keywords

Preprocessing · Big data · Big dimensionality · Discretization · Feature selection

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Amparo Alonso-Betanzos¹
  • Verónica Bolón-Canedo¹
  • Carlos Eiras-Franco¹
  • Laura Morán-Fernández¹
  • Borja Seijo-Pardo¹

  1. Departamento de Computación, Universidade da Coruña, A Coruña, Spain
