Abstract
In recent years, we have witnessed the advent of Big Data and, more specifically, Big Dimensionality, which refers to the unprecedented number of features that is rendering existing machine learning methods inadequate. A common way to deal with these high-dimensional spaces is to apply data preprocessing techniques that reduce the dimensionality of the problem. Feature selection is one of the most popular dimensionality reduction techniques; it can be defined as the process of detecting the relevant features and discarding the irrelevant and redundant ones. Discretization can also reduce the size and complexity of a problem in Big Data settings by mapping data from a large domain of numeric values to a smaller set of categorical values. This chapter describes these preprocessing techniques in detail and provides examples of new implementations developed to deal with Big Data.
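The two preprocessing steps named above can be sketched in a few lines. The following is a minimal illustrative example, not an implementation from the chapter: it assumes a tiny in-memory dataset, uses equal-width binning for discretization, and ranks features by absolute Pearson correlation with the class as a simple filter-style relevance score (real filters, such as mRMR, use more elaborate criteria).

```python
def equal_width_discretize(values, n_bins=3):
    """Map numeric values to categorical bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def relevance(xs, ys):
    """Absolute Pearson correlation, used here as a relevance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def rank_features(columns, y):
    """Rank feature columns by relevance to the class label, best first."""
    scores = [(relevance(col, y), i) for i, col in enumerate(columns)]
    return [i for _, i in sorted(scores, reverse=True)]

# Toy data: feature 0 tracks the class label, feature 1 is noise.
f0 = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85]
f1 = [0.5, 0.4, 0.6, 0.5, 0.55, 0.45]
y = [0, 1, 0, 1, 0, 1]

print(rank_features([f0, f1], y))          # prints [0, 1]
print(equal_width_discretize(f0, n_bins=2))  # prints [0, 1, 0, 1, 0, 1]
```

Note that this ranking only scores each feature's individual relevance; it does not detect redundancy between features, which is exactly what methods such as mRMR, discussed later, add on top of a relevance criterion.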
© 2018 Springer International Publishing AG
Cite this chapter
Alonso-Betanzos, A., Bolón-Canedo, V., Eiras-Franco, C., Morán-Fernández, L., Seijo-Pardo, B. (2018). Preprocessing in High Dimensional Datasets. In: Holmes, D., Jain, L. (eds) Advances in Biomedical Informatics. Intelligent Systems Reference Library, vol 137. Springer, Cham. https://doi.org/10.1007/978-3-319-67513-8_11
Print ISBN: 978-3-319-67512-1
Online ISBN: 978-3-319-67513-8