Abstract
Feature selection is an important data pre-processing step in pattern recognition and data mining. It is effective in reducing dimensionality and redundancy among the selected features, and increasing the performance of learning algorithm and generating information-rich features subset. In this regard, the chapter reports on a rough set-based feature selection algorithm called maximum relevance-maximum significance (MRMS), and its applications on quantitative structure activity relationship (QSAR) and gene expression data. It selects a set of features from a high-dimensional data set by maximizing the relevance and significance of the selected features. A theoretical analysis is reported to justify the use of both relevance and significance criteria for selecting a reduced feature set with high predictive accuracy. The importance of rough set theory for computing both relevance and significance of the features is also established. The performance of the MRMS algorithm, along with a comparison with other related methods, is studied on three QSAR data sets using the R 2 statistic of support vector regression method, and on five cancer and two arthritis microarray data sets by using the predictive accuracy of the K-nearest neighbor rule and support vector machine.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Science, USA 96(12), 6745–6750 (1999)
Amat, L., Besalu, E., Dorca, R.C.: Identification of active molecular sites using quantum-self-similarity matrices. Journal of Chemical Information and Computer Science 41, 978–991 (2001)
Bazan, J., Skowron, A., Synak, P.: Dynamic Reducts as a Tool for Extracting Laws from Decision Tables. In: Raś, Z.W., Zemankova, M. (eds.) ISMIS 1994. LNCS (LNAI), vol. 869, pp. 346–355. Springer, Heidelberg (1994)
Bjorvand, A., Komorowski, J.: Practical applications of genetic algorithms for efficient reduct computation. In: Proceedings of the 15th IMACS World Congress on Scientific Computation, Modeling and Applied Mathematics, vol. 4, pp. 601–606 (1997)
Bravi, G., Gancia, E., Mascagni, P., Pegna, M., Todeschini, R., Zaliani, A.: MS-WHIM: New 3D theoretical descriptors derived from molecular surface properties: A comparative 3D QSAR study in a series of steroids. Journal of Computer-Aided Molecular Design 11, 79–92 (1997)
Chen, H., Zhou, J., Xie, G.: PARM: A genetic algorithm to predict bioactivity. Journal of Chemical Information and Computer Science 38, 243–250 (1998)
Chen, K.H., Raś, Z.W., Skowron, A.: Attributes and rough properties in information systems. International Journal of Approximate Reasoning 2, 365–376 (1988)
Chouchoulas, A., Shen, Q.: Rough set-aided keyword reduction for text categorisation. Applied Artificial Intelligence 15, 843–873 (2001)
Cornelis, C., Jensen, R., Martin, G.H., Ślęzak, D.: Attribute selection with fuzzy decision reducts. Information Sciences 180, 209–224 (2010)
Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice-Hall, Englewood Cliffs (1982)
Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification and Scene Analysis. John Wiley and Sons, New York (1999)
Fang, J., Busse, J.W.G.: Mining of MicroRNA Expression Data—A Rough Set Approach. In: Wang, G.-Y., Peters, J.F., Skowron, A., Yao, Y. (eds.) RSKT 2006. LNCS (LNAI), vol. 4062, pp. 758–765. Springer, Heidelberg (2006)
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286(5439), 531–537 (1999)
Gordon, G.J., Jensen, R.V., Hsiao, L.L., Gullans, S.R., Blumenstock, J.E., Ramaswamy, S., Richards, W.G., Sugarbaker, D.J., Bueno, R.: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 62, 4963–4967 (2002)
Gruzdz, A., Ihnatowicz, A., Ślęzak, D.: Interactive gene clustering - A case study of breast cancer microarray data. Information Systems Frontiers 8, 21–27 (2006)
Han, J., Kamber, M.: Data Mining, Concepts and Techniques. Morgan Kaufmann Publishers (2001)
Inuiguchi, M., Yoshioka, Y., Kusunoki, Y.: Variable-precision dominance-based rough set approach and attribute reduction. International Journal of Approximate Reasoning 50, 1199–1214 (2009)
Jain, A.N., Koile, K., Chapman, D.: Compass: Predicting biological activities from molecular surface properties. Performance comparisons on a steroid benchmark. Journal of Medicinal Chemistry 37, 2315–2327 (1994)
Jensen, R., Shen, Q.: Semantics-preserving dimensionality reduction: Rough and fuzzy-rough-based approach. IEEE Transactions on Knowledge and Data Engineering 16(12), 1457–1471 (2004)
Jiang, D., Tang, C., Zhang, A.: Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering 16(11), 1370–1386 (2004)
Katritzky, A.R., Lobanov, V., Karelson, M.: Comprehensive descriptors for structural and statistical analysis version 1.1. University of Florida (1994)
Kim, D.: Data classification based on tolerant rough set. Pattern Recognition 34(8), 1613–1624 (2001)
Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97(1-2), 273–324 (1997)
Koller, D., Sahami, M.: Toward optimal feature selection. In: Proceedings of the International Conference on Machine Learning, pp. 284–292 (1996)
Komorowski, J., Pawlak, Z., Polkowski, L., Skowron, A.: Rough sets: A tutorial. In: Pal, S., Skowron, A. (eds.) Rough-Fuzzy Hybridization: A New Trend in Decision Making, pp. 3–98. Springer, Singapore (1999)
Kudo, Y., Murai, T., Akama, S.: A Granularity-based framework of deduction, induction, and abduction. International Journal of Approximate Reasoning 50(8), 1215–1226 (2009)
Leach, A.R.: Molecular Modelling: Principles and Applications, vol. 2. Prentice-Hall (2001)
Li, J., Su, H., Chen, H., Futscher, B.W.: Optimal search-based gene subset selection for gene array cancer classification. IEEE Transactions on Information Technology in Biomedicine 11(4), 398–405 (2007)
Li, Z.R., Han, L.Y., Xue, Y., Yap, C.W., Li, H., Jiang, L., Chen, Y.Z.: MODEL – Molecular descriptor lab: A Web-based server for computing structural and physicochemical features of compounds. Biotechnology and Bioengineering 97, 389–396 (2007)
Liu, S.S., Yin, C.S., Li, Z.L., Cai, S.X.: QSAR study of steroid benchmark and dipeptides based on MEDV-13. Journal of Chemical Information and Computer Science 41, 321–329 (2001)
Liu, X., Krishnan, A., Mondry, A.: An entropy based gene selection method for cancer classification using microarray data. BMC Bioinformatics 6(76), 1–14 (2005)
Maji, P.: f-Information measures for efficient selection of discriminative genes from microarray data. IEEE Transactions on Biomedical Engineering 56(4), 1063–1069 (2009)
Maji, P., Paul, S.: Rough sets for selection of molecular descriptors to predict biological activity of molecules. IEEE Transactions on System, Man and Cybernetics, Part C, Applications and Reviews 40(6), 639–648 (2010)
Maji, P., Paul, S.: Rough set based maximum relevance-maximum significance criterion and gene selection from microarray data. International Journal of Approximate Reasoning 52(3), 408–426 (2011)
Modrzejewski, M.: Feature selection using rough sets theory. In: Proceedings of the 11th International Conference on Machine Learning, pp. 213–226 (1993)
Napolitano, F., Raiconi, G., Tagliaferri, R., Ciaramella, A., Staiano, A., Miele, G.: Clustering and visualization approaches for human cell cycle gene expression data analysis. International Journal of Approximate Reasoning 47, 70–84 (2008)
van der Pouw Kraan, T.C.T.M., Kraan, T.C.T.M., van Gaalen, F.A., Kasperkovitz, P.V., Verbeet, N.L., Smeets, T.J.M., Kraan, M.C., Fero, M., Tak, P.P., Huizinga, T.W.J., Pieterman, E., Breedveld, F.C., Alizadeh, A.A., Verweij, C.L.: Rheumatoid arthritis is a heterogeneous disease: Evidence for differences in the activation of the STAT-1 pathway between rheumatoid tissues. Arthritis and Rheumatism 48(8), 2132–2145 (2003)
van der Pouw Kraan, T.C.T.M., Wijbrandts, C.A., van Baarsen, L.G.M., Voskuyl, A.E., Rustenburg, F., Baggen, J.M., Ibrahim, S.M., Fero, M., Dijkmans, B.A.C., Tak, P.P., Verweij, C.L.: Rheumatoid arthritis subtypes identified by genomic profiling of peripheral blood cells: Assignment of a type I interferon signature in a subpopulation of pateints. Annals of the Rheumatic Diseases 66, 1008–1014 (2007)
Parthalain, N.M., Shen, Q.: Exploring the boundary region of tolerance rough sets for feature selection. Pattern Recognition 42(5), 655–667 (2009)
Pawlak, Z.: Rough Sets, Theoretical Aspects of Resoning About Data. Kluwer, Dordrecht (1991)
Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(8), 1226–1238 (2005)
Polanski, J., Walczak, B.: The comparative molecular surface analysis (COMSA): a novel tool for molecular design. Computers and Chemistry 24, 615–625 (2000)
Robert, D., Amat, L., Carbo-Dorca, R.: Three-dimensional quantitative structure-activity relationships from tuned molecular quantum similarity measures: Prediction of the corticosteroid-binding globulin binding affinity for a steroid family. Journal of Chemical Information and Computer Sciences 39, 333–344 (1999)
Robinson, D.D., Winn, P., Lyne, P., Richards, W.: Self-organizing molecular field analysis: A tool for structure-activity studies. Journal of Medicinal Chemistry 42, 573–583 (1999)
Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., Sellers, W.R.: Gene expression correlates of clinical prostate cancer behavior. Cancer Research 1, 203–209 (2002)
Skowron, A., Rauszer, C.: The discernibility matrices and functions in information systems. In: Słowiński, R. (ed.) Intelligent Decision Support, pp. 331–362. Kluwer Academic Publishers, Dordrecht (1992)
Skowron, A., Świniarski, R.W., Synak, P.: Approximation Spaces and Information Granulation. In: Peters, J.F., Skowron, A. (eds.) Transactions on Rough Sets III. LNCS, vol. 3400, pp. 175–189. Springer, Heidelberg (2005)
Ślęzak, D.: Approximate reducts in decision tables. In: Proceedings of the 6th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 1996), pp. 1159–1164 (1996)
Ślęzak, D., Wróblewski, J.: Roughfication of Numeric Decision Tables: The Case Study of Gene Expression Data. In: Yao, J., Lingras, P., Wu, W.-Z., Szczuka, M.S., Cercone, N.J., Ślęzak, D. (eds.) RSKT 2007. LNCS (LNAI), vol. 4481, pp. 316–323. Springer, Heidelberg (2007)
Sventik, V., Wang, T., Tong, C., Liaw, A., Sheridan, R.P., Song, Q.: Boosting: An ensemble learning tool for compound classification and QSAR modeling. Journal of Chemical Information and Modeling 45(3), 786–799 (2005)
Tuppurainen, K., Viisas, M., Laatikainen, R., Peräkylä, M.: Evaluation of a novel electronic eigenvalue (EEVA) molecular descriptor for QSAR/QSPR studies: Validation using a benchmark steroid data set. Journal of Chemical Information and Computer Sciences 42, 607–613 (2002)
Turner, D.B., Willett, P., Ferguson, A.M., Heritage, T.W.: Evaluation of a novel molecular vibration-based descriptor (EVA) for QSAR studies: 2. Model validation using a benchmark steroid dataset. Journal of Computer-Aided Molecular Design 13, 271–296 (1999)
Valdés, J.J., Barton, A.J.: Relevant Attribute Discovery in High Dimensional Data: Application to Breast Cancer Gene Expressions. In: Wang, G.-Y., Peters, J.F., Skowron, A., Yao, Y. (eds.) RSKT 2006. LNCS (LNAI), vol. 4062, pp. 482–489. Springer, Heidelberg (2006)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson, J.A., Marks, J.R., Nevins, J.R.: Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Science, USA 98(20), 11,462–11,467 (2001)
Xie, G., Zhang, J., Lai, K., Yu, L.: Variable precision rough set for group decision-making: an application. International Journal of Approximate Reasoning 49, 331–343 (2008)
Yao, Y.: Probabilistic rough set approximations. International Journal of Approximate Reasoning 49(2), 255–271 (2008)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Maji, P., Paul, S. (2013). Rough Set-Based Feature Selection: Criteria of Max-Dependency, Max-Relevance, and Max-Significance. In: Skowron, A., Suraj, Z. (eds) Rough Sets and Intelligent Systems - Professor Zdzisław Pawlak in Memoriam. Intelligent Systems Reference Library, vol 43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30341-8_21
Download citation
DOI: https://doi.org/10.1007/978-3-642-30341-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30340-1
Online ISBN: 978-3-642-30341-8
eBook Packages: EngineeringEngineering (R0)