Abstract
Feature Selection methods in Data Mining and Data Analysis aim at selecting a subset of the variables, or features, that describe the data, in order to obtain a more essential and compact representation of the available information. The selected subset must be small and must retain the information that is most useful for the specific application. The role of Feature Selection is particularly important when computationally expensive Data Mining tools are used, or when the data collection process is difficult or costly. Feature Selection problems are typically solved in the literature by search techniques, where a candidate subset is evaluated either by an appropriate function (filter methods) or directly by the performance of a Data Mining tool (wrapper methods). In this work we show how the Feature Selection problem can be formulated as a subgraph selection problem derived from the lightest k-subgraph problem, and solved as an Integer Program. The proposed formulation is very flexible, as additional conditions on the solution can easily be added to it. Although such problems are hard to solve to optimality in the worst case, a large number of test instances have been solved efficiently by commercial solvers. Finally, an application to a database on urban mobility is presented, where the proposed method is integrated into the Data Mining tool Lsquare and compared with other approaches.
Triantaphyllou, E. and G. Felici (Eds.), Data Mining and Knowledge Discovery Approaches based on Rule Induction Techniques, Massive Computing Series, Springer, Heidelberg, Germany, pp. 227–252, 2006.
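The chapter's exact model is not reproduced on this page; as a rough illustration of the approach the abstract describes, a lightest k-subgraph formulation can be written as the 0-1 program below. Here V is the set of candidate features, E the set of feature pairs, and w_{ij} an assumed weight measuring the redundancy between features i and j (the chapter's actual weights and side constraints may differ):

$$
\begin{aligned}
\min \quad & \sum_{(i,j)\in E} w_{ij}\, y_{ij} \\
\text{s.t.} \quad & \sum_{i\in V} x_i = k, \\
& y_{ij} \ge x_i + x_j - 1, \qquad (i,j)\in E, \\
& x_i \in \{0,1\},\; y_{ij} \in \{0,1\},
\end{aligned}
$$

where x_i = 1 selects feature i and the linking constraints force y_{ij} = 1 whenever both endpoints are selected; since the objective is minimized, y_{ij} drops to 0 otherwise, so the program picks the k features whose induced subgraph is lightest, i.e., a mutually least-redundant subset. A minimal sketch of this model in Python with the PuLP library, using made-up weights (not the chapter's code or data), could look as follows:

```python
import pulp

# Hypothetical pairwise redundancy weights for n candidate features;
# in the chapter these would be computed from the data set.
n, k = 5, 3
w = {(i, j): abs(i - j) / n for i in range(n) for j in range(i + 1, n)}

prob = pulp.LpProblem("lightest_k_subgraph", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", range(n), cat="Binary")   # x[i] = 1: feature i selected
y = pulp.LpVariable.dicts("y", w.keys(), cat="Binary")   # y[i,j] = 1: both endpoints selected

# Objective: total weight of the edges induced by the selected subset.
prob += pulp.lpSum(w[e] * y[e] for e in w)

# Exactly k features are chosen.
prob += pulp.lpSum(x[i] for i in range(n)) == k

# Linking constraints: y[i,j] must be 1 when both x[i] and x[j] are 1;
# minimization drives it to 0 otherwise.
for (i, j) in w:
    prob += y[(i, j)] >= x[i] + x[j] - 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("selected features:", [i for i in range(n) if x[i].value() == 1])
```

The abstract's remark that additional conditions on the solution can be added corresponds to simply appending further linear constraints to the program, e.g. forcing a given feature in or out with x[i] == 1 or x[i] == 0.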
Copyright information
© 2006 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
de Angelis, V., Felici, G., Mancinelli, G. (2006). Feature Selection for Data Mining. In: Triantaphyllou, E., Felici, G. (eds) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Massive Computing, vol 6. Springer, Boston, MA. https://doi.org/10.1007/0-387-34296-6_6
DOI: https://doi.org/10.1007/0-387-34296-6_6
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-34294-8
Online ISBN: 978-0-387-34296-2
eBook Packages: Computer Science (R0)