A New Feature Selection and Feature Contrasting Approach Based on Quality Metric: Application to Efficient Classification of Complex Textual Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7867)

Abstract

Feature maximization is a cluster quality metric that favors clusters with maximum feature representation with regard to their associated data. In this paper we go one step further, showing that a straightforward adaptation of this metric provides a highly efficient feature selection and feature contrasting model in the context of supervised classification. We show in particular that the technique can enhance the performance of classification methods while very significantly outperforming (by +80%) state-of-the-art feature selection techniques on unbalanced, highly multidimensional and noisy textual data with a high degree of similarity between the classes.
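The abstract does not spell out how the metric is computed or how it is turned into a selection and contrasting step. The sketch below is one plausible reading, assuming (as in the authors' earlier clustering work) that the feature F-measure FF(f, c) is the harmonic mean of a feature recall (the share of a feature's total weight falling into class c) and a feature precision (the share of class c's weight carried by that feature), that features are kept when their best FF exceeds the mean FF, and that kept features are re-weighted by their FF gain. Function names, the thresholding rule, and the contrast gain are illustrative assumptions, not the authors' API.

```python
# Minimal sketch of feature-maximization-based selection and contrasting.
# Assumption: FF(f, c) = harmonic mean of feature recall and feature precision
# computed from summed feature weights per class; selection keeps features
# whose best FF beats the mean FF; contrasting boosts kept weights by FF / mean FF.
import numpy as np

def feature_maximization(X, y):
    """Return the FF(f, c) matrix (classes x features) for a weight matrix X and labels y."""
    classes = np.unique(y)
    # Summed weight of each feature inside each class (classes x features).
    S = np.vstack([X[y == c].sum(axis=0) for c in classes])
    recall = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)     # share of the feature's mass in c
    precision = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)  # share of c's mass carried by f
    return 2.0 * recall * precision / np.maximum(recall + precision, 1e-12)

def select_and_contrast(X, y):
    """Keep features whose best FF beats the mean FF; boost kept weights by FF / mean FF."""
    FF = feature_maximization(X, y)
    active = FF.max(axis=0) > 0
    threshold = FF[:, active].mean()             # mean FF over features that occur at all
    keep = FF.max(axis=0) > threshold            # selection step
    gain = FF[:, keep].max(axis=0) / threshold   # contrast (re-weighting) step
    return keep, X[:, keep] * gain

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((20, 10))          # toy document-term weights
    y = rng.integers(0, 2, size=20)   # toy class labels
    keep, Xc = select_and_contrast(X, y)
    print(keep.sum(), "features kept; contrasted matrix shape:", Xc.shape)
```

In this reading, selection prunes features that are not strongly attached to any class, and contrasting amplifies the surviving class-characteristic features, which is consistent with the abstract's claim of improved classifier performance on noisy, highly similar classes.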


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lamirel, J.-C., Cuxac, P., Chivukula, A.S., Hajlaoui, K. (2013). A New Feature Selection and Feature Contrasting Approach Based on Quality Metric: Application to Efficient Classification of Complex Textual Data. In: Li, J., et al. (eds.) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science (LNAI), vol. 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_32

  • DOI: https://doi.org/10.1007/978-3-642-40319-4_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40318-7

  • Online ISBN: 978-3-642-40319-4

  • eBook Packages: Computer Science (R0)
