
Optimizing text classification through efficient feature selection based on quality metric

Journal of Intelligent Information Systems

Abstract

Feature maximization is a cluster quality metric that favors clusters whose features are maximally representative of their associated data. In this paper we show that a simple adaptation of this metric provides a highly efficient feature selection and feature contrasting model in the context of supervised classification. The method is evaluated on several types of textual datasets. The paper shows that the proposed method yields a very significant performance increase compared to state-of-the-art methods in all the studied cases, even when a single bag-of-words model is used for data description. Interestingly, the most significant performance gain is obtained for the classification of highly unbalanced, highly multidimensional, and noisy data with a high degree of similarity between the classes.
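
To make the selection step concrete, below is a minimal, self-contained sketch of a feature-maximization-style selection model operating on a bag-of-words matrix. This is an illustration under stated assumptions, not the authors' implementation: the feature recall and feature precision definitions follow Note 1 below, while the selection rule used here (keep every feature whose best class-wise feature F-measure exceeds the mean over active features) is an assumption made for the example.

```python
import numpy as np

def feature_maximization_selection(X, y):
    """Sketch of feature selection driven by the feature F-measure.

    X: (n_docs, n_features) nonnegative weight matrix (e.g. bag of words).
    y: (n_docs,) integer class labels.
    Returns the indices of the selected features.
    """
    classes = np.unique(y)
    # Aggregate feature weights per class: shape (n_classes, n_features).
    W = np.vstack([X[y == c].sum(axis=0) for c in classes])
    eps = 1e-12  # guard against division by zero for absent features
    # Feature recall: share of a feature's total weight captured by each class.
    FR = W / (W.sum(axis=0, keepdims=True) + eps)
    # Feature precision: share of a class's total weight carried by each feature.
    FP = W / (W.sum(axis=1, keepdims=True) + eps)
    # Feature F-measure: harmonic mean of the two (cf. Note 1).
    FF = 2.0 * FR * FP / (FR + FP + eps)
    # Assumed selection rule: keep a feature if its best class-wise F-measure
    # exceeds the mean best F-measure of the features that occur at all.
    best = FF.max(axis=0)
    return np.flatnonzero(best > best[best > 0].mean())

# Toy usage: 4 documents, 5 terms; the last term occurs evenly in both
# classes and is therefore discarded as non-discriminant by the rule above.
X = np.array([[2, 0, 1, 0, 1],
              [0, 3, 0, 1, 1],
              [1, 0, 2, 0, 1],
              [0, 2, 0, 2, 1]])
y = np.array([0, 1, 0, 1])
print(feature_maximization_selection(X, y))  # -> [0 1 2]
```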

Notes

  1. Since feature recall is equivalent to the conditional probability P(g|p) and feature precision is equivalent to the conditional probability P(p|g), this strategy can be regarded as an expectation-maximization approach in the sense of the original definition given by Dempster et al. (1977). The harmonic mean gives additional influence to the lower of the two values when feature recall and feature precision are combined (the corresponding formulas are spelled out after these notes).

  2. See Section 4 for more details on the usual weighting schemes exploited for textual data; a classical example is given after these notes.

  3. The QUAERO project was initiated to meet the multimedia content analysis needs of consumers and professionals facing the rapid growth of accessible digital information. This collaborative research and development project focuses on the automatic extraction of information and on the analysis, classification, and usage of digital multimedia content for professionals and consumers. One specific subtask of the project is the development of automatic patent validation tools.

  4. http://www.ncbi.nlm.nih.gov/pubmed/

  5. http://web.ist.utl.pt/~acardoso/datasets/

  6. http://www.research.att.com/~lewis/reuters21578.html

  7. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/

  8. http://www.cs.waikato.ac.nz/ml/weka/

  9. In terms of active variables (see Section 3 for details).

  10. The computation is performed under Linux on a laptop equipped with an Intel® Pentium® B970 CPU at 2.3 GHz and 8 GB of RAM.

  11. http://www.quaero.org

  12. http://www.oseo.fr/
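
Two technical points in the notes above can be made explicit. First, regarding Note 1, and using the notation common in feature maximization work (an assumption here: $W_d^f$ denotes the weight of feature $f$ for data item $d$, $c$ ranges over the classes of $C$, and $F_c$ is the set of features occurring in class $c$), feature recall, feature precision, and their harmonic-mean combination, the feature F-measure, read:

$$FR_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{c' \in C} \sum_{d \in c'} W_d^f}, \qquad FP_c(f) = \frac{\sum_{d \in c} W_d^f}{\sum_{f' \in F_c} \sum_{d \in c} W_d^{f'}}, \qquad FF_c(f) = \frac{2\, FR_c(f)\, FP_c(f)}{FR_c(f) + FP_c(f)}.$$

Second, as an example of the weighting schemes mentioned in Note 2 (not necessarily the scheme retained in Section 4), the classical TF-IDF weighting in the spirit of Salton and Buckley (1988) assigns to term $f$ in document $d$ the weight

$$w_{d,f} = tf_{d,f} \times \log \frac{N}{df_f},$$

where $tf_{d,f}$ is the frequency of $f$ in $d$, $N$ is the total number of documents, and $df_f$ is the number of documents in which $f$ occurs.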

References

  • Aha, D., & Kibler, D. (1991). Instance-based learning algorithms. Machine Learning, 6, 37–66.

  • Attik, M., Lamirel, J.-C., Al Shehabi, S. (2006). Clustering analysis for data with multiple labels. In Proceedings of the IASTED International conference on databases and applications (DBA). Innsbruck.

  • Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. http://archive.ics.uci.edu/ml.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A. (2012). A review of feature selection methods on synthetic data. Knowledge and Information Systems, 1–37.

  • Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees. Belmont: Wadsworth International Group.

  • Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.

  • Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1), 155–176.

  • Daviet, H. (2009). Class-Add, une procédure de sélection de variables basée sur une troncature k-additive de l’information mutuelle et sur une classification ascendante hiérarchique en pré-traitement. PhD thesis, Université de Nantes, France.

  • Dempster, A.P., Laird, N.M., Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

  • Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289–1305.

  • Good, P. (2006). Resampling methods, 3rd edn. Birkhauser.

  • Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46(1), 389–422.

  • Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157–1182.

  • Hall, M.A., & Smith, L.A. (1999). Feature selection for machine learning: comparing a correlation-based filter approach to the wrapper. In Proceedings of the 12th international Florida artificial intelligence research society conference (pp. 235–239). AAAI Press.

  • Hajlaoui, K., Cuxac, P., Lamirel, J.C., Francois, C. (2012). Enhancing patent expertise through automatic matching with scientific papers. Discovery Science LNCS, 7569, 299–312.

  • Lang, K. (1995). NewsWeeder: learning to filter netnews. In Proceedings of the 12th international conference on machine learning (pp. 331–339).

  • Kohavi, R., & John, G.H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 273–324.

  • Kononenko, I. (1994). Estimating attributes: analysis and extensions of RELIEF. In Proceedings of the European conference on machine learning (pp. 171–182).

  • Ladha, L., & Deepa, T. (2011). Feature selection methods and algorithms. International Journal on Computer Science and Engineering, 3(5), 1787–1797.

  • Lallich, S., & Rakotomalala, R. (2000). Fast feature selection using partial correlation for multi-valued attributes. In D.A. Zighed, J. Komorowski, J. Zytkow (Eds.), Principles of data mining and knowledge discovery, Lecture notes in computer science (Vol. 1910, pp. 221–231). Berlin-Heidelberg: Springer.

  • Lamirel, J.-C., Al Shehabi, S., Francois, C., Hoffmann, M. (2004). New classification quality estimators for analysis of documentary information: application to patent analysis and web mapping. Scientometrics, 60(3).

  • Lamirel, J.-C., & Ta, A.P. (2008). Combination of hyperbolic visualization and graph-based approach for organizing data analysis results: an application to social network analysis. In Proceedings of the 4th international conference on webometrics, informetrics and scientometrics and 9th COLLNET meeting. Berlin.

  • Lamirel, J.-C., Ghribi, M., Cuxac, P. (2010). Unsupervised recall and precision measures: a step towards new efficient clustering quality indexes. In Proceedings of the 19th international conference on computational statistics (COMPSTAT’2010). Paris.

  • Lamirel, J.-C, Mall, R., Cuxac, P., Safi, G. (2011). Variations to incremental growing neural gas algorithm based on label maximization. In Proceedings of IJCNN 2011. San Jose.

  • Lamirel, J.-C. (2012). A new approach for automatizing the analysis of research topics dynamics: application to optoelectronics research. Scientometrics, 93, 151–166.

  • Mejía-Lavalle, M., Sucar, E., Arroyo, G. (2006). Feature selection with a perceptron neural net. Feature selection for data mining: interfacing machine learning and statistics.

  • Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.

  • Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf, C. Burges, A. Smola (Eds.). Advances in kernel methods - support vector learning. MIT Press.

  • Porter, M.F. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

  • Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo: Morgan Kaufmann.

  • Salton, G. (1971). Automatic processing of foreign language documents. Englewood Cliffs: Prentice-Hall.

  • Salton, G., & Buckley, C. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of international conference on new methods in language processing.

  • Witten, I.H., & Frank, E. (2005). Data mining: practical machine learning tools and techniques. Morgan Kaufmann.

  • Yu, L., & Liu, H. (2003). Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th international conference on machine learning (ICML 2003) (pp. 856–863). Washington, DC.

  • Zhang, T., & Oles, F.J. (2001). Text categorization based on regularized linear classification methods. Information Retrieval, 4(1), 5–31.

Acknowledgments

This work was done under the QUAERO program (Note 11), supported by OSEO (Note 12), the French national agency for research development.

Author information

Corresponding author

Correspondence to Jean-Charles Lamirel.

About this article

Cite this article

Lamirel, JC., Cuxac, P., Chivukula, A.S. et al. Optimizing text classification through efficient feature selection based on quality metric. J Intell Inf Syst 45, 379–396 (2015). https://doi.org/10.1007/s10844-014-0317-4

