Critical Feature Selection and Critical Sampling for Data Mining

Ribeiro, Bernardete; Silva, José; Sung, Andrew H.; Suryakumar, Divya

doi:10.1007/978-981-13-0716-4_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 844)

Included in the following conference series:

International Conference on Computational Intelligence, Cyber Security, and Computational Models

Abstract

The rapidly growing volume of big data generated by connected sensors, devices, the web, social network platforms, and other sources has stimulated the advancement of data science, which holds tremendous potential for problem solving in various domains. How to properly use the data in model building to obtain accurate analytics and knowledge discovery is a topic of great importance in data mining, and two issues arise: how to select a critical subset of features and how to select a critical subset of data points for sampling. This paper presents ongoing research that suggests: (1) the critical feature dimension problem is theoretically intractable, but simple heuristic methods may well be sufficient for practical purposes; (2) there are big data analytic problems where evidence suggests that the success of data mining depends more on the critical feature dimension than on the specific features selected, so a random selection of features based on the dataset's critical feature dimension will prove sufficient; and (3) the problem of critical sampling has the same intractable complexity as the critical feature dimension problem, but again simple heuristic methods may well be practicable in most applications; experimental results with several versions of the heuristic method are presented and discussed. Finally, a set of metrics for data quality is proposed based on the concepts of critical features and critical sampling.
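The abstract does not spell out the heuristic methods it refers to. As an illustration only, the Python sketch below assumes one simple possibility: order the candidates (ranked features, or shuffled data points), then scan prefixes of increasing size until a performance threshold is met. The names `critical_size`, `toy_score`, `INFORMATIVE`, and `ranked_features` are hypothetical, and the scoring function is a toy stand-in for training and validating a real model. It is not the authors' procedure.

```python
from typing import Callable, Optional, Sequence

# Hypothetical toy setup: features 0, 1 and 2 carry all the signal.
INFORMATIVE = {0, 1, 2}


def toy_score(selected: Sequence[int]) -> float:
    """Stand-in for model performance: fraction of informative features present.

    In practice this would train a model on the selected features (or data
    points) and return a validation metric such as accuracy.
    """
    return len(INFORMATIVE.intersection(selected)) / len(INFORMATIVE)


def critical_size(
    items: Sequence[int],
    evaluate: Callable[[Sequence[int]], float],
    threshold: float,
) -> Optional[int]:
    """Return the smallest prefix length k of `items` whose score meets `threshold`.

    This is a linear scan over prefixes of a fixed ordering, so it is a
    heuristic: it does not search all subsets. Returns None if no prefix
    reaches the threshold.
    """
    for k in range(1, len(items) + 1):
        if evaluate(items[:k]) >= threshold:
            return k
    return None


# Assumed ranking of six features, most important first (hypothetical).
ranked_features = [2, 0, 5, 1, 3, 4]
result = critical_size(ranked_features, toy_score, threshold=1.0)
print(result)  # prints 4
```

Under the same assumptions, passing a shuffled list of data-point indices as `items`, with an `evaluate` function that trains on those points, gives a comparable estimate of a critical sample size. The scan costs one model evaluation per prefix, which is the price of avoiding a search over all subsets.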

Acknowledgements

The authors gratefully acknowledge the reviewer of earlier versions of their papers, whose insightful comments pointed them to fruitful directions of study, and their colleagues and students who contributed to many helpful discussions and experiments.

Author information

Authors and Affiliations

Department of Informatics Engineering, University of Coimbra, 3030-290, Coimbra, Portugal
Bernardete Ribeiro & José Silva
School of Computing, The University of Southern Mississippi, Hattiesburg, MS, 39406, USA
Andrew H. Sung
ConstructConnect, Cincinnati, USA
Divya Suryakumar

Corresponding author

Correspondence to Andrew H. Sung.

Editor information

Editors and Affiliations

PSG College of Technology, Coimbatore, India
Geetha Ganapathi
BITS Pilani, KK Birla, Goa, India
Arumugam Subramaniam
University of the Basque Country, San Sebastian, Spain
Manuel Graña
PSG College of Technology, Coimbatore, India
Suresh Balusamy
PSG College of Technology, Coimbatore, India
Rajamanickam Natarajan
PSG College of Technology, Coimbatore, India
Periakaruppan Ramanathan

Cite this paper

Ribeiro, B., Silva, J., Sung, A.H., Suryakumar, D. (2018). Critical Feature Selection and Critical Sampling for Data Mining. In: Ganapathi, G., Subramaniam, A., Graña, M., Balusamy, S., Natarajan, R., Ramanathan, P. (eds) Computational Intelligence, Cyber Security and Computational Models. Models and Techniques for Intelligent Systems and Automation. ICC3 2017. Communications in Computer and Information Science, vol 844. Springer, Singapore. https://doi.org/10.1007/978-981-13-0716-4_2

DOI: https://doi.org/10.1007/978-981-13-0716-4_2
Published: 11 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0715-7
Online ISBN: 978-981-13-0716-4
eBook Packages: Computer Science, Computer Science (R0)
