Abstract
In this work we demonstrate the effect of small sample size on the risk that feature selection algorithms will select irrelevant features when dealing with high-dimensional data. We develop a simple analytical model to quantify this risk and verify the model by means of simulation. These results (i) explain the inherent instability of feature selection from high-dimensional, small-sample data and (ii) can be used to estimate the minimum sample size required for stable feature selection. Such results are useful when dealing with data from high-throughput studies.
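The analytical model itself is not given in this abstract, so the following is only an illustrative simulation sketch, not the author's method. It assumes a hypothetical setup: independent Gaussian features, a small block of truly relevant features whose values are shifted in proportion to the target, and selection of the top-ranked features by absolute sample correlation. The function name `irrelevant_fraction` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def irrelevant_fraction(n_samples, n_features=1000, n_relevant=20,
                        effect=0.5, top_k=20, n_trials=20, seed=0):
    """Estimate the mean fraction of irrelevant features among the top_k
    features ranked by absolute sample correlation with the target.

    All defaults are hypothetical; they are not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_trials):
        y = rng.standard_normal(n_samples)
        X = rng.standard_normal((n_samples, n_features))
        # Assumption: only the first n_relevant columns depend on y.
        X[:, :n_relevant] += effect * y[:, None]

        # Sample Pearson correlation of every feature with the target.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

        # Select the top_k features with the largest |r|.
        selected = np.argsort(-np.abs(r))[:top_k]
        # Indices >= n_relevant are irrelevant by construction.
        fractions.append(np.mean(selected >= n_relevant))
    return float(np.mean(fractions))


if __name__ == "__main__":
    for n in (20, 50, 100, 500):
        print(n, irrelevant_fraction(n))
```

Running the sketch for increasing `n_samples` shows how the share of irrelevant features among those selected changes with sample size under these assumptions; the sample size at which that share becomes acceptably small depends entirely on the assumed effect size and dimensionality.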
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Maciejewski, H. (2015). Risk of Selection of Irrelevant Features from High-Dimensional Data with Small Sample Size. In: Steland, A., Rafajłowicz, E., Szajowski, K. (eds) Stochastic Models, Statistics and Their Applications. Springer Proceedings in Mathematics & Statistics, vol 122. Springer, Cham. https://doi.org/10.1007/978-3-319-13881-7_44
DOI: https://doi.org/10.1007/978-3-319-13881-7_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13880-0
Online ISBN: 978-3-319-13881-7
eBook Packages: Mathematics and Statistics (R0)