Abstract
Machine learning research, e.g. genomics research, is often based on sparse datasets that have very large numbers of features but small sample sizes. Such a configuration promotes the influence of chance on both the learning process and the evaluation. Prior research has underlined the problem of generalization of models obtained from such data. In this paper, we investigate in depth the influence of chance on classification and regression. We empirically show how considerable the influence of chance on such datasets is, which brings the conclusions drawn from them into question. We relate the observations of chance correlation to the problem of method generalization. Finally, we provide a discussion of chance correlation and guidelines that mitigate the influence of chance.
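The effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the experimental setup of the paper: it assumes purely random Gaussian features and a purely random target, and the function name `chance_correlation_curve`, the sample size, and the feature counts are all hypothetical choices made for illustration. It reports the largest absolute Pearson correlation found between the target and any of the first p random features.

```python
import numpy as np


def chance_correlation_curve(n_samples, feature_counts, seed=0):
    """Max |Pearson r| between a random target and the first p random features.

    Features and target are independent noise, so any correlation found
    is due to chance alone (an assumption of this illustration).
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_samples)
    X = rng.standard_normal((n_samples, max(feature_counts)))

    # Standardize with the population standard deviation, so that the
    # mean of the products equals the Pearson correlation coefficient.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    abs_r = np.abs(Xc.T @ yc) / n_samples

    # Taking the maximum over a growing prefix of the same feature matrix
    # makes the curve non-decreasing in p by construction.
    return {p: float(abs_r[:p].max()) for p in feature_counts}


if __name__ == "__main__":
    curve = chance_correlation_curve(n_samples=20, feature_counts=[10, 1_000, 10_000])
    for p, r in curve.items():
        print(f"features={p:>6}  max |r| = {r:.3f}")
```

With only 20 samples, the strongest spurious correlation grows as more random features are searched, even though no feature carries any information about the target. A feature selected this way would look predictive on the same data and fail on new data, which is the generalization problem the abstract refers to.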
Copyright information
© 2019 Springer Fachmedien Wiesbaden GmbH, ein Teil von Springer Nature
About this paper
Cite this paper
Taha, A.A., Bampoulidis, A., Lupu, M. (2019). Chance influence in datasets with a large number of features. In: Haber, P., Lampoltshammer, T., Mayr, M. (eds) Data Science – Analytics and Applications. Springer Vieweg, Wiesbaden. https://doi.org/10.1007/978-3-658-27495-5_2
DOI: https://doi.org/10.1007/978-3-658-27495-5_2
Publisher Name: Springer Vieweg, Wiesbaden
Print ISBN: 978-3-658-27494-8
Online ISBN: 978-3-658-27495-5
eBook Packages: Computer Science and Engineering (German Language)