Abstract
A scientific phenomenon under study is often manifested through data arising from several processes, or sources, each of which may describe the phenomenon. In this multi-source setting, we define the “out-of-source” error: the error committed when a new observation of unknown source origin is allocated to one of the sources using a rule trained on the known labeled data. We present an unbiased estimator of this error and discuss its variance. We derive natural and easily verifiable assumptions under which our estimator is consistent for a broad class of loss functions and data distributions. Finally, we evaluate our theoretical results in a simulation study.
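To make the idea of an error measured on an unseen source concrete, the following is a minimal sketch of a generic leave-one-source-out scheme, one common multi-source cross-validation-type procedure. It is not the chapter's estimator, whose exact form the abstract does not give. The toy mean-predicting rule, the squared-error loss, the function names, and the source labels `site_a`, `site_b`, `site_c` are all assumptions introduced for illustration.

```python
from statistics import fmean


def squared_error(prediction, y):
    """Illustrative loss; any loss taking (prediction, y) could be swapped in."""
    return (prediction - y) ** 2


def fit_mean_rule(training_values):
    """Hypothetical toy rule: always predict the mean of the training responses."""
    center = fmean(training_values)
    return lambda: center


def leave_one_source_out_error(sources, loss=squared_error):
    """Average held-out loss when each source in turn plays the unseen source.

    `sources` maps a source label to a list of observed responses.
    """
    per_source_losses = []
    for held_out, test_values in sources.items():
        # Train only on observations from the remaining sources.
        training_values = [
            y
            for label, values in sources.items()
            if label != held_out
            for y in values
        ]
        predict = fit_mean_rule(training_values)
        # Mean loss of the trained rule on the held-out source.
        per_source_losses.append(fmean(loss(predict(), y) for y in test_values))
    # Each source contributes equally, regardless of its sample size.
    return fmean(per_source_losses)


# Made-up data for three hypothetical sources.
sources = {
    "site_a": [1.0, 2.0, 3.0],
    "site_b": [2.0, 4.0],
    "site_c": [3.0, 3.0, 3.0],
}
estimate = leave_one_source_out_error(sources)
print(round(estimate, 4))
```

Weighting every source equally is a design choice of this sketch; weighting by source sample size would be an equally plausible alternative, and which one matches the chapter's estimator is not stated in the abstract.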
Acknowledgements
Dr. Markatou would like to thank the Jacobs School of Medicine and Biomedical Science for facilitating this work through institutional financial resources (to M. Markatou) that supported the work of the first author of this paper.
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Afendras, G., Markatou, M. (2017). The Out-of-Source Error in Multi-Source Cross Validation-Type Procedures. In: Chen, DG., Jin, Z., Li, G., Li, Y., Liu, A., Zhao, Y. (eds) New Advances in Statistics and Data Science. ICSA Book Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-69416-0_2
DOI: https://doi.org/10.1007/978-3-319-69416-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69415-3
Online ISBN: 978-3-319-69416-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)