Abstract
Data Fusion and Data Grafting are concerned with combining files and information coming from different sources. The problem is not to extract data from a single database, but to merge information collected from different sample surveys. The typical data fusion situation involves two data samples: the former consists of a complete data matrix X from a first survey, and the latter, Y, contains a certain number of missing variables. The aim is to complete the matrix Y using the knowledge acquired from X. Thus, the goal is to define the correlation structure that links the two data matrices to be merged. In this paper, we provide an innovative methodology for Data Fusion based on an incremental imputation algorithm in tree-based models. In addition, we consider robust tree validation by boosting iterations. A relevant advantage of the proposed method is that it works for a mixed data structure including both numerical and categorical variables. As benchmarking methods we consider explicit methods, such as standard trees and multiple regression, as well as an implicit method based on principal component analysis. An extensive simulation study shows that the proposed method is more accurate than the other methods.
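The general idea of completing Y from X one variable at a time can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes numerical variables only, uses a small greedy least-squares regression tree, and omits the boosting step entirely. The function names (`fit_tree`, `predict`, `fuse`) and the toy data are hypothetical, introduced only for illustration.

```python
# Illustrative sketch only: numerical data, no boosting, hypothetical names.

def mean(values):
    return sum(values) / len(values)

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(rows, targets, depth=2, min_leaf=2):
    """Fit a small regression tree by greedy least-squares splitting."""
    if depth == 0 or len(rows) < 2 * min_leaf:
        return mean(targets)  # leaf: predict the mean of the node
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    if best is None:
        return mean(targets)
    _, j, threshold = best
    li = [i for i, r in enumerate(rows) if r[j] <= threshold]
    ri = [i for i, r in enumerate(rows) if r[j] > threshold]
    return (j, threshold,
            fit_tree([rows[i] for i in li], [targets[i] for i in li],
                     depth - 1, min_leaf),
            fit_tree([rows[i] for i in ri], [targets[i] for i in ri],
                     depth - 1, min_leaf))

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if row[j] <= threshold else right
    return tree

def fuse(donor_common, donor_specific, recipient_common):
    """Impute the donor-only variables into the recipient, one at a time.

    After each step the variable just handled joins the predictors
    (observed values in the donor, imputed values in the recipient),
    so later imputations can use it.
    """
    donor_rows = [list(r) for r in donor_common]
    recipient_rows = [list(r) for r in recipient_common]
    n_common = len(recipient_common[0])
    for k in range(len(donor_specific[0])):
        targets = [s[k] for s in donor_specific]
        tree = fit_tree(donor_rows, targets)
        for row in recipient_rows:
            row.append(predict(tree, row))
        for row, s in zip(donor_rows, donor_specific):
            row.append(s[k])
    return [row[n_common:] for row in recipient_rows]

# Toy data: X (donor) has one common and two specific variables;
# Y (recipient) has only the common variable.
x_common = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
x_specific = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0],
              [20.0, 1.0], [20.0, 1.0], [20.0, 1.0]]
y_common = [[2.5], [11.5]]
imputed = fuse(x_common, x_specific, y_common)
```

Here X plays the role of the donor file and Y the recipient. Handling categorical variables and validating the trees by boosting, both of which the abstract mentions, would require additions that this sketch does not attempt.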
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
D’Ambrosio, A., Aria, M., Siciliano, R. (2007). Robust Tree-Based Incremental Imputation Method for Data Fusion. In: Berthold, M.R., Shawe-Taylor, J., Lavrač, N. (eds) Advances in Intelligent Data Analysis VII. IDA 2007. Lecture Notes in Computer Science, vol 4723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74825-0_16
DOI: https://doi.org/10.1007/978-3-540-74825-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74824-3
Online ISBN: 978-3-540-74825-0
eBook Packages: Computer Science, Computer Science (R0)