Robust Tree-Based Incremental Imputation Method for Data Fusion

D’Ambrosio, Antonio; Aria, Massimo; Siciliano, Roberta

doi:10.1007/978-3-540-74825-0_16


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4723)

Included in the following conference series:

International Symposium on Intelligent Data Analysis


Abstract

Data Fusion and Data Grafting are concerned with combining files and information coming from different sources. The problem is not to extract data from a single database, but to merge information collected from different sample surveys. The typical data fusion situation consists of two data samples: the former is a complete data matrix X from a first survey, and the latter is a matrix Y in which a certain number of variables are missing. The aim is to complete the matrix Y using the knowledge acquired from X. Thus, the goal is to define the correlation structure that links the two data matrices to be merged. In this paper, we provide an innovative methodology for Data Fusion based on an incremental imputation algorithm in tree-based models. In addition, we consider robust tree validation by boosting iterations. A relevant advantage of the proposed method is that it works for a mixed data structure including both numerical and categorical variables. As benchmarking methods we consider explicit methods, such as standard trees and multiple regression, as well as an implicit method based on principal component analysis. An extensive simulation study shows that the proposed method is more accurate than the other methods.
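
The incremental idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it omits boosting, handles numerical variables only, and imputes the missing variables in the order given rather than in any data-driven order. The function names (`fit_tree`, `predict`, `fuse`) and the toy data are hypothetical and chosen only for illustration. A small regression tree is fitted on the donor file X for each variable missing from the recipient file Y, and each imputed variable is then added to the predictors used for the next one.

```python
from statistics import mean

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(rows, target, depth=3, min_leaf=2):
    """Fit a small regression tree by exhaustive search over binary splits."""
    best = None
    if depth > 0 and len(rows) >= 2 * min_leaf:
        for j in range(len(rows[0])):
            # Candidate cut points: every observed value except the largest.
            for t in sorted({r[j] for r in rows})[:-1]:
                left = [i for i, r in enumerate(rows) if r[j] <= t]
                right = [i for i, r in enumerate(rows) if r[j] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = sse([target[i] for i in left]) + sse([target[i] for i in right])
                if best is None or cost < best[0]:
                    best = (cost, j, t, left, right)
    if best is None:
        return {"leaf": mean(target)}
    _, j, t, left, right = best
    return {
        "var": j,
        "cut": t,
        "left": fit_tree([rows[i] for i in left], [target[i] for i in left],
                         depth - 1, min_leaf),
        "right": fit_tree([rows[i] for i in right], [target[i] for i in right],
                          depth - 1, min_leaf),
    }

def predict(tree, row):
    """Drop one row down the tree and return the mean of the leaf it reaches."""
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["leaf"]

def fuse(donor_common, donor_specific, recipient_common, depth=3):
    """Impute the donor-only variables into the recipient file, one at a time."""
    donor_pred = [list(r) for r in donor_common]
    recip_pred = [list(r) for r in recipient_common]
    imputed = [[] for _ in recipient_common]
    for k in range(len(donor_specific[0])):
        y = [r[k] for r in donor_specific]
        tree = fit_tree(donor_pred, y, depth)
        for i, row in enumerate(recip_pred):
            value = predict(tree, row)
            imputed[i].append(value)
            row.append(value)  # the imputed value becomes a predictor
        for row, observed in zip(donor_pred, y):
            row.append(observed)  # the donor side uses the observed value
    return imputed

# Toy data (made up): one common variable, two variables seen only in X.
donor_common = [[float(x)] for x in range(10)]
donor_specific = [[0.0, 0.0] if x < 5 else [10.0, 20.0] for x in range(10)]
recipient_common = [[1.0], [8.0]]
print(fuse(donor_common, donor_specific, recipient_common))
# prints [[0.0, 0.0], [10.0, 20.0]]
```

Growing the predictor set after each imputation is what makes the scheme incremental: later variables can borrow strength from earlier imputations as well as from the variables common to both files.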



Author information

Authors and Affiliations

Dipartimento di Matematica e Statistica, Università di Napoli Federico II, Italy
Antonio D’Ambrosio, Massimo Aria & Roberta Siciliano


Editor information

Michael R. Berthold, John Shawe-Taylor, Nada Lavrač


About this paper

Cite this paper

D’Ambrosio, A., Aria, M., Siciliano, R. (2007). Robust Tree-Based Incremental Imputation Method for Data Fusion. In: Berthold, M.R., Shawe-Taylor, J., Lavrač, N. (eds) Advances in Intelligent Data Analysis VII. IDA 2007. Lecture Notes in Computer Science, vol 4723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74825-0_16


DOI: https://doi.org/10.1007/978-3-540-74825-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74824-3
Online ISBN: 978-3-540-74825-0
eBook Packages: Computer Science, Computer Science (R0)
