Robust Tree-Based Incremental Imputation Method for Data Fusion

  • Conference paper
Advances in Intelligent Data Analysis VII (IDA 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4723)

Abstract

Data Fusion and Data Grafting are concerned with combining files and information coming from different sources. The problem is not to extract data from a single database, but to merge information collected from different sample surveys. The typical data fusion situation consists of two data samples: the former is a complete data matrix X from a first survey, and the latter is a matrix Y in which a certain number of variables are missing. The aim is to complete the matrix Y using the knowledge acquired from X. The goal is therefore to define the correlation structure that links the two data matrices to be merged. In this paper, we provide an innovative methodology for Data Fusion based on an incremental imputation algorithm in tree-based models. In addition, we consider robust tree validation by boosting iterations. A relevant advantage of the proposed method is that it works for a mixed data structure including both numerical and categorical variables. As benchmarking methods we consider explicit methods, such as standard trees and multiple regression, as well as an implicit method based on principal component analysis. An extensive simulation study shows that the proposed method is more accurate than the benchmark methods.
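The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes numerical variables only, uses a depth-1 regression tree (a stump) in place of a full tree, and omits the boosting step. The function names `fit_stump` and `fuse`, the variable names, and the toy data are hypothetical. A donor file X is complete, a recipient file Y lacks some variables, and each missing variable is imputed in turn, joining the predictors for the next one.

```python
from statistics import mean


def fit_stump(rows, target, predictors):
    """Fit a depth-1 regression tree: choose the single split that
    minimises the squared error of `target` on the donor rows."""
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        # Candidate thresholds are midpoints between distinct values,
        # so both sides of every split are non-empty.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[target] for r in rows if r[p] <= thr]
            right = [r[target] for r in rows if r[p] > thr]
            ml, mr = mean(left), mean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, p, thr, ml, mr)
    if best is None:
        # No predictor varies: fall back to the overall donor mean.
        overall = mean(r[target] for r in rows)
        return lambda r: overall
    _, p, thr, ml, mr = best
    return lambda r: ml if r[p] <= thr else mr


def fuse(donor, recipient, common, missing):
    """Complete the recipient file one missing variable at a time.
    Each imputed variable is added to the predictor set, so later
    imputations can use it (the incremental step)."""
    predictors = list(common)
    completed = [dict(r) for r in recipient]
    for target in missing:
        predict = fit_stump(donor, target, predictors)
        for r in completed:
            r[target] = predict(r)
        predictors.append(target)
    return completed


# Toy data (made up for illustration): X is complete, Y has only "age".
donor = [
    {"age": 20, "income": 10.0, "spend": 1.0},
    {"age": 25, "income": 12.0, "spend": 1.0},
    {"age": 50, "income": 30.0, "spend": 5.0},
    {"age": 55, "income": 32.0, "spend": 5.0},
]
recipient = [{"age": 22}, {"age": 60}]

completed = fuse(donor, recipient, common=["age"], missing=["income", "spend"])
for row in completed:
    print(row)
```

In this sketch the order of `missing` is supplied by the caller. A fuller version would also need to handle categorical variables and validate each tree, which the abstract attributes to boosting iterations.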

Editor information

Michael R. Berthold, John Shawe-Taylor, Nada Lavrač

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

D’Ambrosio, A., Aria, M., Siciliano, R. (2007). Robust Tree-Based Incremental Imputation Method for Data Fusion. In: Berthold, M.R., Shawe-Taylor, J., Lavrač, N. (eds) Advances in Intelligent Data Analysis VII. IDA 2007. Lecture Notes in Computer Science, vol 4723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74825-0_16

  • DOI: https://doi.org/10.1007/978-3-540-74825-0_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74824-3

  • Online ISBN: 978-3-540-74825-0

  • eBook Packages: Computer Science, Computer Science (R0)
