Abstract
Ensuring high-quality data when collecting and integrating information from heterogeneous sources into a data warehouse is a challenging problem. In this paper, we propose a model for XML data fusion that allows the integrator to define data cleaning rules for resolving value conflicts detected during the integration process. These rules capture the decisions that users make when data are manually curated. Once a rule is defined, conflicts detected in subsequent integration processes that fall within its context can be resolved automatically, without user intervention. We also introduce a notion of fusion policy validation that prevents conflicting resolution rules from being defined. To validate our proposal, we developed XFusion, a rule-based cleaning tool that stores curated data in an integrated repository.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cecchin, F., de Aguiar Ciferri, C.D., Hara, C.S. (2010). XML Data Fusion. In: Bach Pedersen, T., Mohania, M.K., Tjoa, A.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2010. Lecture Notes in Computer Science, vol 6263. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15105-7_24
DOI: https://doi.org/10.1007/978-3-642-15105-7_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15104-0
Online ISBN: 978-3-642-15105-7
eBook Packages: Computer Science (R0)