Abstract
We address the problem of learning to automatically map flat and semi-structured documents onto a mediated target XML schema. We propose a machine learning approach in which the mapping between input and target documents is learned from examples. Complex transformations can be learned using only pairs of input documents and their corresponding target documents. From a machine learning point of view, the structure mapping task raises significant complexity challenges. We therefore propose an original model that scales well to real-world applications, and we provide learning and inference procedures with low complexity. The model builds the target XML document sequentially, processing the input document node by node. We demonstrate the efficiency of our model on two structure mapping tasks. To our knowledge, no other model is yet able to solve these tasks.
This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maes, F., Denoyer, L., Gallinari, P. (2007). XML Structure Mapping. In: Fuhr, N., Lalmas, M., Trotman, A. (eds) Comparative Evaluation of XML Information Retrieval Systems. INEX 2006. Lecture Notes in Computer Science, vol 4518. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73888-6_49
DOI: https://doi.org/10.1007/978-3-540-73888-6_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73887-9
Online ISBN: 978-3-540-73888-6
eBook Packages: Computer Science (R0)