Abstract
The WWW contains a huge number of documents. Some of them share a subject but are produced by different people or even different organizations. To guarantee the interchange of such documents, we can use XML, which allows documents to be shared even when they do not have the same structure. However, this makes it difficult to understand the core of such heterogeneous documents (in general, a schema is not available). In this paper, we offer a characterization of, and an algorithm to obtain, the midpoint (in terms of a resemblance function) of a set of semi-structured, heterogeneous documents without optional elements. The trivial case of the midpoint would be the elements common to all documents. Nevertheless, with several heterogeneous documents this may result in an empty set. Thus, we consider that those elements present in a given number of documents belong to the midpoint. An exact schema could always be found by generating optional elements. However, the exact schema of the whole set may result in overspecialization (lots of optional elements), which would make it useless.
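The threshold idea in the abstract can be illustrated with a minimal sketch. It is not the algorithm of the paper, which defines the midpoint through a resemblance function; it only counts in how many documents each element path occurs and keeps the paths that reach a chosen fraction. The function names `element_paths` and `frequent_paths`, the parameter `min_fraction`, and the sample documents are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths occurring in one document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def frequent_paths(documents, min_fraction):
    """Return the paths present in at least min_fraction of the documents.

    Hypothetical helper: a plain frequency threshold, used here as a
    stand-in for the resemblance-based midpoint described in the abstract.
    """
    counts = Counter()
    for doc in documents:
        # Each path is counted once per document, however often it repeats.
        counts.update(element_paths(doc))
    needed = min_fraction * len(documents)
    return {path for path, count in counts.items() if count >= needed}


# Invented sample documents that share a subject but not a structure.
docs = [
    "<book><title/><author/><year/></book>",
    "<book><title/><author/><publisher/></book>",
    "<book><title/><isbn/></book>",
]

common = frequent_paths(docs, 1.0)    # elements present in every document
majority = frequent_paths(docs, 0.6)  # elements present in at least 60%
print(sorted(common))
print(sorted(majority))
```

With a fraction of 1.0 the result is the trivial case mentioned above, the elements common to all documents; lowering the fraction admits elements such as `author` that appear in most documents but not all of them.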
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abelló, A., de Palol, X., Hacid, MS. (2005). On the Midpoint of a Set of XML Documents. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_43
DOI: https://doi.org/10.1007/11546924_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28566-3
Online ISBN: 978-3-540-31729-6
eBook Packages: Computer Science, Computer Science (R0)