Abstract
As more and more digital documents become available for public use from different sources, the needs of users also increase. One of these needs is seamless integration of heterogeneous collections, e.g., the ability to query and format documents in a uniform way. Processing of documents is greatly enhanced if their structure is explicitly represented by some standard (SGML, XML, HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration.
We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions, and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a pre-defined generic DTD.
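The idea of processing a document as a virtual document under a generic DTD can be illustrated with a minimal Python sketch. It assumes that the class of each element type is already known, and it does not show how the classification itself is acquired. The mapping `GENERIC_CLASS`, the class names (`SECTION`, `TITLE`, `PARAGRAPH`), the sample element names, and the `orig` attribute are all hypothetical, and the sketch works on XML with the standard library rather than on SGML architectural forms.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element types to generic classes.
GENERIC_CLASS = {
    "article": "SECTION",
    "chapter": "SECTION",
    "heading": "TITLE",
    "para": "PARAGRAPH",
}

def to_generic(element):
    """Return a copy of the tree with each element renamed to its generic class.

    The tree hierarchy is kept as-is; only element names change. Element
    types without a known class keep their original name.
    """
    generic = ET.Element(GENERIC_CLASS.get(element.tag, element.tag),
                         dict(element.attrib))
    generic.set("orig", element.tag)  # remember the original element type
    generic.text = element.text
    generic.tail = element.tail
    for child in element:
        generic.append(to_generic(child))
    return generic

source = ET.fromstring(
    "<article><heading>Intro</heading>"
    "<chapter><heading>One</heading><para>Text.</para></chapter></article>"
)
virtual = to_generic(source)
print(ET.tostring(virtual, encoding="unicode"))
```

Because only element names change, a tool written against the generic names could, under these assumptions, traverse documents from different collections in the same way.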
Beyond formatting and querying, specific uses of the classification include assembling new documents from existing document fragments and automatically generating style sheet templates for the original document type definitions. We have implemented the classification method and experimented with several document types.
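As a rough illustration of style sheet template generation, the following Python sketch derives one rule per original element type from a style assigned to its generic class. The element names, class names, style strings, and the CSS-like output syntax are all hypothetical and chosen only for illustration; the abstract does not specify the style sheet language or the generation procedure.

```python
# Hypothetical inputs: a class per element type and one style per generic class.
ELEMENT_CLASS = {"chapter": "SECTION", "heading": "TITLE", "para": "PARAGRAPH"}
CLASS_STYLE = {
    "SECTION": "display: block; margin-top: 2em",
    "TITLE": "display: block; font-weight: bold",
    "PARAGRAPH": "display: block; margin: 1em 0",
}

def style_template(element_class, class_style):
    """Build one CSS-like rule per original element type from its class style."""
    rules = []
    for name in sorted(element_class):
        style = class_style.get(element_class[name])
        if style is not None:  # skip element types whose class has no style
            rules.append(f"{name} {{ {style} }}")
    return "\n".join(rules)

template = style_template(ELEMENT_CLASS, CLASS_STYLE)
print(template)
```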
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ahonen, H., Heikkinen, B., Heinonen, O., Jaakkola, J., Klemettinen, M. (1998). Analysis of Document Structures for Element Type Classification. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_3
DOI: https://doi.org/10.1007/3-540-49654-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65086-7
Online ISBN: 978-3-540-49654-0
eBook Packages: Springer Book Archive