Analysis of Document Structures for Element Type Classification

  • Conference paper
Principles of Digital Document Processing (PODDP 1998)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1481)

Abstract

As more and more digital documents from different sources become available for public use, the needs of users also increase. One such need is the seamless integration of heterogeneous collections, e.g., the ability to query and format documents in a uniform way. Document processing is greatly enhanced if the structure of documents is explicitly represented in some standard markup (SGML, XML, HTML). Hence, the problem of integrating heterogeneous structures has to be taken into consideration.

We address this problem by introducing a classification method that acquires knowledge from document instances and their document type definitions, and uses this knowledge to attach a generic class to each SGML element type. The classification retains the tree hierarchy of elements. Although the structure is simplified, enough distinctions remain to facilitate versatile further processing, e.g., formatting. The class of an element type can be stored in the document type definition and, using the architectural form feature of SGML, the documents can be processed as virtual documents obeying a pre-defined generic DTD.
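
To make the idea of a virtual document concrete, here is a minimal sketch, not the authors' implementation. It assumes XML input in place of SGML, uses a hand-written mapping where the paper's method would derive classes from instances and DTDs, and uses illustrative class names (`document`, `section`, `title`, `paragraph`) that are not taken from the paper. It renames each element to its generic class while keeping the tree hierarchy intact.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element types to generic classes.
# In the paper this knowledge is acquired from instances and DTDs;
# here it is written by hand, and the class names are assumptions.
GENERIC_CLASS = {
    "article": "document",
    "sect": "section",
    "hd": "title",
    "p": "paragraph",
}


def to_generic(element, mapping, default="block"):
    """Return a copy of the tree with each element renamed to its generic class."""
    generic = ET.Element(mapping.get(element.tag, default))
    # Record the original element type so later processing can still see it.
    generic.set("orig", element.tag)
    generic.text = element.text
    generic.tail = element.tail
    # Recurse over children so the nesting of the source tree is preserved.
    for child in element:
        generic.append(to_generic(child, mapping, default))
    return generic


source = ET.fromstring(
    "<article><sect><hd>Intro</hd><p>Some text.</p></sect></article>"
)
virtual = to_generic(source, GENERIC_CLASS)
print(ET.tostring(virtual, encoding="unicode"))
```

A formatter or query processor written against the generic element names could then handle documents from different source DTDs in the same way, which is the role the abstract assigns to the pre-defined generic DTD.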

Specific uses of the classification, in addition to formatting and querying, include the assembly of new documents from existing document fragments and the automatic generation of style sheet templates for the original document type definitions. We have implemented the classification method and experimented with several document types.

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ahonen, H., Heikkinen, B., Heinonen, O., Jaakkola, J., Klemettinen, M. (1998). Analysis of Document Structures for Element Type Classification. In: Munson, E.V., Nicholas, C., Wood, D. (eds) Principles of Digital Document Processing. PODDP 1998. Lecture Notes in Computer Science, vol 1481. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49654-8_3

  • DOI: https://doi.org/10.1007/3-540-49654-8_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-65086-7

  • Online ISBN: 978-3-540-49654-0

  • eBook Packages: Springer Book Archive
