XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents

  • Conference paper
Database and Expert Systems Applications (DEXA 2003)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2736))

Abstract

Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose, a combination of XML and knowledge technologies can be used. XML distinguishes between data, its structure and its semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationships. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. In this paper, we propose applying knowledge technologies to several of these document processing steps: rule-based systems for the semantic indexing of documents, and machine learning techniques for extracting the knowledge those systems require. This approach has been implemented in the system WISDOM++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Malerba, D., Ceci, M., Berardi, M. (2003). XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_26

  • DOI: https://doi.org/10.1007/978-3-540-45227-0_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40806-2

  • Online ISBN: 978-3-540-45227-0

  • eBook Packages: Springer Book Archive
