XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents

Malerba, Donato; Ceci, Michelangelo; Berardi, Margherita

doi:10.1007/978-3-540-45227-0_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2736)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Effective daily processing of large amounts of paper documents in office environments requires the application of semantic-based indexing techniques during the transformation of paper documents to electronic format. For this purpose a combination of both XML and knowledge technologies can be used. XML distinguishes between data, its structure and semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationship. Moreover, the combination with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, an effective transformation of paper documents into XML format is a complex process involving several steps. In this paper we propose the application of knowledge technologies to many document processing steps, namely rule-based systems for semantic indexing of documents and the extraction of the necessary knowledge by means of machine learning techniques. This approach has been implemented in the system Wisdom++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.
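
The idea of XML elements that carry both layout geometry and a semantic label can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the block data, the element and attribute names (`page`, `block`, `label`), and the hand-written rules in `label_block` are hypothetical and are not the Wisdom++ document format or its rule base. In the approach described above such rules would be extracted by machine learning rather than written by hand.

```python
import xml.etree.ElementTree as ET

PAGE_HEIGHT = 800  # hypothetical page height in pixels

# Hypothetical layout blocks, such as a segmentation and OCR step might yield.
BLOCKS = [
    {"x": 80, "y": 40, "w": 440, "h": 30, "text": "Annual Report 1998"},
    {"x": 80, "y": 100, "w": 440, "h": 300, "text": "The committee met in March."},
    {"x": 80, "y": 760, "w": 200, "h": 20, "text": "Page 1"},
]


def label_block(block):
    """Toy hand-written rules mapping layout features to a logical label."""
    if block["y"] < 0.1 * PAGE_HEIGHT and block["h"] <= 40:
        return "title"  # short block near the top of the page
    if block["y"] > 0.9 * PAGE_HEIGHT:
        return "page_number"  # block near the bottom of the page
    return "body"


def to_xml(blocks):
    """Build an XML tree whose elements carry geometry and a semantic label."""
    page = ET.Element("page", height=str(PAGE_HEIGHT))
    for b in blocks:
        el = ET.SubElement(page, "block", label=label_block(b))
        for key in ("x", "y", "w", "h"):
            el.set(key, str(b[key]))
        el.text = b["text"]
    return page


def index_by_label(page):
    """Group block texts by semantic label, giving a simple semantic index."""
    index = {}
    for el in page.iter("block"):
        index.setdefault(el.get("label"), []).append(el.text)
    return index


page = to_xml(BLOCKS)
index = index_by_label(page)
print(ET.tostring(page, encoding="unicode"))
print(index)
```

Because the geometry is kept as attributes next to the label, a stylesheet could in principle position each block to approximate the original layout, while a retrieval component queries only the labels.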

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi, via Orabona, 4, 70126, Bari, Italy
Donato Malerba, Michelangelo Ceci & Margherita Berardi

Editor information

Editors and Affiliations

Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Vladimír Mařík
Johannes Kepler University Linz, Altenberger Str. 69, 4040, Linz, Austria
Werner Retschitzegger
Faculty of Electrical Engineering, The Gerstner Laboratory, Czech Technical University in Prague, Technická 2, 166 27, Prague 6, Czech Republic
Olga Štěpánková

Cite this paper

Malerba, D., Ceci, M., Berardi, M. (2003). XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_26

DOI: https://doi.org/10.1007/978-3-540-45227-0_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40806-2
Online ISBN: 978-3-540-45227-0
eBook Packages: Springer Book Archive