Abstract
Effective daily processing of large volumes of paper documents in office environments requires semantic-based indexing techniques to be applied while the documents are transformed into electronic format. A combination of XML and knowledge technologies can be used for this purpose. XML distinguishes between data, its structure and its semantics, allowing the exchange of data elements that carry descriptions of their meaning, usage and relationships. Moreover, combining XML with XSLT enables any browser to render the original layout structure of the paper documents accurately. However, transforming paper documents into XML format effectively is a complex process involving several steps. In this paper we propose applying knowledge technologies to many of these document processing steps, namely rule-based systems for the semantic indexing of documents and machine learning techniques for extracting the necessary knowledge. This approach has been implemented in the system Wisdom++, which is currently used in the European project COLLATE (Collaboratory for Annotation, Indexing and Retrieval of Digitized Historical Archive Material) to provide film archivists with a tool for the automated annotation of historical documents in film archives.
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Malerba, D., Ceci, M., Berardi, M. (2003). XML and Knowledge Technologies for Semantic-Based Indexing of Paper Documents. In: Mařík, V., Retschitzegger, W., Štěpánková, O. (eds) Database and Expert Systems Applications. DEXA 2003. Lecture Notes in Computer Science, vol 2736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45227-0_26
DOI: https://doi.org/10.1007/978-3-540-45227-0_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40806-2
Online ISBN: 978-3-540-45227-0
eBook Packages: Springer Book Archive