Abstract
Document image classification is an important step in document image analysis. Based on classification results, we can tackle other tasks such as indexing, understanding, or navigation in document collections. Using a document representation and an unsupervised classification method, we can group documents into clusters that are valid from the user's point of view. However, the semantic gap between a domain-independent document representation and the user's implicit representation can lead to unsatisfactory results. In this paper we describe document images based on frequently occurring symbols. This document description is created in an unsupervised manner and can be related to the domain knowledge. Using data mining techniques applied to a graph-based document representation, we find frequent and maximal subgraphs. For each document image, we construct a bag containing the frequent subgraphs found in it. This bag of “symbols” represents the description of a document. We present results obtained on a corpus of graphical document images.
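The bag-of-symbols idea can be illustrated with a minimal sketch. It assumes the subgraph mining step has already run and that each document is given as a list of pattern labels; the labels, the `min_support` threshold, the function names, and the cosine similarity measure are all hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import Counter
from math import sqrt

def frequent_patterns(docs, min_support):
    """Return the patterns that occur in at least `min_support` documents."""
    support = Counter()
    for patterns in docs.values():
        support.update(set(patterns))  # count each pattern once per document
    return {p for p, n in support.items() if n >= min_support}

def bag_of_symbols(patterns, vocabulary):
    """Count the occurrences of frequent patterns within one document."""
    return Counter(p for p in patterns if p in vocabulary)

def cosine(a, b):
    """Cosine similarity between two bags; 0.0 if either bag is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical pattern labels standing in for mined subgraphs.
docs = {
    "doc1": ["resistor", "resistor", "capacitor", "logo"],
    "doc2": ["resistor", "capacitor", "capacitor"],
    "doc3": ["door", "window", "window"],
    "doc4": ["door", "window", "stamp"],
}

vocabulary = frequent_patterns(docs, min_support=2)
bags = {name: bag_of_symbols(p, vocabulary) for name, p in docs.items()}

print(sorted(vocabulary))
print(cosine(bags["doc1"], bags["doc2"]))  # documents sharing symbols
print(cosine(bags["doc1"], bags["doc3"]))  # documents sharing no symbols
```

Patterns seen in only one document ("logo", "stamp") fall below the support threshold and are left out of every bag. The resulting pairwise similarities could then be passed to any unsupervised clustering method.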
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Barbu, E., Héroux, P., Adam, S., Trupin, E. (2005). Clustering Document Images Using Graph Summaries. In: Perner, P., Imiya, A. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2005. Lecture Notes in Computer Science, vol 3587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11510888_20
DOI: https://doi.org/10.1007/11510888_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26923-6
Online ISBN: 978-3-540-31891-0
eBook Packages: Computer Science (R0)