Abstract
This paper studies the problem of classifying web documents. Documents are represented with a graph-based model rather than the traditional vector-based one. A novel classification algorithm is proposed that uses two types of structural patterns (subgraphs): contrast and common. The approach is closely related to the classical emerging-pattern techniques known from decision tables. The method is evaluated for classification accuracy on three benchmark collections of web documents. The results show that it can outperform existing algorithms (based on vector, graph, and hybrid document representations) in terms of accuracy and document model complexity. A further advantage is that the classifier has a simple, understandable structure and can easily be extended with expert knowledge.
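To make the notions of contrast and common patterns concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes a deliberately simplified setting: each document becomes a set of directed edges between adjacent words, and every pattern is a single edge rather than a general subgraph. The function names, the support-based scoring rule, and the toy training data are all hypothetical.

```python
from collections import defaultdict

def doc_to_edges(text):
    """Represent a document as a set of directed edges between adjacent words."""
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}

def support(pattern, graphs):
    """Fraction of graphs that contain the pattern (here, a single edge)."""
    return sum(pattern in g for g in graphs) / len(graphs)

def mine_patterns(graphs_by_class):
    """Split single-edge patterns into contrast and common ones.

    Assumption: a contrast pattern occurs in exactly one class,
    a common pattern occurs in more than one class.
    """
    contrast = defaultdict(dict)
    common = defaultdict(dict)
    all_edges = set().union(*(g for gs in graphs_by_class.values() for g in gs))
    for edge in all_edges:
        sup = {c: support(edge, gs) for c, gs in graphs_by_class.items()}
        present = [c for c, s in sup.items() if s > 0]
        target = contrast if len(present) == 1 else common
        for c in present:
            target[c][edge] = sup[c]
    return contrast, common

def classify(text, contrast, common, classes):
    """Score each class by summing the supports of its matching patterns."""
    edges = doc_to_edges(text)
    scores = {}
    for c in classes:
        scores[c] = (
            sum(s for e, s in contrast[c].items() if e in edges)
            + sum(s for e, s in common[c].items() if e in edges)
        )
    return max(scores, key=scores.get), scores

# Toy training data (hypothetical).
train = {
    "sports": ["team wins match read more", "team loses match read more"],
    "tech": ["new phone release read more", "new laptop release read more"],
}
graphs_by_class = {c: [doc_to_edges(d) for d in docs] for c, docs in train.items()}
contrast, common = mine_patterns(graphs_by_class)
label, scores = classify("team wins match read more today", contrast, common, list(train))
```

In this toy setting the edge `("read", "more")` appears in both classes and is therefore treated as common, while edges such as `("team", "wins")` occur in only one class and act as contrast patterns. A real graph-based classifier would mine larger subgraphs and need subgraph isomorphism tests, which this sketch sidesteps.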
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dominik, A., Walczak, Z., Wojciechowski, J. (2007). Classification of Web Documents Using a Graph-Based Model and Structural Patterns. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds) Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, vol 4702. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74976-9_10
DOI: https://doi.org/10.1007/978-3-540-74976-9_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74975-2
Online ISBN: 978-3-540-74976-9
eBook Packages: Computer Science, Computer Science (R0)