A Simple Yet Efficient Approach for Maximal Frequent Subtrees Extraction from a Collection of XML Documents

Paik, Juryon; Kim, Ung Mo

doi:10.1007/11906070_9


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4256)

Included in the following conference series:

International Conference on Web Information Systems Engineering


Abstract

XML is penetrating virtually all areas of computer science and information technology and is enabling an unprecedented level of data exchange among heterogeneous data storage systems. With the continuous growth of online information stored, presented, and exchanged using XML, discovering useful information from a collection of XML documents has become one of the main research areas in the data mining community. The most commonly used approach to this task is to extract frequently occurring subtree patterns from trees. However, the number of frequent subtrees usually grows exponentially with tree size, so mining all frequent subtrees becomes infeasible for large trees. A more practical and scalable approach is to use maximal frequent subtrees, which are far fewer in number than frequent subtrees. Handling maximal frequent subtrees is an interesting challenge and is the core of this paper. We present a novel, conceptually simple, yet effective approach that discovers maximal frequent subtrees from a database of XML trees without generating candidate subtrees. The benefit of our approach is that it not only significantly reduces the number of rounds of infrequent-tree pruning, but also eliminates the candidate-generation rounds entirely by avoiding time-consuming tree join operations or tree enumerations.
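
To make the distinction between frequent and maximal frequent subtrees concrete, the following toy sketch may help. It is not the algorithm of this paper: it enumerates candidates by brute force, which is exactly what the paper sets out to avoid. It also assumes, for simplicity, that sibling tags are unique and that every pattern is anchored at the document root, so a pattern can be represented as a prefix-closed set of root-to-node tag paths. The function names, the sample documents, and the support threshold are all hypothetical.

```python
from itertools import combinations
import xml.etree.ElementTree as ET


def paths(xml_text):
    """Return the set of root-to-node tag paths of one XML document."""
    out = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        out.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return frozenset(out)


def is_prefix_closed(pattern):
    """A path set is a tree only if every non-root path has its parent path."""
    return all(len(p) == 1 or p[:-1] in pattern for p in pattern)


def frequent_patterns(docs, min_support):
    """Brute force (exponential): test every prefix-closed subset of paths."""
    universe = sorted(set().union(*docs))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            pattern = frozenset(combo)
            if not is_prefix_closed(pattern):
                continue
            # Support = number of documents that contain the whole pattern.
            support = sum(pattern <= doc for doc in docs)
            if support >= min_support:
                found.append(pattern)
    return found


def maximal(patterns):
    """Keep only patterns that no other frequent pattern properly contains."""
    return [p for p in patterns if not any(p < q for q in patterns)]


# Hypothetical sample collection of three small XML documents.
docs = [
    paths("<book><title/><author/><year/></book>"),
    paths("<book><title/><author/><price/></book>"),
    paths("<book><title/><year/></book>"),
]

frequent = frequent_patterns(docs, min_support=2)
maximal_only = maximal(frequent)

print(len(frequent), "frequent patterns,", len(maximal_only), "maximal")
for pattern in maximal_only:
    print(sorted("/".join(p) for p in pattern))
```

Even on this tiny collection the maximal set is smaller than the full frequent set, and every frequent pattern is contained in some maximal one, which is why reporting only the maximal patterns loses no information about which structures are frequent.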

This work was supported in part by the Ubiquitous Autonomic Computing and Network Project, 21st Century Frontier R&D Program and by the university IT Research Center project (ITRC), funded by the Korean Ministry of Information and Communication.



Author information

Authors and Affiliations

Department of Computer Engineering, Sungkyunkwan University, 300 Chunchun-dong, Jangan-gu, Gyeonggi-do, 440-746, Suwon, Republic of Korea
Juryon Paik & Ung Mo Kim


Editor information

Editors and Affiliations

Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng
Northeastern University, 110004, Shenyang, Liaoning, China
Guoren Wang
State Key Lab of Software Engineering, Wuhan University, 430072, Wuhan, P.R. China
Cheng Zeng
School of Information Management, Wuhan University, 430072, Wuhan, China
Ruhua Huang


About this paper

Cite this paper

Paik, J., Kim, U.M. (2006). A Simple Yet Efficient Approach for Maximal Frequent Subtrees Extraction from a Collection of XML Documents. In: Feng, L., Wang, G., Zeng, C., Huang, R. (eds) Web Information Systems – WISE 2006 Workshops. WISE 2006. Lecture Notes in Computer Science, vol 4256. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11906070_9


DOI: https://doi.org/10.1007/11906070_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-47663-4
Online ISBN: 978-3-540-47664-1
eBook Packages: Computer Science, Computer Science (R0)
