Discovery of Useful Patterns from Tree-Structured Documents with Label-Projected Database

Paik, Juryon; Nam, Junghyun; Youn, Hee Yong; Kim, Ung Mo

doi:10.1007/978-3-540-69295-9_22

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 5060)

Included in the following conference series:

International Conference on Autonomic and Trusted Computing

Abstract

Because of its highly flexible tree structure, XML can capture most kinds of data and provides a substrate in which almost any other data structure can be represented. With the continuous growth of XML tree data in electronic environments, discovering useful knowledge from such data has become a major research area in the information retrieval community. The most common approach to this task is to extract frequently occurring subtree patterns from a set of trees. However, because the number of frequent subtrees grows exponentially with the size of the trees, a more practical and scalable alternative is required: the discovery of maximal frequent subtrees. Maximal frequent subtrees hold all the useful information, yet there are far fewer of them than frequent subtrees. Handling maximal frequent subtrees is an interesting challenge and is the core of this paper. As far as we know, this is one of the first studies to discover maximal frequent subtrees directly, without generating any candidate sets and without a separate step for pruning useless subtrees. To this end, we define and use a new type of projected database to represent XML tree data efficiently. It significantly improves the entire process of mining maximal frequent subtree patterns. We study the performance and scalability of the proposed approach through experiments on synthetic datasets.
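
To make the distinction between frequent and maximal frequent subtrees concrete, the following is a minimal sketch and not the paper's algorithm. It rests on several assumptions that do not come from the abstract: each tree is reduced to its set of root-to-node label paths (so sibling order and repeated sibling labels are ignored), all trees share one root label, and the names `paths`, `frequent_patterns`, `maximal`, and the toy `TREES` data are purely illustrative. It deliberately enumerates candidates by brute force, which is exactly the cost the paper aims to avoid, only to show how few of the frequent patterns are maximal.

```python
from itertools import combinations

# Toy XML-like trees as (label, [children]); hypothetical example data.
TREES = [
    ("book", [("title", []), ("author", [("name", [])]), ("year", [])]),
    ("book", [("title", []), ("author", [("name", [])]), ("price", [])]),
    ("book", [("title", []), ("year", [])]),
]

def paths(tree, prefix=()):
    """Return the set of root-to-node label paths of a tree (simplified model)."""
    label, children = tree
    here = prefix + (label,)
    out = {here}
    for child in children:
        out |= paths(child, here)
    return out

def is_prefix_closed(pattern):
    """A pattern is a valid rooted subtree if every path's parent path is present."""
    return all(len(p) == 1 or p[:-1] in pattern for p in pattern)

def frequent_patterns(trees, min_support):
    """Brute-force enumeration: keep patterns contained in >= min_support trees."""
    path_sets = [paths(t) for t in trees]
    universe = sorted(set().union(*path_sets))
    found = []
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            pattern = frozenset(combo)
            if not is_prefix_closed(pattern):
                continue
            support = sum(pattern <= ps for ps in path_sets)
            if support >= min_support:
                found.append(pattern)
    return found

def maximal(patterns):
    """Keep only patterns that have no proper frequent superset."""
    return [p for p in patterns if not any(p < q for q in patterns)]

frequent = frequent_patterns(TREES, min_support=2)
maximal_frequent = maximal(frequent)
print(len(frequent))          # 8 frequent patterns
print(len(maximal_frequent))  # 2 maximal frequent patterns
```

In this toy run every one of the eight frequent patterns is contained in one of the two maximal ones (book with title and author/name, or book with title and year), which is the sense in which the maximal patterns retain the useful information while being far fewer.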

This work was supported in part by the Ubiquitous Autonomic Computing and Network Project, 21st Century Frontier R&D Program, and by the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Advancement) (IITA-2008-C1090-0801-0028), both funded by the MKE (Ministry of Knowledge Economy), Korea.

Author information

Authors and Affiliations

Dept. of Computer Engineering, Sungkyunkwan University, Republic of Korea
Juryon Paik, Hee Yong Youn & Ung Mo Kim
Dept. of Computer Science, Konkuk University, Republic of Korea
Junghyun Nam

Editor information

Chunming Rong, Martin Gilje Jaatun, Frode Eika Sandnes, Laurence T. Yang, Jianhua Ma

About this paper

Cite this paper

Paik, J., Nam, J., Youn, H.Y., Kim, U.M. (2008). Discovery of Useful Patterns from Tree-Structured Documents with Label-Projected Database. In: Rong, C., Jaatun, M.G., Sandnes, F.E., Yang, L.T., Ma, J. (eds) Autonomic and Trusted Computing. ATC 2008. Lecture Notes in Computer Science, vol 5060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-69295-9_22

DOI: https://doi.org/10.1007/978-3-540-69295-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-69294-2
Online ISBN: 978-3-540-69295-9
eBook Packages: Computer Science, Computer Science (R0)
