Mining rooted ordered trees under subtree homeomorphism

Data Mining and Knowledge Discovery

Abstract

Mining frequent tree patterns has many applications in areas such as XML data, bioinformatics, and the World Wide Web. The crucial step in frequent pattern mining is frequency counting, which involves a matching operator to find occurrences (instances) of a tree pattern in a given collection of trees. A widely used matching operator for tree-structured data is subtree homeomorphism, where an edge in the tree pattern is mapped onto an ancestor-descendant relationship in the given tree. Tree patterns that are frequent under subtree homeomorphism are usually called embedded patterns. In this paper, we present an efficient algorithm for subtree homeomorphism with application to frequent pattern mining. We propose a compact data structure, called occ, which stores only information about the rightmost paths of occurrences and hence can encode and represent several occurrences of a tree pattern. We then define efficient join operations on the occ data structure, which help us count occurrences of tree patterns according to occurrences of their proper subtrees. Based on the proposed subtree homeomorphism method, we develop an effective pattern mining algorithm, called TPMiner. We evaluate the efficiency of TPMiner on several real-world and synthetic datasets. Our extensive experiments confirm that TPMiner always outperforms well-known existing algorithms, and in several cases the improvement with respect to existing algorithms is significant.
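To make the matching operator concrete, the following is a minimal brute-force sketch of occurrence counting under subtree homeomorphism for rooted ordered trees. It is not the occ data structure or TPMiner; it only illustrates the definition. It assumes the common formulation in which labels are preserved, a pattern edge maps to an ancestor-descendant pair, and pattern siblings map, left to right, to nodes with disjoint subtrees. The tree encoding `(label, [children])` and the names `index_tree` and `count_embeddings` are hypothetical and chosen for this example.

```python
def index_tree(tree):
    """Flatten a (label, [children]) tree into preorder.

    Returns a list whose entry i is [label, scope_end]: the descendants
    of node i are exactly the preorder positions i+1 .. scope_end.
    """
    nodes = []

    def visit(t):
        i = len(nodes)
        nodes.append([t[0], i])
        for child in t[1]:
            visit(child)
        nodes[i][1] = len(nodes) - 1  # last preorder position inside this subtree

    visit(tree)
    return nodes


def count_embeddings(pattern, tree):
    """Count occurrences of `pattern` in `tree` under subtree homeomorphism."""
    nodes = index_tree(tree)

    def embed(p, v):
        # Ways to embed pattern subtree p with its root mapped to tree node v.
        if p[0] != nodes[v][0]:
            return 0
        return place(p[1], v + 1, nodes[v][1])

    def place(children, lo, hi):
        # Ways to map the ordered pattern children into positions lo..hi,
        # each child strictly to the right of the previous child's subtree.
        if not children:
            return 1
        total = 0
        for v in range(lo, hi + 1):
            ways = embed(children[0], v)
            if ways:
                total += ways * place(children[1:], nodes[v][1] + 1, hi)
        return total

    return sum(embed(pattern, v) for v in range(len(nodes)))


# Example data tree: A has children B and C; B has a child C.
data_tree = ("A", [("B", [("C", [])]), ("C", [])])

print(count_embeddings(("A", [("C", [])]), data_tree))             # 2
print(count_embeddings(("A", [("B", []), ("C", [])]), data_tree))  # 1
print(count_embeddings(("A", [("C", []), ("B", [])]), data_tree))  # 0
```

In the first query, one of the two occurrences maps the pattern edge A-C onto the path A-B-C, which is an embedded occurrence but not an induced one. This enumeration revisits the same tree nodes many times, which is the kind of cost a compact occurrence encoding is meant to avoid.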

Notes

  1. The upper bound of the scope of the last vertex is already available in scope; for convenience of presentation, the information is duplicated in RP.

Acknowledgments

We are grateful to Professor Mohammed Javeed Zaki for providing the VTreeMiner code, the CSLOGS datasets and the TreeGenerator program, to Dr Henry Tan for providing the MB3Miner code, to Dr Fedja Hadzic for providing the Prions dataset and to Professor Jun-Hong Cui for providing the NASA dataset. Finally, we would like to thank Dr Morteza Haghir Chehreghani for his discussion and suggestions.

Author information

Corresponding author

Correspondence to Mostafa Haghir Chehreghani.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge and Concha Bielza.

About this article

Cite this article

Haghir Chehreghani, M., Bruynooghe, M. Mining rooted ordered trees under subtree homeomorphism. Data Min Knowl Disc 30, 1249–1272 (2016). https://doi.org/10.1007/s10618-015-0439-5

  • DOI: https://doi.org/10.1007/s10618-015-0439-5
