CLS and CLS Close: The Scalable Method for Mining the Semi Structured Data Set
A semistructured pattern can be formally modeled as a graph pattern. The most important problem in mining large semistructured datasets is the scalability of the method. With the successful development of efficient and scalable algorithms for mining frequent itemsets and sequences, it is natural to extend the scope of study to a more general pattern mining problem: mining frequent semistructured patterns, or graph patterns. In this paper, we extend the pattern-growth methodology and develop a novel algorithm called CLS (Canonical Labeling System), which discovers frequent connected subgraphs efficiently using either a depth-first or a breadth-first search strategy.
A novel canonical labeling system and search order are devised to support efficient pattern growth. CLS is simpler and more efficient than other methods because it combines pattern growing and pattern checking into one procedure. Building on CLS, we develop CLS Close, which mines closed frequent graphs; this not only eliminates redundant patterns but also substantially increases mining efficiency, especially in the presence of large graph patterns.
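To illustrate what a canonical label provides, the following is a minimal sketch and not the CLS code defined in the paper. It assumes small undirected graphs with string labels on vertices and edges, and it uses a brute-force search over all vertex orderings, which is only feasible for very small graphs. The function name `canonical_label` and the example graphs are hypothetical. The point it shows is that two isomorphic graphs receive the same label, so a miner can detect that a pattern has already been generated by comparing labels instead of running a separate isomorphism test.

```python
from itertools import permutations


def canonical_label(vertex_labels, edges):
    """Return the lexicographically smallest code over all vertex orderings.

    vertex_labels: list of string labels, indexed by vertex id 0..n-1
    edges: iterable of (u, v, edge_label) tuples for an undirected graph

    Brute force (n! orderings); an illustration only, not the paper's CLS code.
    """
    n = len(vertex_labels)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new position assigned to original vertex `old`.
        labels = [None] * n
        for old, new in enumerate(perm):
            labels[new] = vertex_labels[old]
        # Rewrite every edge with the new vertex ids, smaller endpoint first.
        renamed = sorted(
            (min(perm[u], perm[v]), max(perm[u], perm[v]), lab)
            for u, v, lab in edges
        )
        code = (tuple(labels), tuple(renamed))
        if best is None or code < best:
            best = code
    return best


# Hypothetical example graphs: g1 and g2 are the same structure with
# different vertex numbering; g3 swaps the edge labels and is different.
g1 = (["C", "C", "O"], [(0, 1, "s"), (1, 2, "d")])
g2 = (["O", "C", "C"], [(0, 1, "d"), (1, 2, "s")])
g3 = (["C", "C", "O"], [(0, 1, "d"), (1, 2, "s")])

same = canonical_label(*g1) == canonical_label(*g2)
different = canonical_label(*g1) != canonical_label(*g3)
```

A practical system would avoid enumerating all orderings; the abstract indicates that CLS does so by pairing its labeling with a search order so that pattern growing and pattern checking happen in one procedure.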
Keywords: frequent pattern, closed pattern, graph mining, CLS code, canonical label