Abstract
Kernel methods are a promising approach to learning with tree-structured data, and various efficient tree kernels have been proposed to capture informative structures in trees. In this paper, we propose a new tree kernel function based on “subpath sets” that captures vertical structures in rooted unordered trees, since such tree structures are often used to encode hierarchical information in data. We also propose a simple and efficient algorithm for computing the kernel by extending the multikey quicksort algorithm used for sorting strings. The algorithm runs in O((|T_1| + |T_2|) log(|T_1| + |T_2|)) time on average and uses O(|T_1| + |T_2|) space, where |T_1| and |T_2| are the numbers of nodes in the two trees T_1 and T_2. We apply the proposed kernel to two supervised classification tasks: XML classification in web mining and glycan classification in bioinformatics. The experimental results show that the predictive performance of the proposed kernel is competitive with that of the existing efficient tree kernel for unordered trees proposed by Vishwanathan et al. [1], and that the proposed kernel is also empirically faster than the existing kernel.
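To make the idea of counting shared vertical structures concrete, the following is a minimal sketch of a naive subpath-counting kernel. It is an illustration under stated assumptions, not the algorithm proposed in the paper: it assumes that a "subpath" is the label sequence along any downward path in the tree, that the kernel sums the products of subpath counts in the two trees, and that longer subpaths may be down-weighted by a hypothetical `decay` parameter. The `(label, children)` tree representation and the function names are likewise illustrative. This version enumerates subpaths explicitly, so its cost grows with tree depth and it does not achieve the average-case bound stated above, which relies on the multikey quicksort extension.

```python
from collections import Counter


def subpath_counts(tree):
    """Count every vertical subpath of a tree.

    A tree is a (label, [children]) pair (an assumed representation).
    A subpath is taken to be the tuple of labels along a downward path.
    """
    counts = Counter()

    def visit(node, ancestors):
        label, children = node
        path = ancestors + (label,)
        # Every suffix of the root-to-node path is a subpath ending here.
        for start in range(len(path)):
            counts[path[start:]] += 1
        for child in children:
            visit(child, path)

    visit(tree, ())
    return counts


def subpath_kernel(t1, t2, decay=1.0):
    """Naive kernel: sum over shared subpaths of decay^length * count1 * count2.

    The decay weighting is an assumption made for this sketch.
    """
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(decay ** len(p) * n * c2[p] for p, n in c1.items() if p in c2)


# Two small labeled trees; child order is irrelevant to the result.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", [])])
print(subpath_kernel(t1, t2))  # 3.0: shared subpaths are (a), (b), (a, b)
```

Because only subpath counts enter the sum, permuting the children of any node leaves the value unchanged, which is the property needed for unordered trees.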
References
Vishwanathan, S.V.N., Smola, A.: Fast kernels for string and tree matching. In: Advances in Neural Information Processing Systems, vol. 15, pp. 569–576 (2003)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge (1999)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
Haussler, D.: Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz (1999)
Collins, M., Duffy, N.: Convolution kernels for natural language. In: Proceedings of the Fourteenth Annual Conference on Neural Information Processing Systems, pp. 625–632 (2001)
Kashima, H., Koyanagi, T.: Kernels for semi-structured data. In: Proceedings of the Nineteenth International Conference on Machine Learning, pp. 291–298 (2002)
Kuboyama, T., Hirata, K., Aoki-Kinoshita, K.F., Kashima, H., Yasuda, H.: A gram distribution kernel applied to glycan classification and motif extraction. In: Proceedings of the Seventeenth International Conference on Genome Informatics, pp. 25–34 (2006)
Aiolli, F., Martino, G.D.S., Sperduti, A.: Route kernels for trees. In: Proceedings of the Twenty-sixth International Conference on Machine Learning, pp. 17–24 (2009)
Daumé III, H., Marcu, D.: A tree-position kernel for document compression. In: Proceedings of the Fourth Document Understanding Conference (2004)
Kashima, H.: Machine Learning Approaches for Structured-data. PhD thesis, Kyoto University (2007)
Ichikawa, H., Hakoda, K., Hashimoto, T., Tokunaga, T.: Efficient sentence retrieval based on syntactic structure. In: Proceedings of the COLING/ACL, pp. 407–411 (2006)
Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 360–369 (1997)
Teo, C.H., Vishwanathan, S.V.N.: Fast and space efficient string kernels using suffix arrays. In: Proceedings of the Twenty-third International Conference on Machine Learning, pp. 929–936 (2006)
Shibuya, T.: Constructing the suffix tree of a tree with a large alphabet. IEICE Transactions on Fundamentals of Electronics 86(5), 1061–1066 (2003)
Kailing, K., Kriegel, H.P., Schönauer, S., Seidl, T.: Efficient similarity search for hierarchical data in large databases. In: Hwang, J., Christodoulakis, S., Plexousakis, D., Christophides, V., Koubarakis, M., Böhm, K. (eds.) EDBT 2004. LNCS, vol. 2992, pp. 676–693. Springer, Heidelberg (2004)
Teo, C.H., Vishwanathan, S.V.N.: SASK: suffix arrays based string kernels (2006), http://users.cecs.anu.edu.au/~chteo/SASK.html
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines (2001), http://www.csie.ntu.edu.tw/~cjlin/libsvm
Zaki, M.J., Aggarwal, C.C.: XRules: An effective structural classifier for XML data. Machine Learning Journal 62(1-2), 137–170 (2006)
Hashimoto, K., Hamajima, M., Goto, S., Masumoto, S., Kawashima, M., Kanehisa, M.: Glycan: The database of carbohydrate structures. Genome Informatics 14, 649–650 (2003)
Doubet, S., Albersheim, P.: Carbbank. Glycobiology 2(6), 505 (1992)
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. Journal of Machine Learning Research 2, 419–444 (2002)
Leslie, C., Eskin, E., Noble, W.: The spectrum kernel: A string kernel for SVM protein classification. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 566–575 (2002)
Leslie, C., Eskin, E., Weston, J., Noble, W.S.: Mismatch string kernels for SVM protein classification. Neural Information Processing Systems 15, 1441–1448 (2003)
Kashima, H., Tsuda, K., Inokuchi, A.: Marginalized kernels between labeled graphs. In: Proceedings of the Twentieth International Conference on Machine Learning, pp. 321–328 (2003)
Gärtner, T., Flach, P., Wrobel, S.: On graph kernels: Hardness results and efficient alternatives. In: Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, pp. 129–143 (2003)
Washio, T., Motoda, H.: State of the art of graph-based data mining. SIGKDD Explorations 5(1), 59–68 (2003)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kimura, D., Kuboyama, T., Shibuya, T., Kashima, H. (2011). A Subpath Kernel for Rooted Unordered Trees. In: Huang, J.Z., Cao, L., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2011. Lecture Notes in Computer Science, vol 6634. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20841-6_6
DOI: https://doi.org/10.1007/978-3-642-20841-6_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20840-9
Online ISBN: 978-3-642-20841-6
eBook Packages: Computer Science, Computer Science (R0)