Data Mining and Knowledge Discovery, Volume 25, Issue 2, pp 325–357

Fast support vector machines for convolution tree kernels

  • Aliaksei Severyn
  • Alessandro Moschitti


Feature engineering is one of the most complex aspects of system design in machine learning. Fortunately, kernel methods provide the designer with formidable tools to tackle such complexity. Among others, tree kernels (TKs) have been successfully applied to represent structured data in diverse domains, ranging from bioinformatics and data mining to natural language processing. One drawback of such methods is that learning with them typically requires a number of kernel computations that is quadratic in the number of training examples. In practice, however, substructures often repeat in the data, which makes it possible to avoid a large number of redundant kernel evaluations. In this paper, we propose the use of Directed Acyclic Graphs (DAGs) to compactly represent trees in the training algorithm of Support Vector Machines. In particular, at each iteration of the cutting plane algorithm (CPA) we use a DAG to encode the model, which is composed of a set of trees. This enables DAG kernels to efficiently evaluate TKs between the current model and a given training tree. Consequently, the total amount of computation is reduced by avoiding redundant evaluations over shared substructures. We provide theory and algorithms to formally characterize the above idea, which we tested on several datasets. The empirical results confirm the benefits of the approach in terms of significant speedups over previous state-of-the-art methods. In addition, we propose an alternative sampling strategy within the CPA to address the class-imbalance problem, which, coupled with fast learning methods, provides a viable TK learning framework for a large class of real-world applications.
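To illustrate why shared substructures make redundant evaluations avoidable, the following is a minimal sketch and not the algorithm of the paper. It assumes a Collins-Duffy style subset tree kernel with decay `lam`, trees written as nested tuples, and a model given as a list of (weight, tree) pairs. The DAG is represented only implicitly, as a table mapping each distinct rooted subtree to its accumulated weight. All function names (`delta`, `build_dag`, `dag_score`, and so on) are hypothetical and chosen for this example.

```python
from collections import defaultdict

# A tree is a nested tuple (label, child, ...); a leaf is a 1-tuple (label,).

def nodes(tree):
    """Yield every node of a tree, each identified with the subtree it roots."""
    yield tree
    for child in tree[1:]:
        yield from nodes(child)

def production(node):
    """Label of a node followed by the labels of its children."""
    return (node[0],) + tuple(child[0] for child in node[1:])

def delta(n1, n2, lam, cache):
    """Weighted count of common fragments rooted at n1 and n2 (memoized)."""
    key = (n1, n2)
    if key not in cache:
        if len(n1) == 1 or production(n1) != production(n2):
            # Leaves root no fragment; different productions share none.
            cache[key] = 0.0
        else:
            value = lam
            for c1, c2 in zip(n1[1:], n2[1:]):
                value *= 1.0 + delta(c1, c2, lam, cache)
            cache[key] = value
    return cache[key]

def tree_kernel(t1, t2, lam=1.0):
    """Tree kernel as the sum of delta over all node pairs."""
    cache = {}
    return sum(delta(a, b, lam, cache) for a in nodes(t1) for b in nodes(t2))

def naive_score(model, tree, lam=1.0):
    """Evaluate the model tree by tree: one kernel call per model tree."""
    return sum(w * tree_kernel(t, tree, lam) for w, t in model)

def build_dag(model):
    """Merge identical subtrees across the model, accumulating their weights."""
    weights = defaultdict(float)
    for w, t in model:
        for n in nodes(t):
            weights[n] += w
    return weights

def dag_score(dag, tree, lam=1.0):
    """Evaluate the model once per distinct subtree instead of once per copy."""
    cache = {}
    tree_nodes = list(nodes(tree))
    return sum(w * delta(n, m, lam, cache)
               for n, w in dag.items() for m in tree_nodes)

# Two model trees that share the subtree rooted at NP.
t1 = ("S", ("NP", ("D", ("the",)), ("N", ("dog",))), ("VP", ("V", ("barks",))))
t2 = ("S", ("NP", ("D", ("the",)), ("N", ("dog",))), ("VP", ("V", ("sleeps",))))
query = ("S", ("NP", ("D", ("the",)), ("N", ("cat",))), ("VP", ("V", ("barks",))))

model = [(0.5, t1), (-0.25, t2)]
dag = build_dag(model)
print(len(dag), naive_score(model, query), dag_score(dag, query))
```

Because `delta` depends only on the two subtrees it compares, merging identical subtrees and summing their weights leaves the score unchanged while reducing the number of node pairs to visit. In this toy model the shared NP subtree is stored once, so the table holds fewer entries than the two trees have nodes in total.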


Kernel methods · Tree kernels · Natural language processing · Large scale learning





Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Trento, Povo, Italy
