1 Introduction

Kernel methods are a family of algorithms from statistical machine learning (61; 67). These include the Support Vector Machine (SVM) for regression and classification, as well as methods for principal component analysis (62), feature selection (72), clustering (94), two-sample tests (7; 19), and dimensionality reduction (93). Kernel methods have witnessed a huge surge in popularity in bioinformatics over the last decade. To illustrate this popularity: PubMed, the search engine for biomedical literature, lists 1,710 hits for ‘kernel methods’ and 1,798 hits for ‘SVM’ (as of May 28, 2009).

The goal of this article is to review which problems in bioinformatics have been tackled using kernel methods, and to explain their popularity in this field. Section 2 provides a summary of the central terminology of kernel methods. Section 3 describes how kernels can be used for data integration. Section 4 illustrates the power of kernel methods in dealing with structured objects such as strings or graphs. Section 5 presents an overview of applications of Support Vector Machines in bioinformatics, and Sect. 6 reviews applications of kernel methods in bioinformatics beyond SVM-based classification or regression. The interested reader is referred to the primers on molecular biology and kernel methods in Schölkopf et al. (63), to an introduction to Support Vector Machines and kernel methods in computational biology (4), and to a primer on Support Vector Machines for biologists (49).

2 Terminology

A kernel function is an inner product between two objects x, x′ ∈ 𝒳 in a feature space ℋ:

$$k(x,x\prime ) =\langle \phi (x),\phi (x\prime )\rangle$$
(15.1)

where ϕ : 𝒳 → ℋ maps the data points from the input space 𝒳 to the feature space ℋ. k(x, x′) is referred to as the kernel value of x and x′. If this kernel function is applied to all pairs of objects from a set of objects, one obtains a matrix of kernel values, the kernel matrix K. K is always positive semi-definite, that is, all its eigenvalues are non-negative. Intuitively, a kernel function can be thought of as a similarity function between x and x′: k(x, x′) can be thought of as a similarity score, and the matrix K as a similarity matrix, that is, a matrix of similarity scores.

The idea underlying kernel methods is to map the original input data, on which statistical inference is to be performed, to a higher dimensional space, the so-called feature space, and to perform inference in this feature space. Naively, this procedure would comprise two steps: (1) mapping the data points to feature space via a mapping ϕ, (2) performing the prediction or computing the statistics of interest in this feature space. Kernel methods manage to perform this procedure in one single step: rather than separating mapping and prediction into two steps, inference is performed by evaluating kernel functions on the objects in input space. By means of these kernel functions, one implicitly solves the problem in feature space, but without explicitly computing the mapping ϕ. Hence any algorithm that solves a learning problem by accessing the data points only by means of kernel functions is a kernel method.
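To make these definitions concrete, the following minimal sketch (toy data, not taken from any of the cited works) computes a Gaussian RBF kernel matrix and verifies that it is positive semi-definite by inspecting its eigenvalues:

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """K[i, j] = k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.RandomState(0).randn(10, 5)     # 10 toy data points in 5 dimensions
K = rbf_kernel_matrix(X)                      # the kernel matrix K
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True: all eigenvalues non-negative
```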

3 Data Integration

One major reason for the popularity of kernel methods in bioinformatics is their power in data integration. This strength is due to the closure properties that kernels possess:

  1. k₁, k₂ are kernels ⇒ k = k₁ + k₂ is a kernel

  2. k₁, k₂ are kernels ⇒ k = k₁ ∗ k₂ is a kernel

  3. k₁ is a kernel, λ is a positive scalar ⇒ k = λ ∗ k₁ is a kernel

Hence kernels can easily be combined in linear combinations or products. For instance, to compare two proteins, one can define a kernel on their sequences and on their 3D structures and then combine these into a joint sequence–structure kernel for proteins (40).
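These closure rules can be checked numerically on kernel matrices, as in the following sketch; the product kernel (rule 2) corresponds to the element-wise product of the kernel matrices:

```python
import numpy as np

def is_psd(K, tol=1e-10):
    return np.linalg.eigvalsh(K).min() >= -tol

X = np.random.RandomState(0).randn(8, 3)
K1 = X @ X.T                                            # linear kernel
K2 = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # RBF kernel

print(is_psd(K1 + K2))   # rule 1: sum of two kernels
print(is_psd(K1 * K2))   # rule 2: element-wise product of two kernels
print(is_psd(0.5 * K1))  # rule 3: positive scaling of a kernel
```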

The goal of multiple kernel learning is to optimise the weights of a linear combination of kernels for a particular prediction task (34); a related technique is referred to as hyperkernels (50). Lack of runtime efficiency turned out to be a limitation of early approaches to multiple kernel learning and triggered further research addressing this problem (54; 75). In bioinformatics, this kernel learning technique was applied to protein function prediction by optimally combining kernels on genome-wide data sets (35), including amino acid sequences, hydropathy profiles, gene expression data and known protein-protein interactions. Tsuda et al. (84) present an efficient variant of multiple kernel learning for protein function prediction from multiple networks, such as physical interaction networks and metabolic networks.
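As a hedged illustration of the underlying idea, and not of the actual optimisation algorithms of (34; 54; 75), the following sketch selects the weight of a convex combination of two precomputed kernel matrices by cross-validation; K_seq, K_struct and the labels y are assumed to be given:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def best_combination_weight(K_seq, K_struct, y, grid=np.linspace(0, 1, 11)):
    """Naive search for alpha in K = alpha * K_seq + (1 - alpha) * K_struct."""
    scores = [cross_val_score(SVC(kernel='precomputed'),
                              alpha * K_seq + (1 - alpha) * K_struct,
                              y, cv=3).mean()
              for alpha in grid]
    return grid[int(np.argmax(scores))]
```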

4 Analysing Structured Data

A second advantage of kernel methods is that they can easily be applied to structured data (22), for instance graphs, sets, time series, and strings. The only requirement is that one can define a positive definite kernel on two structured objects which, intuitively speaking, quantifies the similarity between these two objects. As strings are abundant in bioinformatics as nucleotide and amino acid sequences, and biological networks steadily gain more attention, this applicability to structured data is another reason for the popularity of kernel methods in bioinformatics. In the following, we describe the basic concepts underlying string and graph kernels.

4.1 String Kernels

The classic kernel for measuring the similarity of two strings s and s′ from an alphabet Σ is the spectrum kernel (36) that counts common substrings of length n in the two strings:

$$k(s,s\prime ) ={\sum\nolimits}_{q\in {\Sigma}^{n}}\#(q \subseteq s)\,\#(q \subseteq s\prime ),$$
(15.2)

where #(q ⊆ s) is the frequency of substring q in string s, and | s | is the length of string s. The kernel can be computed in O( | s |  +  | s′ | ) time (89).
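A direct, quadratic-time sketch of the spectrum kernel is shown below; the O( | s |  +  | s′ | ) algorithm of (89) instead relies on suffix trees:

```python
from collections import Counter

def spectrum_kernel(s, t, n=3):
    """Sum over all n-mers q of #(q in s) * #(q in t)."""
    counts_s = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    counts_t = Counter(t[i:i + n] for i in range(len(t) - n + 1))
    return sum(c * counts_t[q] for q, c in counts_s.items())

print(spectrum_kernel("GATTACA", "ATTAC"))  # 3: shared 3-mers ATT, TTA, TAC
```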

As nucleotide and protein sequences are prone to mutations, insertions, deletions and other changes over time, the spectrum kernel was extended in several ways to allow for mismatches (37), and for substitutions, gaps and wildcards (38). Recently, the runtime of these string kernels with inexact matching was sped up significantly by Kuksa et al. (32). Approaches such as (74) make it possible to perform SVM training on very large string datasets.

4.2 Graph Kernels

The classic kernel for quantifying the similarity of two graphs is the random-walk graph kernel (17; 28), which counts matching walks in two graphs. It can be computed elegantly by means of the direct product graph, also referred to as the tensor or categorical product (26).

Definition 1.

Let G = (V, E, ℒ) be a graph with vertex set V, edge set E and a label function ℒ : V ∪ E → 𝒜 that assigns labels from an alphabet 𝒜 to nodes and edges. The direct product of two graphs G = (V, E, ℒ) and G′ = (V′, E′, ℒ′) shall be denoted as G× = G × G′. The node set V× and edge set E× of the direct product graph are defined as:

$$\begin{array}{rcl}{V}_{\times} & = & \{({v}_{i},{v\prime}_{i\prime}) : {v}_{i} \in V \wedge {v\prime}_{i\prime} \in V\prime \wedge \mathcal{L}({v}_{i}) = \mathcal{L}\prime({v\prime}_{i\prime})\} \\ {E}_{\times} & = & \{(({v}_{i},{v\prime}_{i\prime}),({v}_{j},{v\prime}_{j\prime})) \in {V}_{\times}\times {V}_{\times} : ({v}_{i},{v}_{j}) \in E \wedge ({v\prime}_{i\prime},{v\prime}_{j\prime}) \in E\prime \wedge \mathcal{L}({v}_{i},{v}_{j}) = \mathcal{L}\prime({v\prime}_{i\prime},{v\prime}_{j\prime})\}\end{array}$$
(15.3)

Using this product graph, the random walk kernel (also known as product graph kernel) can be defined as follows.

Definition 2.

Let G and G′ be two graphs, let A× denote the adjacency matrix of their product graph G×, and let V× denote the node set of the product graph G×. With a sequence of weights λ = (λ0, λ1, …) with λi ∈ ℝ and λi ≥ 0 for all i ∈ ℕ, the product graph kernel is defined as

$${k}_{\times }(G,G\prime ) ={\sum\nolimits}_{i,j=1}^{\vert {V}_{\times }\vert }{\left[{\sum\nolimits}_{k=0}^{\infty }{\lambda }_{k}{A}_{\times }^{k}\right]}_{ij}$$
(15.4)

if the limit exists.

Naively implemented, random walk kernels scale as O(n⁶), where n is the number of nodes in the larger of the two graphs, but their runtime can be reduced to O(n³) by means of Sylvester equations (91). As random walk kernels are limited in their ability to detect common (non-path-shaped) substructures, a family of graph kernels has been proposed that counts other types of matching subgraph patterns, for instance shortest paths (8), cycles (24), subtrees (55), and limited-size subgraphs (69).
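For the geometric weight choice λ_k = λ^k, the inner series in (15.4) has the closed form (I − λA×)⁻¹ whenever λ is smaller than the inverse of the largest eigenvalue of A×. The following naive sketch exploits this; node labels are ignored for brevity, so the product graph reduces to the Kronecker product of the adjacency matrices:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.01):
    """k_x(G, G') with lambda_k = lam**k; A1, A2 are adjacency matrices."""
    Ax = np.kron(A1, A2)   # direct product graph (unlabeled case)
    n = Ax.shape[0]
    # sum_k lam^k * Ax^k = (I - lam * Ax)^{-1}, provided the series converges
    S = np.linalg.inv(np.eye(n) - lam * Ax)
    return S.sum()         # sum over all entries [.]_ij

A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # path on three nodes
A2 = np.array([[0, 1], [1, 0]])                   # a single edge
print(random_walk_kernel(A1, A2))
```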

In recent work (68), a highly scalable graph kernel was presented, based on so-called subtree patterns or tree-walks. Its runtime scales as O(Nhm), where N is the number of graphs in the dataset, h the height of the subtree patterns, and m the number of edges per graph. This graph kernel is orders of magnitude faster than previous approaches, while yielding competitive or better results on several benchmark datasets.
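The following is a hedged sketch of the subtree-pattern idea; it simplifies the method of (68), for instance by counting neighbourhood tuples directly instead of hashing them to compressed labels:

```python
from collections import Counter

def subtree_pattern_features(adjacency, labels, h=2):
    """adjacency: dict node -> list of neighbours; labels: dict node -> label."""
    feats = Counter(labels.values())  # height-0 patterns: the node labels
    current = dict(labels)
    for _ in range(h):
        # relabel each node by its label plus the sorted labels of its neighbours
        current = {v: (current[v],) + tuple(sorted(current[u] for u in adjacency[v]))
                   for v in adjacency}
        feats.update(current.values())
    return feats

def subtree_kernel(g1, g2, h=2):
    f1 = subtree_pattern_features(*g1, h=h)
    f2 = subtree_pattern_features(*g2, h=h)
    return sum(c * f2[p] for p, c in f1.items())

g = ({0: [1], 1: [0, 2], 2: [1]}, {0: 'C', 1: 'N', 2: 'C'})  # a toy labeled path
print(subtree_kernel(g, g))
```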

5 Support Vector Machines in Bioinformatics

The ultimate reason why kernel methods became a central branch of statistical bioinformatics was the Support Vector Machine, which reached or outperformed the accuracy levels of state-of-the-art classifiers on numerous prediction tasks in computational biology. For a comprehensive review of Support Vector Machines in computational biology up to the year 2004, the interested reader is referred to Noble (48).

Support Vector Machines were originally defined for binary classification problems (11; 85): given two classes of data points, a positive and a negative class, one wants to correctly predict the class membership of new, unlabeled data points. Support Vector Machines tackle this task by introducing a hyperplane that separates the positive from the negative class and that maximises the margin, that is, the distance to the closest points from the positive and negative class. New data points are then predicted to be members of the positive or negative class depending on which half-space they are located in with respect to the separating hyperplane. The enormous impact of Support Vector Machines was triggered by the observation that the dual form of the Support Vector Machine optimisation problem accesses the data points only by means of inner products (60), and that this inner product can be replaced by any other inner product, that is, by another kernel function.
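The following minimal scikit-learn example illustrates this point on toy data: the SVM is trained purely from a precomputed kernel matrix and never touches explicit feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy binary labels

K = X @ X.T                                # linear kernel matrix on the data
clf = SVC(kernel='precomputed').fit(K, y)  # trained from kernel values alone

K_new = X[:5] @ X.T                        # kernel values: new vs. training points
print(clf.predict(K_new))
```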

Over the following decade, a multitude of applications of Support Vector Machines in bioinformatics emerged, which can be divided into three large branches: SVM applications on DNA/RNA sequences, proteins, and gene expression profiles. These branches differ in the biological objects or data types that they study, but they often make use of the same computational techniques. String kernels, for example, can be applied both to DNA/RNA and protein sequences.

5.1 DNA and RNA Sequences

Classification of DNA and RNA sequences via Support Vector Machines is one of the prime applications of SVMs in computational biology.

5.1.1 DNA Sequences

Several SVM-based prediction problems on DNA sequences have been studied in the literature, including secondary structure prediction from DNA sequence with an RBF kernel (25), but gene finding is the central prediction task on genomic sequences to which SVMs have been applied over recent years.

Support Vector Machines were successfully applied to various tasks in gene finding, in particular for splice site recognition. The prediction task here is to discriminate between sequences that contain a true splice site and sequences with a decoy splice site (73). The string kernel employed is the weighted degree shift kernel. It builds upon the spectrum kernel, counting matching n-mers in two strings, but the n-mers must occur at similar positions within the sequence, not at arbitrary positions as in the spectrum kernel. Multiple kernel learning techniques were employed in Sonnenburg et al. (76) to determine the sequence motifs that are predictive of true splice sites (see also Sect. 6.2). In Rätsch et al. (56), this technique was further extended to the recognition of alternatively spliced exons. It was applied both to known exons to detect alternatively spliced ones, and to introns in order to check whether they might contain a yet unknown alternatively spliced exon. In Sonnenburg et al. (78), SVMs were employed for promoter recognition in humans. The SVM used a combination of kernels on weak indicators of promoter presence, including string kernels on specific sequence motifs and properties, and a linear kernel on the stacking energy and the twistedness of the DNA. These algorithmic components were assembled into a complete system for gene finding that was used to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans (57), correctly identifying all exons and introns in 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations. A kernel-based approach was also presented for the identification of regulatory modules in euchromatic sequences (64). The prediction task here is to decide whether a promoter region is the target of a transcription factor or not. The kernel designed for this task compares the sequence regions around the best matches of a set of motifs within the sequence, and their relative positions to the transcription start site.
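A simplified sketch of the positional weighted degree kernel mentioned above is shown below; it counts n-mers that match at the same sequence positions, using the standard degree weighting, whereas the shift variant of (73) additionally tolerates small positional offsets:

```python
def weighted_degree_kernel(s, t, max_degree=3):
    """Count d-mers (d = 1..max_degree) matching at identical positions."""
    assert len(s) == len(t), "sequences are assumed aligned and of equal length"
    value = 0.0
    for d in range(1, max_degree + 1):
        beta = 2.0 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        value += beta * sum(s[i:i + d] == t[i:i + d]
                            for i in range(len(s) - d + 1))
    return value

print(weighted_degree_kernel("ACGTACGT", "ACGAACGT"))
```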

5.1.2 RNA Sequences

Support Vector Machines have also been applied in RNA research. A major classification problem that arises in this field is to decide whether an RNA sequence is a member of a functional RNA family. For this task, special-purpose kernels on RNA sequences have been defined, so-called stem kernels, which compare the stem structures that appear in the secondary structures of two RNA sequences (58; 59). The stem kernel examines all possible common base pairs and stem structures of arbitrary lengths, including pseudoknots, between two RNA sequences, and calculates the inner product of common stem structure counts. Other typical applications of SVMs in RNA research include distinguishing protein-coding from non-coding RNA (42) and predicting target genes for microRNAs (31; 92).

5.2 Proteins

A second large area of SVM applications in biology is proteomics, in particular in protein structure, function and interaction prediction.

5.2.1 Protein Sequence Comparison

Protein comparison tries to establish the similarity of two proteins in order to find proteins that belong to the same structural or functional class. This comparison can focus on different aspects of the proteins: their amino acid sequences, their (approximated) physicochemical properties, or their 3D structures.

Comparing and classifying protein sequences is one of the classic tasks in bioinformatics, and one step towards goals such as protein function prediction, protein structure prediction, fold recognition, or remote homology detection. Kernels on sequences, in combination with Support Vector Machines, contributed to the field of sequence comparison by enabling discriminative classification of sequences. This area of kernel methods in bioinformatics witnessed a lot of work on kernel design, resulting in a number of conceptually different kernels, which we describe in the following.

The Fisher kernel combines Support Vector Machines with Hidden Markov Models for protein remote homology detection (27). The Hidden Markov Model is trained on protein sequences from the positive class and then applied to all proteins in the training and test set to derive a feature vector representation of each protein in terms of a gradient vector. This Fisher kernel, used within an SVM, outperformed classic sequence alignment techniques such as BLAST (1) in protein homology detection. The Fisher kernel was later generalised to the class of marginalised kernels on sequences (82): these kernels apply to all objects that are generated from latent variable models (e.g., HMMs). The central idea is to first define a joint kernel for the complete data, which includes both visible and hidden variables. The marginalised kernel for visible data is then obtained by taking the expectation with respect to the hidden variables.

Ding and Dubchak (14) derived feature vector representations of the physicochemical properties of proteins from their amino acid sequences and then used these vectors, a kernel on vectors, and SVMs to predict the SCOP fold membership of proteins (47). The physicochemical properties for these composition kernels were derived by means of amino acid indices (30): these indices are tables that map each amino acid type to one scalar that approximately describes a physicochemical property of this amino acid, for instance its polarity, polarizability, van der Waals volume, or hydrophobicity. Cai et al. (13) used a similar approach to classify proteins into structural classes.

Motif kernels, as defined by Logan et al. (43) and Ben-Hur and Brutlag (2), are an alternative way of representing a protein sequence by a vector whose components indicate motif occurrence or absence. Logan et al. (43) use weight matrix motifs from the BLOCKS database (23), which are derived from multiple sequence alignments and occur in highly conserved, and often functionally important, regions of the proteins. These motifs are compared to proteins and the resulting scores are used as feature vector representations of the proteins. Ben-Hur and Brutlag (2) employ motifs from the eBLOCKS database of discrete sequence motifs (80), and show how to efficiently compute the resulting motif kernel using a trie data structure.

Liao and Noble (41) defined a different feature vector representation of protein sequences, resulting in an empirical kernel that directly uses existing sequence alignment techniques: for a set of n proteins, they first compute an n × n matrix of sequence similarity scores (for instance, Smith-Waterman scores (70)) and then represent each protein by its corresponding vector of sequence similarity scores in this matrix.
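A short sketch of this empirical kernel map follows; alignment_score is a placeholder for any sequence similarity measure, for example a Smith-Waterman implementation:

```python
import numpy as np

def empirical_kernel_matrix(proteins, alignment_score):
    """Each protein is represented by its row of similarity scores; the kernel
    is the inner product between these score vectors."""
    S = np.array([[alignment_score(p, q) for q in proteins] for p in proteins])
    return S @ S.T
```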

The most recent class of protein sequence kernels are string kernels that count common substrings in two strings (see Sect. 4.1). These kernels either require exact matches (36), allow for a limited number of mismatches (39), or allow for substitutions, gaps or wildcards (38).

Further kernels on sequences have been defined which take local properties of the sequence (44) and local alignments (88) into account for specific prediction tasks, such as subcellular localisation prediction.

5.2.2 Protein Structure Comparison

As our ability to determine protein structures advances more rapidly than our ability to study their function, function prediction from protein structure has gained more and more attention in computational biology. Dobson and Doig (15) described 1,178 protein structures as vectors by means of simple features such as secondary-structure content, amino acid propensities, surface properties and ligands, and then classified them into enzymes and non-enzymes via Support Vector Machines. Borgwardt et al. (9) modeled proteins from the same dataset as graphs, in which nodes represent secondary structure elements and edges represent neighbourhood of these elements along the amino acid chain or in 3D space. They then employed a random walk graph kernel on these graph models to perform function prediction, improving over the results achieved by Dobson and Doig (15). On other benchmark datasets for functional and structural classification, Qiu et al. (52) showed that a kernel that employs similarity scores based on the structural alignment tool MAMMOTH (51) outperforms the previous vector- and graph-based approaches.

5.2.3 Protein Interaction Prediction

A third central topic in computational proteomics is the prediction of protein–protein interactions, due to the numerous false-positive and false-negative edges in currently known protein–protein interaction networks. This problem can be cast as a binary classification problem: a pair of proteins is predicted to interact (positive class) or not (negative class). Bock and Gough (5) defined the first Support Vector Machine approach to this problem, in which they represented each pair of proteins as a concatenated feature vector of physicochemical and surface properties of these two proteins. Ben-Hur and Noble (3) further refined this approach by defining a pairwise tensor kernel k_tensor on two pairs of proteins (a, b) and (c, d):

$${k}_{\mathit{tensor}}((a,b),(c,d)) = {k}_{\mathit{single}}(a,c){k}_{\mathit{single}}(b,d) + {k}_{\mathit{single}}(b,c){k}_{\mathit{single}}(a,d),$$
(15.5)

where k_single measures the similarity between two proteins based on their sequences, gene ontology annotations, local properties of the network, and homologous interactions in other species. Two pairs of proteins are similar under this kernel if, for each protein in one pair, a protein with similar properties can be found in the other pair.

A drawback of the tensor product kernel is that the similarity or dissimilarity of the proteins within one pair is not taken into account. This changed when the metric learning pairwise kernel k_mlpk was defined (87):

$${k}_{\mathit{mlpk}}((a,b),(c,d)) = {[(\phi (a) - \phi (b))\prime (\phi (c) - \phi (d))]}^{2},$$
(15.6)

which directly compares the within-pair difference vectors (ϕ(a) − ϕ(b)) and (ϕ(c) − ϕ(d)) to each other, and improves upon the prediction accuracy of the tensor kernel.
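Both pairwise kernels can be transcribed directly from (15.5) and (15.6); since k_single(a, c) = ⟨ϕ(a), ϕ(c)⟩, the metric learning pairwise kernel expands into four base kernel evaluations, so ϕ never needs to be computed explicitly. A sketch, with k an arbitrary base kernel function:

```python
def k_tensor(k, a, b, c, d):
    """Pairwise tensor kernel (15.5) from a base kernel k."""
    return k(a, c) * k(b, d) + k(b, c) * k(a, d)

def k_mlpk(k, a, b, c, d):
    """Metric learning pairwise kernel (15.6):
    [(phi(a) - phi(b))' (phi(c) - phi(d))]^2 expanded via the base kernel."""
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2
```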

The pairwise tensor kernel and a Gaussian Radial Basis Function (RBF) kernel that considers the within-pair similarity of proteins were used in a recent study to predict co-complex membership of protein pairs in yeast (53). The tensor kernel was based on a kernel k_single that is a weighted sum of kernels, including three kernels on protein sequences and three diffusion kernels which measure proximity of the proteins within a physical or genetic interaction network. The Gaussian RBF kernel was computed on features that reflect coexpression, coregulation, colocalisation, similar gene ontology annotation and interologs of the proteins within a pair.

All the kernel methods for protein-interaction prediction via SVMs have in common that they treat the existence of interactions as pairwise independent events, that is, the existence of one interaction does not make the existence of other interactions more or less likely.

5.2.4 Other Kernel Applications in Proteomics

Other applications of SVMs in proteomics mainly involve protein function prediction from data sources other than sequence or structure; we describe some representative examples here. In one of the early studies in this direction, a kernel on trees was defined for function prediction from phylogenetic profiles of proteins (86). Tsuda and Noble (83) present an approach for predicting the function of unannotated proteins in protein-interaction or metabolic networks. Their method uses a locally constrained diffusion kernel, which maximises the von Neumann entropy of the network, to measure similarity between nodes, and a Support Vector Machine for annotating proteins of unknown function.

5.3 Gene Expression Profiles

Another popular field of SVM applications is prediction based on microarray gene expression measurements. Existing kernels on vectors, such as the linear, polynomial and Gaussian RBF kernels, can be readily applied here without involved kernel design.

5.3.1 Diagnosis and Prognosis

The most common task in this field is to predict the phenotype of a patient based on his or her gene expression levels, primarily for disease diagnosis or for drug response prediction. The first study of this kind was conducted by Mukherjee et al. (46) on the dataset of gene expression levels of two classes of leukemia patients from Golub et al. (18), telling apart these two subtypes of leukemia using a linear kernel and an SVM. Many similar studies followed, each focusing on a particular task of diagnosis or prognosis. The first kernel for time series of microarrays was defined in Borgwardt et al. (10): gene expression profiles of multiple sclerosis patients were compared by means of a dynamical systems kernel (90) to predict their response to treatment with the drug beta-interferon.

5.3.2 Function Prediction

SVMs on gene expression levels were also used for gene function prediction. Here, a gene is represented as a vector of its expression levels across different conditions, tissues or patients. The underlying assumption is that two genes are functionally related if they exhibit similar expression levels under different external conditions. The first study in this direction (12) predicted the membership of 6,000 yeast genes in five functional classes from the MIPS Yeast Genome Database (45).

6 Kernel Methods Beyond Classification

While Support Vector Machines are clearly the most popular kernel method in bioinformatics, there are also learning problems in bioinformatics that require algorithmic machinery and statistical tests other than classification or regression.

6.1 Data Integration for Network Inference

First, several kernel methods for data integration, in particular on networks, have been defined.

Kato et al. (29) model protein interaction prediction as a kernel matrix completion problem. In their setting, they are given a large dataset of proteins with different types of information on these proteins, including gene expression levels, protein localisation, and phylogenetic profiles. They represent each of these data types by a ‘large’ kernel matrix. They are also given the true protein interactions between a small subset of all proteins, which they convert into a ‘small’ kernel matrix. They then define an algorithm for completing the small kernel matrix by means of the information from the large kernel matrices, thereby inferring the missing, unknown interactions.

Yamanishi et al. (95) also define a supervised approach to protein network inference from multiple types of data including gene expression, localisation information and phylogenetic profiles. They combine ideas from spectral clustering and kernel canonical correlation analysis to derive features that are indicative of protein interaction. This technique is further refined in Yamanishi et al. (96) for enzyme network inference by enforcing chemical constraints to be fulfilled by the resulting network structure.

6.2 Feature Selection

Second, feature selection is an important problem in computational biology, as identifying the features that are relevant for an accurate prediction is essential for understanding the underlying biological process.

A typical example of the relevance of feature selection in bioinformatics is gene selection from microarray data. Support Vector Machine-based approaches to feature selection were defined early on, which recursively eliminate irrelevant features (21) or iteratively downscale the less informative features (94).
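As a hedged sketch of the recursive elimination scheme in the spirit of (21), the following uses scikit-learn's RFE with a linear SVM; X (expression profiles) and y (phenotype labels) are assumed inputs:

```python
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

def select_genes(X, y, n_genes=20):
    """Recursively drop the genes with the smallest linear-SVM weights."""
    svm = SVC(kernel='linear')
    rfe = RFE(estimator=svm, n_features_to_select=n_genes, step=0.1)
    rfe.fit(X, y)
    return rfe.support_   # boolean mask of the selected genes
```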

Borgwardt et al. (9) and Sonnenburg et al. (76) employed multiple kernel learning for feature selection, to weight the different kernels used by a Support Vector Machine. In (9), hyperkernels were used to determine which node attributes in a graph model of protein structure were most important for correct protein function prediction. These nodes represented alpha-helices or beta-sheets in the tertiary structure of the protein, and their attributes were their length in amino acids and in Ångström, and statistics on their hydrophobicity, polarity, polarizability and van der Waals volume. Among all these attributes, hyperkernel learning assigned the largest weight to the amino acid length.

In (76), multiple kernel learning was used to determine those sequence motifs that are most relevant for correct splice site recognition. Each kernel represented one single sequence motif at a specific sequence position, and multiple kernel learning determined the weight for each of these motifs, resulting in a set of position-specific sequence patterns that are associated with true splice sites. This technique was further refined in Sonnenburg et al. (77), now taking the overlap in sequence between different substrings into account and making it possible to assess the importance of (consensus) sequence motifs for correct prediction, even if they do not occur in the given collection of sequences.

Song et al. (71) define a kernel-based framework for gene selection from microarray data. They show that many of the vast number of feature selection algorithms from the microarray literature are in fact instances of this framework, obtained by a different choice of kernel and/or a particular type of normalisation. New gene selection algorithms can easily be derived from this framework, even for regression and multi-class settings, and existing techniques can be compared objectively by replacing one kernel with another while keeping other properties fixed, such as the normalisation technique employed.
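The kernel dependence measure at the heart of this framework is the Hilbert-Schmidt independence criterion (HSIC): genes can be ranked by how strongly their expression values depend on the class labels. A minimal sketch of the (biased) empirical HSIC estimate from two kernel matrices:

```python
import numpy as np

def hsic(K, L):
    """Biased HSIC estimate from kernel matrices K (features) and L (labels)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```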

6.3 Statistical Tests

Third, a recent development in machine learning is kernel-based statistical tests (19; 20), which have led to a first application in bioinformatics: Borgwardt et al. (7) define a kernel-based statistical test to check the cross-platform comparability of microarray data. This two-sample test, whose goal is to establish whether two samples were drawn from the same distribution or not, computes as its test statistic the distance between the means of the two samples in a universal reproducing kernel Hilbert space (79). The larger this distance, the smaller the probability that the two samples originate from the same distribution. In experiments on microarray cross-platform comparability, the test manages to clearly distinguish between samples of microarray measurements that were generated on the same platform and those from different platforms.
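A minimal sketch of this test statistic, the (biased) maximum mean discrepancy between two samples under an RBF kernel, is given below; choosing the kernel bandwidth and the rejection threshold is beyond this sketch:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Squared distance between the kernel mean embeddings of samples X and Y."""
    return (rbf(X, X, sigma).mean() + rbf(Y, Y, sigma).mean()
            - 2 * rbf(X, Y, sigma).mean())
```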

6.4 Kernel Methods for Structured Output

Fourth, another recent development in kernel machine learning is kernel methods for structured output domains. The classic Support Vector Machine was designed for binary classification problems and for data objects drawn i.i.d. (independently and identically distributed) from an underlying distribution. However, many prediction problems in biology are multi-class problems, and predictions on different objects can depend strongly on each other.

For instance, if one wants to annotate a DNA sequence in gene finding, the predicted label of a nucleotide (e.g., exonic or intronic) is highly dependent on those of the neighbouring bases. This is often referred to as the label sequence learning problem: given a sequence of n letters, one wants to predict a sequence of n class labels. Hidden Markov Models are the classic tool for this problem in computational biology (16). Conditional random fields (33) were developed as a discriminative alternative to the generative model that Hidden Markov Models are based upon. Kernel-based discriminative approaches to this problem have recently been defined in machine learning as well, and employed successfully for sequence alignment (6; 65), gene finding and genome annotation (57; 66), and tiling array analysis (97; 98). A general approach to Support Vector Machine classification in multiclass and structured output domains was proposed by Tsochantaridis et al. (81), and promises to trigger further research in this direction in computational biology.

6.5 Outlook

In our opinion, the success story of kernel methods in bioinformatics will continue over the next decade. The strength of kernels in dealing with structured objects will lead to more applications of kernels in biological network analysis. Their ability to elegantly handle high-dimensional data and to integrate various data sources will make them an attractive tool for tasks such as genome-wide association studies. Furthermore, the ability to encode prior knowledge in the kernel function will foster the use of kernel methods in various specialised prediction tasks in computational biology.