Stochastic feature mapping for PAC-Bayes classification
Abstract
Hidden information derived from probabilistic generative models of data distributions can be used to construct features for discriminative classifiers. This observation has motivated approaches that couple generative and discriminative models for classification. However, existing approaches typically feed features derived from generative models to discriminative classifiers without refining the generative models or the feature mapping functions based on classification results. In this paper, we propose a coupling mechanism, developed under the PAC-Bayes framework, that can fine-tune the generative models and the feature mapping functions iteratively to improve the classifier's performance. In our approach, a stochastic feature mapping, which is a function over the random variables of a generative model, is derived to generate feature vectors for a stochastic classifier. We construct a stochastic classifier over the feature mapping and derive the PAC-Bayes generalization bound for the classifier, for both supervised and semi-supervised learning. This allows us to jointly learn the feature mapping and the classifier by minimizing the bound with an EM-like iterative algorithm using labeled and unlabeled data. The resulting framework integrates the learning of the discriminative classifier and the generative model, and allows iterative fine-tuning of the generative models and the feed-forward feature mappings based on task-performance feedback. Our experiments in three distinct applications show that this new framework produces a general classification tool with state-of-the-art performance.
Keywords
Stochastic feature mapping · PAC-Bayes generalization bound · Hybrid generative-discriminative classification

1 Introduction
Probabilistic generative models that seek to model data distributions are adept at exploiting hidden information, at dealing with structured data (e.g. protein sequences of variable length), and at solving nonlinear classification problems by means of maximum a posteriori (MAP) classifiers, while discriminative models, designed to find decision boundaries among different classes based on extracted features, still furnish most of the state-of-the-art tools for classification. A number of promising methods (Jaakkola and Haussler 1999; Jaakkola et al. 1999; Raina et al. 2003; McCallum et al. 2006; Li et al. 2010, 2011; Perina et al. 2012) have been developed to exploit the complementary strengths of these two major paradigms (Jaakkola et al. 1999; Ng and Jordan 2002). These methods can be roughly categorized into two classes based on how they couple the generative and discriminative models: methods with explicit feature mappings (Jaakkola and Haussler 1999; Perina et al. 2012; Li et al. 2011) and methods without explicit feature mappings (Jaakkola et al. 1999; Raina et al. 2003; McCallum et al. 2006). In this paper, we focus on the first class, as it is more flexible and can be used directly in discriminative classifiers.
Methods with explicit feature mappings, called generative feature mappings or generative score spaces (Jaakkola and Haussler 1999; Perina et al. 2012; Li et al. 2011), are motivated by two findings from earlier work on classification: (1) generative models can provide useful information from their parameters and variables to construct feature mappings, and can simultaneously transform structured data of variable length into data in a fixed-dimension feature space; (2) discriminative models are effective at finding decision boundaries in such a feature space. A feature mapping is a function over the hidden variables, observed variables and model parameters. It transforms a data point into a feature vector for the classifier. While these existing methods have tried to exploit the power of generative models in uncovering hidden information, the generative models and the classifiers in these methods are insulated from each other, and the resulting feature mappings can be suboptimal. Thus, it is desirable to develop a closed-loop coupling mechanism that allows the generative models and the feature mappings to be fine-tuned based on classification performance.
PAC-Bayes theory (McAllester 1999; Seeger 2002; McAllester 2003; Langford 2006; Lacasse et al. 2006; Germain et al. 2009; Seldin et al. 2012; Tolstikhin and Seldin 2013) can potentially provide a framework to learn feature mappings and classifiers jointly, allowing the fine-tuning of the feature mapping. PAC-Bayes is a theory for bounding the generalization error of classifiers, where classifiers are learned by minimizing the generalization bound with respect to the parameters of the classifiers over the training set. Similarly, feature mappings can also be learned by minimizing the generalization bound with respect to the quantities that define the feature mappings.
The main contributions of this work are:
(1) We derive a stochastic feature mapping that is effective in capturing generative (distribution and hidden) information in the data;
(2) We derive a PAC-Bayes generalization bound for the stochastic classifier over this stochastic feature mapping, for both supervised and semi-supervised learning;
(3) We develop a joint learning approach that learns the feature mapping and the classifier by minimizing the derived bound.
Our experiments demonstrate that:
(1) the proposed stochastic feature mapping and its generalization bound can effectively exploit hidden variables in the classification process, yielding state-of-the-art classification performance;
(2) the proposed method performs well even when the 'capacity' of the generative models is small, suggesting that it is efficient in both inference and learning;
(3) when the number of labeled training examples is limited, unlabeled data can be used to bootstrap the training of the classifier and improve performance.
The notation list
Obj. | Description | Obj. | Description

\(\mathbf {x}\) | Input data | \(y\) | Output label
\(\theta \) | Model parameters | \(\mathbf {h}\) | Hidden variables
\(Q(\mathbf {h} \mid \mathbf {x})\) | Posterior distribution | \(Q(\mathbf {h})\) | \(\int Q(\mathbf {h} \mid \mathbf {x}) P(\mathbf {x})\, d \mathbf {x}\)
\(\phi \) | Feature mapping | \(\bar{\phi }\) | \(\bar{\phi }=\phi / \parallel \phi \parallel \)
\(f_Q\) | Stochastic classifier | \(\mathbf {w}\) | Weight, \(\mathrm {E}_{Q}[\mathbf {w}]=\mathbf {u}\)
\(D\) | Unknown distribution \(P(\mathbf {x},y)\) | \(S\) | Training set with size \(m=|S|\)
\(R_D(f_Q)\) | True risk | \(R_S(f_Q)\) | Empirical risk
\(e(f_Q)\) | Risk for labeled data | \(d(f_Q)\) | Risk for unlabeled data
\(S_l\) | Labeled set with size \(m_l=|S_l|\) | \(S_u\) | Unlabeled set with size \(m_u=|S_u|\)
2 Related work
2.1 Generative score spaces
Generative feature mappings (Jaakkola and Haussler 1999; Tsuda et al. 2002; Smith and Gales 2002; Holub et al. 2008; Perina et al. 2012; Li et al. 2011) are a class of methods designed to exploit generative information for discriminative classification. Feature mappings are scores or measures computed over the generative models. They are functions over the observed variables, hidden variables, and parameters of the generative models. The space spanned by a feature mapping is called a score space or feature space.
2.2 PAC-Bayes generalization bounds
PAC-Bayes (McAllester 1999; Seeger 2002; McAllester 2003; Langford 2006; Lacasse et al. 2006; Germain et al. 2009; Seldin et al. 2012; Tolstikhin and Seldin 2013) is a theory for bounding the generalization error of classifiers. A variety of PAC-Bayes generalization bounds (McAllester 1999; Seeger 2002; McAllester 2003; Langford 2006; Lacasse et al. 2006; Germain et al. 2009; Seldin et al. 2012; Tolstikhin and Seldin 2013) have been proposed for different classifiers, such as deterministic classifiers, Gibbs classifiers (McAllester 1999), and linear or nonlinear classifiers (e.g. Gaussian processes (Seeger 2002)). The Gibbs classifier, which we will use, is a stochastic classifier that usually operates under a majority-voting decision rule.
PAC-Bayes can bound classifiers built from different discriminative criteria, for example the large-margin criterion. The generalization bounds derived from PAC-Bayes theory can be expressed in two typical forms: an implicit form, which bounds the difference between the empirical risk and the true risk (Seeger 2002; Langford 2006; Lacasse et al. 2006), or an explicit form, which bounds the true risk directly (McAllester 2003; Germain et al. 2009). Tighter bounds (Seldin et al. 2012; Tolstikhin and Seldin 2013) are also available. In this paper, we focus on explicit bounds because they allow us to derive an analytic solution for the posteriors of the hidden variables, with higher computational efficiency.
Our proposed method is related to transductive methods (Joachims 1999, 2003), which exploit both labeled and unlabeled data for classification. Unlike those methods, which explicitly infer the labels of unlabeled examples, our method instead minimizes the error rate on unlabeled examples. These methods work particularly well when the labeled training set is relatively small.
3 Stochastic feature mapping from free energy lower bound
Exploiting generative information, i.e., hidden variables, observed variables and the data distribution, for discriminative classification (Jaakkola and Haussler 1999; Holub et al. 2008; Perina et al. 2012; Li et al. 2011) has shown promise in a variety of real-world applications. One way to achieve this is to derive a feature mapping from probabilistic generative models.
This section aims to derive a feature mapping that exploits generative information. Given a generative model with observed variable \(\mathbf {x}\), hidden variable \(\mathbf {h}\) and parameter \(\theta \), the problem is to find a feature mapping \(\phi (\mathbf {x},\mathbf {h})\) over the random variables. Our method is to fish out the informative components from the free energy, the lower bound of the log-likelihood of the generative model. The feature mapping takes a stochastic rather than a deterministic form. The stochastic form makes the generalization bound easier to derive and optimize. Furthermore, the feature mapping is not an explicit function of the parameters, which simplifies the estimation of the model parameters (see Sect. 5.3).
3.1 Formulation
3.2 An illustrative example
Hence, each 1-D data point x is mapped to a 12-D feature space in this case. Figure 2 illustrates only two feature dimensions from \(T(x,\mathbf {z})\), i.e. \(z_1 x\) and \(z_2 x\), which already produce a feature space in which the projected data points are linearly separable, greatly simplifying the classification problem.
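To make the example concrete, the following sketch computes just the two displayed dimensions \((z_1 x, z_2 x)\), where the one-hot indicator z is sampled from the posterior of a two-component GMM. The GMM parameters (`means`, `sigma`) and the function name are illustrative, not the paper's actual settings:

```python
import numpy as np

def two_feature_dims(x, means=(-2.0, 2.0), sigma=1.0, rng=None):
    """Map a scalar x to the two dimensions (z1*x, z2*x) of T(x, z).

    z is a one-hot indicator sampled from the posterior Q(z | x) of a
    two-component GMM with equal priors (parameters here are made up
    for illustration only).
    """
    rng = np.random.default_rng(rng)
    # Unnormalized Gaussian likelihoods of x under each component.
    lik = np.exp(-0.5 * ((x - np.asarray(means)) / sigma) ** 2)
    post = lik / lik.sum()              # posterior Q(z | x)
    k = rng.choice(2, p=post)           # sample the component indicator
    z = np.eye(2)[k]                    # one-hot hidden variable z
    return z * x                        # feature dims (z1*x, z2*x)
```

Because z is one-hot, exactly one of the two dimensions carries the value x; which one is active reflects which mixture component likely generated x, which is what makes the mapped points linearly separable in Fig. 2.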
4 Stochastic classifier and generalization bound
Given the stochastic feature mapping (Eq. 5), the goal of this section is to derive a generalization error bound for a stochastic classifier (Eq. 6) equipped with the feature mapping, for both supervised and semi-supervised learning. Our method decomposes the risk term into two parts, one for labeled data and one for unlabeled data. The error bound allows us to learn an effective feature mapping for classification in a discriminative manner, by minimizing the bound with respect to the parameters of the feature mapping.
To obtain this error bound, we use a stochastic classifier over the feature mapping. There are two reasons for using a stochastic feature mapping and a stochastic classifier instead of a deterministic classifier: (1) the general setting of PAC-Bayes theory assumes a stochastic form, which allows a simple derivation of the generalization error bound; (2) the stochastic form also allows the resulting model to be solved with a simple algorithm.
4.1 Linear stochastic classifier over feature mapping
4.2 Classification risk of stochastic classifier
Lemma 1
The proof of this lemma can be found in the Appendix. Note that the classifier \(f = \mathrm {sign}[\mathbf {w}\cdot \phi (\mathbf {x},\mathbf {h})]\) (Eq. 6) is parameterized by \(\mathbf {w}\); therefore \(f_1\sim Q(f)\) means \(\mathbf {w}_1 \sim Q(\mathbf {w})\), i.e., the weight \(\mathbf {w}\) of a stochastic classifier is sampled from the posterior distribution of f. \(e_S\) is a measure of the variance of the classification error and is estimated from the labeled data; \(d_S\) measures the disagreement of the classification and is estimated from the unlabeled data.
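As a concrete illustration of such a stochastic (Gibbs) classifier, the sketch below simulates the majority-vote rule by drawing weights from a Gaussian posterior \(Q(\mathbf {w})=N(\mathbf {u},\mathrm {I})\) (the form chosen later in Sect. 5.1). The function name and sample count are illustrative assumptions:

```python
import numpy as np

def gibbs_predict(phi_x, u, n_samples=100, rng=None):
    """Predict by majority vote over classifiers w ~ Q(w) = N(u, I).

    phi_x : 1-D feature vector phi(x, h) for one example.
    u     : mean of the Gaussian posterior over the weight vector.
    """
    rng = np.random.default_rng(rng)
    d = len(u)
    # Sample n_samples weight vectors from the posterior N(u, I).
    W = rng.normal(loc=u, scale=1.0, size=(n_samples, d))
    votes = np.sign(W @ phi_x)            # each sampled classifier votes
    return 1 if votes.sum() >= 0 else -1  # majority-vote decision
```

Each draw of \(\mathbf {w}\) yields a different linear classifier; the Gibbs classifier's risk averages over these draws, which is what the bounds in this section control.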
4.3 Generalization bound for classification risk
Having defined the stochastic classifier over the feature mapping and derived the classification risks, we now proceed to derive the generalization bound for the classifier using PACBayes theory. We can learn the stochastic feature mapping discriminatively and train the stochastic classifier over the feature mapping by minimizing the error bound.
In this derivation, although tighter bounds (Seldin et al. 2012; Tolstikhin and Seldin 2013) are available, we prefer an explicit bound on the true risk \(R_D(f_Q)\), which allows an analytical derivation of the posterior Q. We choose to bound the true risk following the one-sided version in McAllester (2003) and use the explicit bound in Keshet et al. (2011). Considering the measures \(\mathrm {kl}(q \parallel p)=q \ln \frac{q}{p}+(1-q)\ln \frac{1-q}{1-p}\) and \(\mathrm {KL}(Q\parallel P)=\mathrm {E}_Q[\log \frac{Q}{P}]\), we have the following bound.
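The binary divergence \(\mathrm {kl}(q \parallel p)\) defined above can be computed directly; a minimal sketch (the function name is ours):

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) = q ln(q/p) + (1-q) ln((1-q)/(1-p)), for q, p in (0, 1).

    This is the KL divergence between Bernoulli(q) and Bernoulli(p); it
    is zero iff q == p and grows as the empirical risk q moves away
    from the true risk p.
    """
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
```

In PAC-Bayes bounds of the implicit form, \(\mathrm {kl}(R_S \parallel R_D)\) is the quantity bounded in terms of \(\mathrm {KL}(Q \parallel P)\) and the sample size.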
Theorem 1
The proof of this theorem is summarized in the Appendix. Note that the theorem differs from the bounds in McAllester (2003), Seeger (2002) and Lacasse et al. (2006) by the extra variable \(\mathbf {h}\) introduced along with the stochastic feature mapping. This bound has a parameter \(\epsilon \) and is difficult to minimize. However, the following theorem reformulates it as a more practical bound that can be minimized directly.
Theorem 2
The proof of the theorem can be found in the Appendix. Here we extend the bound to accommodate both labeled and unlabeled data for semisupervised learning. Letting \(S_l\) be the labeled training set, \(S_u\) be the unlabeled training set, \(S=S_u\cup S_l\), we have the following theorem.
Theorem 3
The proof of this theorem can be found in the Appendix.
Remarks
This bound allows classifiers to exploit unlabeled data, since \(d_S(f_Q)\) does not involve class labels. Minimizing \(d_S(f_Q)\) will contract the posteriors over the stochastic classifier and the stochastic feature space, reducing the uncertainty or ambiguity in classification and feature mapping. In the above bound, we use \(S=S_u\cup S_l\) instead of \(S_u\) to build the risk term for unlabeled data, because the labeled set \(S_l\) can simultaneously be used as an unlabeled set. Note that the above semi-supervised bound is different from that in Lacasse et al. (2006), which is over the variance of the classification risk.
We also derived a semi-supervised bound on the basis of the explicit bound proposed in Germain et al. (2009). However, in our experiments, we found that the solutions for the classifier and the generative model are difficult to find by optimization, as they are sensitive to the specification of the parameters and the initial weights of the classifier (Germain et al. 2009). In the remainder of this paper, we will show that the bound derived in Theorem 3 can be minimized effectively using an EM-like algorithm and can produce generative-model and classifier solutions that yield satisfactory classification performance.
5 Learning and inference
5.1 Specification and expression
To derive the four expressions required in Eq. (9), we first need to specify the form of the stochastic classifier. We consider the linear stochastic classifier in Eq. (6); in this case, \(f_Q=f_\mathbf {w}\). Then, as was done in Langford (2006), we choose the prior of the weight \(\mathbf {w}\) to be Gaussian, \(P(\mathbf {w})= N(0,\mathrm {I})\), and its posterior also to be Gaussian but with a different mean, \(Q(\mathbf {w})= N(\mathbf {u},\mathrm {I})\).
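With this choice of prior and posterior, the KL term in the bound has a simple closed form: the trace and log-determinant terms cancel because both covariances are the identity, leaving \(\mathrm {KL}(Q \parallel P) = \Vert \mathbf {u}\Vert ^2/2\). A small sketch of this standard identity (illustrative, not from the paper's code):

```python
import numpy as np

def kl_gaussian_posterior_prior(u):
    """KL( N(u, I) || N(0, I) ) for d-dimensional Gaussians.

    The general formula 0.5*(tr(S) + u'u - d - ln det S) reduces to
    ||u||^2 / 2 when the posterior covariance S is the identity.
    """
    u = np.asarray(u, dtype=float)
    return 0.5 * float(u @ u)
```

This is why minimizing the bound with respect to \(\mathbf {u}\) resembles training a margin classifier with an \(\ell_2\) regularizer on the mean weight vector.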
5.2 The objective function
The unlabeled data benefit the classifier in two ways: (1) by shaping the feature space so that the mapped features are more distinct for the classifier; (2) by providing more data to train generative models. We will describe an EMlike iterative algorithm (Jordan et al. 1999) that can be used to minimize J in Eq. (16).
5.3 Inference and parameter estimation
5.4 A toy example
Figure 3a visualizes the decision boundaries of the supervised version (blue) and the semi-supervised version (green) of SFM-GMM, where the test accuracies are 78.13 and \(81.25\,\%\) respectively. In general, both supervised and semi-supervised SFM-GMM separate the two categories appropriately. Figure 3b presents the negative log-likelihood for supervised SFM-GMM as a function of the number of iterations. With the pre-trained GMM, our approach reaches convergence within about 20 iterations.
6 Experiments
In this section, we evaluate the proposed stochastic feature mapping (SFM) and related methods empirically on general classification tasks, scene recognition, and protein sequence classification. We seek to demonstrate three advantages of SFM: (1) the proposed stochastic feature mapping and its generalization bound can effectively exploit information from generative models for classification, producing results that are competitive with several state-of-the-art methods; (2) SFM shows satisfactory performance when the 'capacity' of a generative model is small, meaning that SFM is efficient in inference and learning; (3) when the amount of labeled training data is small, unlabeled data can help train the generative models, resulting in improved performance.
6.1 Overall testing approach and evaluation strategies
For each of these multi-class classification problems, we break it down into binary classification problems, each of which is a one-versus-rest classification that distinguishes one class from all the others. We test each binary classification problem on 20 random partitions and report the average accuracy. For each application, we perform three groups of experiments to verify the three advantages of the proposed SFM method stated above: (1) we randomly partition the positive examples into \(50\,\%\) training and \(50\,\%\) test sets, and do the same for the negative examples; (2) we vary the capacity of the generative models (e.g., the number of mixture centers) to evaluate how capacity affects performance; (3) in the semi-supervised scenario, we vary the percentages of labeled and unlabeled training examples to evaluate whether and how the unlabeled data improve classifier performance.
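The one-versus-rest evaluation protocol described above can be sketched as follows; the classifier is a plug-in placeholder, and all names are illustrative:

```python
import numpy as np

def one_vs_rest_accuracy(X, y, target, train_and_predict, n_splits=20, rng=None):
    """Average test accuracy of `target`-vs-rest over random 50/50 splits.

    train_and_predict(X_tr, y_tr, X_te) -> predicted +/-1 labels; any
    classifier with this interface can be plugged in.
    """
    rng = np.random.default_rng(rng)
    y_bin = np.where(y == target, 1, -1)     # one class vs. all the rest
    accs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        half = len(y) // 2                   # 50% train / 50% test split
        tr, te = idx[:half], idx[half:]
        pred = train_and_predict(X[tr], y_bin[tr], X[te])
        accs.append(np.mean(pred == y_bin[te]))
    return float(np.mean(accs))
```

Averaging over 20 random partitions, as the experiments do, reduces the variance of the reported accuracy due to any single train/test split.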
For each problem, a generative model appropriate for the database has to be chosen. We use Gaussian mixture models (GMM) for the UCI datasets, latent Dirichlet allocation (LDA) for the scene dataset, and hidden Markov models (HMM) for the protein sequence datasets. Accordingly, our approach is called SFM-GMM, SFM-LDA and SFM-HMM in the three applications, to indicate the generative model used.

LMKL (localized multiple kernel learning) (Gönen and Alpaydin 2008) is a state-of-the-art classifier. We use the authors' toolbox^{1}, with a linear kernel and a 2-degree polynomial kernel.

PBGD3 (PAC-Bayes gradient descent) (Germain et al. 2009) is a classifier also derived by minimizing a PAC-Bayes generalization bound. We implement this algorithm following the authors' suggestions, with confidence parameter \(\delta =0.05\), C chosen by cross-validation, and the number of random initializations \(k = 10\).

SVM (support vector machine) is a popular classifier. We use the popular toolbox LIBSVM (Chang and Lin 2011)^{2} with an RBF kernel. The cost is set to \(C=1\), and the bandwidth parameter is chosen by cross-validation around \(\gamma =1/\#\)features.

TSVM (transductive SVM) (Joachims 1999) is a state-of-the-art semi-supervised classifier. We use the toolbox^{3} released by the authors, with parameters chosen by cross-validation.

MAP (maximum a posteriori): probabilistic generative models with a maximum a posteriori decision rule. The models are the same as those used in FS and FESS.

SFM (stochastic feature mapping, our approach): we implement Algorithm 1. Since the solution for \(\mathbf {u}\) can be trapped in local minima, we typically repeat the optimization \(2 \!\sim \! 6\) times, each time with a different random initial point within the range \([-10,10]\), to obtain a satisfactory solution. As discussed in Sect. 5, we augment the unlabeled set to \(S_u \cup S_l\). The maximum number of iterations is set to 20 for Experiment I and 30 for Experiments II and III.

FS (Fisher score) (Jaakkola and Haussler 1999): we implement FS-LDA and FS-HMM following the suggestions of the authors and Chatfield et al. (2011). The parameters of the generative models, i.e., the numbers of mixture centers, topics and hidden states, are chosen by cross-validation.

FESS (free energy score space) (Perina et al. 2012): we implement FESS-LDA according to the authors' suggestions, and use the authors' toolbox for FESS-HMM^{4}.
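The multiple-restart strategy described for SFM above can be sketched generically; `minimize` stands in for any local optimizer of the objective (e.g. the bound of Theorem 3), and all names are illustrative:

```python
import numpy as np

def optimize_with_restarts(objective, minimize, dim, n_restarts=6, rng=None):
    """Run a local optimizer from several random starting points in
    [-10, 10]^dim and keep the best solution found.

    minimize(objective, u0) -> (u, value) is any local optimization
    routine. Restarting is a standard remedy when the objective is
    non-convex and a single run may be trapped in a local minimum.
    """
    rng = np.random.default_rng(rng)
    best_u, best_val = None, np.inf
    for _ in range(n_restarts):
        u0 = rng.uniform(-10, 10, size=dim)   # random initial weights
        u, val = minimize(objective, u0)
        if val < best_val:
            best_u, best_val = u, val
    return best_u, best_val
```

Keeping only the lowest objective value across restarts trades extra computation (here, 2-6 runs) for robustness against bad initializations.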
6.2 Experiment I: deriving a general classification tool
Classification accuracy (\(\%\pm \)std) on the UCI database, with the one-versus-rest scheme
Dataset  TSVM (Joachims 1999)  SVM (Chang and Lin 2011)  LMKL (Gönen and Alpaydin 2008)  PBGD3 (Germain et al. 2009)  SFM-GMM

Breast cancer  \(\mathbf{96.91 } \pm \mathbf{1.32 }\)  \(96.79 \pm 1.79\)  \(96.41\pm 0.97\)  \(93.98\pm 1.52\)  \(95.26\pm 0.93\) 
Breast tissue  \(88.25\pm 5.74\)  \(83.37\pm 4.31\)  \(87.69\pm 5.24\)  \(88.14\pm 4.50\)  \(\mathbf{89.61 } \pm \mathbf{3.84 }\) 
Wine  \(95.61\pm 2.46\)  \(\mathbf{97.36 } \pm \mathbf{1.94 }\)  \(95.48\pm 4.10\)  \(92.22\pm 8.56\)  \(96.11\pm 1.38\) 
Sonar  \(75.29\pm 4.83\)  \(74.45\pm 3.22\)  \(80.21\pm 1.52\)  \(75.52\pm 5.70\)  \(\mathbf{81.54 } \pm \mathbf{2.93 }\) 
Credit approval  \(84.01\pm 1.72\)  \(84.61\pm 1.83\)  \(81.92\pm 1.41\)  \(83.53\pm 1.82\)  \(\mathbf{85.06 } \pm \mathbf{1.31 }\) 
SPECTF heart  \(78.27\pm 1.05\)  \(76.56\pm 2.97\)  \(80.38\pm 3.40\)  \(79.70\pm 0.65\)  \(\mathbf{81.34 } \pm \mathbf{0.46 }\) 
Libras movement  \(95.47\pm 2.12\)  \(91.74\pm 3.14\)  \(\mathbf{96.58 } \pm \mathbf{1.78 }\)  \(94.52\pm 2.80\)  \(95.97\pm 3.12\) 
Steel plates faults  \(88.60\pm 8.94\)  \(86.52\pm 9.03\)  \(\mathbf{92.63 } \pm \mathbf{8.14 }\)  \(87.30\pm 8.26\)  \(90.24\pm 8.23\) 
Figure 4b demonstrates that when the amount of labeled data is small, introducing unlabeled data yields improved performance, particularly when only \(2\sim 10\) percent of the labeled data are used in training. As the amount of labeled data increases, the benefit of unlabeled data diminishes. Increasing the amount of unlabeled data in semi-supervised training produces a performance benefit particularly when the amount of labeled data is small, as shown in Fig. 4c. The benefit of unlabeled data diminishes when a significant amount of labeled data is present in the training set, because the labeled examples have an increasingly dominant effect.
This experiment shows that the proposed stochastic feature mapping and the feedback tuning mechanism in our approach yield improvements for the general class of Gaussian mixture models in classification.
6.3 Experiment II: scene recognition using LDA
We evaluate our SFM method and compare its performance against comparable methods on a scene recognition task popular in computer vision. The distribution of a collection of visual words, typically informative image patterns or cluster centers of image pattern descriptors, has been found to be informative in this task. Such a visual-word representation is relatively robust against topic variation and spatial position variation. We use latent Dirichlet allocation (LDA) (Blei et al. 2003) to model the distributions of visual words, and derive a recognition tool with our proposed framework. As in Griffiths and Steyvers (2004), we sample the topic variable using collapsed Gibbs sampling and reject examples according to the rule in Eq. (17). We fix the LDA model's parameter \(\alpha \) and allow \(\beta \) (Griffiths and Steyvers 2004) to be updated. Note that \(\alpha \) is the parameter of the distribution over the mixture of topics, or scene, and \(\beta \) is the parameter of the distribution over topics.
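The collapsed Gibbs sampling step of Griffiths and Steyvers (2004) resamples one token's topic from its conditional distribution \(p(z=k \mid \text{rest}) \propto (n_{dk}+\alpha)(n_{kw}+\beta)/(n_k+V\beta)\). A minimal sketch of that single-token update, with hypothetical count-table names:

```python
import numpy as np

def resample_topic(w, d, z_old, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """Collapsed Gibbs update for one token of word w in document d.

    n_dk[d, k]: tokens in document d assigned to topic k
    n_kw[k, w]: tokens of word w assigned to topic k
    n_k[k]    : total tokens assigned to topic k; V is vocabulary size.
    """
    # Remove the token's current assignment from the count tables.
    n_dk[d, z_old] -= 1; n_kw[z_old, w] -= 1; n_k[z_old] -= 1
    # Conditional p(z = k | everything else), up to normalization.
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    z_new = rng.choice(len(n_k), p=p / p.sum())
    # Add the new assignment back into the count tables.
    n_dk[d, z_new] += 1; n_kw[z_new, w] += 1; n_k[z_new] += 1
    return z_new
```

Sweeping this update over all tokens until the chain mixes yields posterior samples of the topic assignments; the paper's rejection rule (Eq. 17) is then applied on top of such samples.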
Accuracy (\(\%\pm \)std.) of one-versus-rest scene recognition
SCENE  PHOW (Vedaldi et al. 2009)  LDA-MAP  FS-LDA (Jaakkola and Haussler 1999)  FESS-LDA (Perina et al. 2012)  SFM-LDA

Coast  \(90.66\pm 0.65\)  \(83.85\pm 0.92\)  \(90.42\pm 0.34\)  \(93.89\pm 0.46\)  \(\mathbf{94.56 } \pm \mathbf{0.61 }\) 
Forest  \(96.49\pm 0.39\)  \(94.94\pm 0.46\)  \(94.45\pm 0.46\)  \(97.92\pm 0.26\)  \(\mathbf{98.15 } \pm \mathbf{0.34 }\) 
Mountain  \(92.58\pm 0.64\)  \(84.99\pm 1.78\)  \(88.62\pm 0.50\)  \(93.29\pm 0.47\)  \(\mathbf{93.97 } \pm \mathbf{0.41 }\) 
Country  \(\mathbf{91.38 } \pm \mathbf{0.71 }\)  \(72.30\pm 1.74\)  \(87.40\pm 0.46\)  \(90.62\pm 0.33\)  \(90.81\pm 0.63\) 
Highway  \(95.27\pm 0.49\)  \(81.50\pm 1.28\)  \(92.48\pm 0.22\)  \(94.67\pm 0.34\)  \(\mathbf{96.18 } \pm \mathbf{0.27 }\) 
InsideCity  \(93.96\pm 0.62\)  \(85.14\pm 1.74\)  \(90.79\pm 0.14\)  \(94.26\pm 0.65\)  \(\mathbf{95.81 } \pm \mathbf{0.37 }\) 
Street  \(93.89\pm 0.64\)  \(76.46\pm 1.23\)  \(93.76\pm 0.24\)  \(94.21\pm 0.42\)  \(\mathbf{95.40 } \pm \mathbf{0.45 }\) 
Building  \(94.40\pm 0.49\)  \(87.85\pm 0.55\)  \(92.83\pm 0.57\)  \(96.06\pm 0.51\)  \(\mathbf{96.39 } \pm \mathbf{0.44 }\) 
Figure 5a compares the performance of the three methods (our SFM-LDA, FESS-LDA and FS-LDA) as a function of the number of topics used in the model, for the binary classification of the "highway" category against all other categories. The models are trained with 50 % of the labeled data and tested with the rest. The results show that both SFM and FESS are better than FS in this case, that SFM has a performance advantage over FESS when a small number of topics is used (5-20), and that their performances converge at 30 topics. Figure 5b compares training the models with unlabeled data against not using any unlabeled data at all. 25 %, or 672 images, of the dataset are used as unlabeled data, i.e. without using the labels of the images. Training with unlabeled data yields a significant benefit when the amount of labeled data is relatively small, i.e. up to 268 images. As more labeled data are used, the overall performance of the classifier continues to improve, but the benefit of training with unlabeled data disappears because the classifier relies more and more on the labeled data. Figure 5c demonstrates this trend from a different perspective.
6.4 Experiment III: protein classification using HMMs
An advantage of the stochastic feature mapping is that it can map structured input data of variable length into a feature vector in a fixed-dimensional feature space. To demonstrate this, we apply our proposed framework to remote homology recognition in molecular biology. The problem is, given a test protein sequence, to assign it to one of the domain superfamilies defined in the SCOP (1.53) taxonomy tree according to the functions of the proteins. The protein sequence data are obtained from the ASTRAL database. An E-value threshold of \(10^{-25}\) is applied to the database to reduce similar sequences. We use four labeled domain superfamilies, i.e. metabolism, information, intracellular processes and extracellular processes, in our evaluation. The numbers of sequences are 804, 950, 695 and 992 respectively. Each protein sequence is a string composed of 22 distinct letters, and the string length varies from 20 to 994.
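For comparison, the simplest way to obtain fixed-length vectors from such variable-length strings is the 2-gram count representation (the 2-GRAM baseline in the table below is in this spirit). A minimal sketch; the 22-letter alphabet string here is a hypothetical stand-in for the dataset's actual alphabet:

```python
import numpy as np

# Hypothetical 22-letter protein alphabet (20 standard amino acids
# plus two extra symbols); the real dataset's alphabet may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYXB"

def two_gram_features(seq, alphabet=ALPHABET):
    """Map a variable-length sequence to a fixed |A|^2-dim vector of
    normalized 2-gram counts, giving fixed-length input for a
    discriminative classifier regardless of sequence length."""
    index = {c: i for i, c in enumerate(alphabet)}
    k = len(alphabet)
    counts = np.zeros(k * k)
    for a, b in zip(seq, seq[1:]):            # all adjacent letter pairs
        counts[index[a] * k + index[b]] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Unlike this count-based baseline, the SFM-HMM mapping uses the HMM's hidden-state posteriors, so it captures longer-range sequential structure while still producing a fixed-dimensional vector.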
Accuracy (\(\%\pm \)std.) of one-versus-rest protein recognition
SUP. FAM.  2-GRAM  HMM-MAP  FS-HMM (Jaakkola and Haussler 1999)  FESS-HMM (Perina et al. 2012)  SFM-HMM

# 1  \(78.79\pm 1.13\)  \(80.91\pm 1.53\)  \(80.03\pm 0.78\)  \(80.12\pm 0.84\)  \(\mathbf{83.43 } \pm \mathbf{0.91 }\) 
# 2  \(79.01\pm 0.97\)  \(80.10\pm 0.51\)  \(77.56\pm 0.64\)  \(78.96\pm 0.59\)  \(\mathbf{84.16 } \pm \mathbf{0.60 }\) 
# 3  \(75.19\pm 0.86\)  \(77.92\pm 0.79\)  \(73.31\pm 0.21\)  \(73.35\pm 0.41\)  \(\mathbf{80.12 } \pm \mathbf{0.54 }\) 
# 4  \(96.01\pm 0.33\)  \(95.10\pm 0.39\)  \(94.27\pm 0.37\)  \(\mathbf{97.58 } \pm \mathbf{0.13 }\)  \(96.89\pm 0.35\) 
6.5 Discussions
6.5.1 Generalization bound and performance
The proposed learning approach for the stochastic feature mapping is based on the minimization of a generalization bound. Even though the generalization bound is not always tight, the proposed approach shows promising attributes, primarily because it exploits hidden variables and a feedback mechanism based on the generalization bound, namely tuning the generative models and the feature mapping according to classification results.
6.5.2 Semisupervised versus supervised
The above experiments also invite a comparison between semi-supervised and supervised learning. It is worth noting that the semi-supervised scheme uses the same labeled examples as the supervised scheme, but exploits additional unlabeled examples to train the generative model and reduce the classification variance. The unlabeled examples are particularly helpful when the number of labeled examples is small, and seldom degrade the classification. Thus, in our experiments, the semi-supervised scheme usually outperforms the supervised scheme.
7 Conclusions
This paper presents a new approach to integrating generative models and discriminative models for classification under the PAC-Bayes theoretical framework. The bridge for this integration is a stochastic feature mapping derived from the negative free energy function for exponential family models. This feature mapping is an explicit function over the hidden and observed variables, but not over the parameters of the generative models. This allows the update of the generative models to be independent of the feature mapping, as if in an uncoupled system, so that the SFM scheme can be easily and flexibly coupled to many types of generative models, greatly increasing the versatility of the framework. Under this framework, the generative model and the discriminative model form a closed loop, with the stochastic feature mapping being tuned in the feed-forward path to improve the discriminative classifier, and the classification performance in the feedback path tuning the generative models. This innovation makes the classifier more flexible and adaptive, yielding state-of-the-art results in many application scenarios. Another innovation of this work is the derivation of a PAC-Bayes bound for semi-supervised learning. This allows the generative models to learn from both labeled and unlabeled data, significantly enhancing the ability of the classifier when labeled data are limited.
We performed three experiments on distinct datasets from medicine, computer vision, and molecular biology, and demonstrated a number of advantages offered by this framework over existing approaches. In particular, because our method allows the fine-tuning of the generative models, and consequently the feature mapping function, based on classification results, it is versatile and adaptive to the data. This leads to a more efficient generative model that can explain the data with a small capacity, as well as a more effective classifier that yields consistent state-of-the-art performance across multiple datasets. We demonstrated that when there is a limited amount of training data, this framework can capitalize on the strength of the generative models to learn from unlabeled data and tune the feature mapping to achieve better classifier performance. We further demonstrated in our applications that SFM can be coupled to a variety of generative models, including GMM, LDA and HMM. A major remaining difficulty is the non-convexity of the objective function, which can trap the solution in local minima. We have adopted a multiple-initialization (seeding) strategy to remedy the situation, and have achieved good results.
Nevertheless, we expect that more robust and efficient optimization methods will likely yield better performance, and that incremental or parallel learning algorithms could scale the approach to large datasets. Coupling the proposed framework with tighter PAC-Bayes bounds is also left as future work.
References
Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1), 164–171.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27.
Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). The devil is in the details: An evaluation of recent feature encoding methods. In British machine vision conference (pp. 76.1–76.12).
Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2009). PAC-Bayesian learning of linear classifiers. In International conference on machine learning (pp. 353–360).
Gönen, M., & Alpaydin, E. (2008). Localized multiple kernel learning. In International conference on machine learning (pp. 352–359).
Graça, J., Ganchev, K., & Taskar, B. (2007). Expectation maximization and posterior constraints. Advances in Neural Information Processing Systems, 20, 569–576.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235.
Holub, A., Welling, M., & Perona, P. (2008). Hybrid generative-discriminative visual categorization. International Journal of Computer Vision, 77(1), 239–258.
Jaakkola, T., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, 11, 487–493.
Jaakkola, T., Meila, M., & Jebara, T. (1999). Maximum entropy discrimination. MIT Technical Report AITR-1668.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In International conference on machine learning (pp. 200–209). Bled, Slovenia.
Joachims, T. (2003). Transductive learning via spectral graph partitioning. In International conference on machine learning (pp. 290–297).
Jordan, M., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183–233.
Keshet, J., McAllester, D., & Hazan, T. (2011). PAC-Bayesian approach for minimization of phoneme error rate. In IEEE conference on acoustics, speech and signal processing (pp. 2224–2227).
Lacasse, A., Laviolette, F., Marchand, M., Germain, P., & Usunier, N. (2006). PAC-Bayes bounds for the risk of the majority vote and the variance of the Gibbs classifier. Advances in Neural Information Processing Systems, 19, 769–776.
Langford, J. (2006). Tutorial on practical prediction theory for classification. Journal of Machine Learning Research, 6(1), 273.
Li, X., Lee, T. S., & Liu, Y. (2011). Hybrid generative-discriminative classification using posterior divergence. In IEEE conference on computer vision and pattern recognition (pp. 2713–2720).
Li, X., Wang, B., Liu, Y., & Lee, T. S. (2013). Learning discriminative sufficient statistics score space for classification. In European conference on machine learning (pp. 49–64).
Li, X., Zhao, X., Fu, Y., & Liu, Y. (2010). Bimodal gender recognition from face and fingerprint. In IEEE conference on computer vision and pattern recognition (pp. 2590–2597).
Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
McAllester, D. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3), 355–363.
McAllester, D. (2003). Simplified PAC-Bayesian margin bounds. In Learning theory and kernel machines (pp. 203–215). New York: Springer.
McCallum, A., Pal, C., Druck, G., & Wang, X. (2006). Multi-conditional learning: Generative/discriminative training for clustering and classification. National Conference on Artificial Intelligence, 21(1), 433.
Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14, 841–848.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Perina, A., Cristani, M., Castellani, U., Murino, V., & Jojic, N. (2012). Free energy score spaces: Using generative information in discriminative classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 1249–1262.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.
Raina, R., Shen, Y., Ng, A., & McCallum, A. (2003). Classification with hybrid generative/discriminative models. Advances in Neural Information Processing Systems, 16, 545–552.
Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classification. Journal of Machine Learning Research, 3, 233–269.
Seldin, Y., Cesa-Bianchi, N., Auer, P., Laviolette, F., & Shawe-Taylor, J. (2012). PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In JMLR: Workshop and conference proceedings (no. 26, pp. 98–111).
Smith, N., & Gales, M. (2002). Speech recognition using SVMs. Advances in Neural Information Processing Systems, 14, 1197–1204.
Tolstikhin, I., & Seldin, Y. (2013). PAC-Bayes-Empirical-Bernstein inequality. Advances in Neural Information Processing Systems, 26, 109–117.
Tsuda, K., Kawanabe, M., Rätsch, G., Sonnenburg, S., & Müller, K. (2002). A new discriminative kernel from probabilistic models. Neural Computation, 14(10), 2397–2414.
Vapnik, V. (2000). The nature of statistical learning theory. Berlin: Springer.
Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In IEEE international conference on computer vision (pp. 606–613).
Yu, C.-N. J., & Joachims, T. (2009). Learning structural SVMs with latent variables. In International conference on machine learning (pp. 1169–1176).