
Adaptive semi-supervised learning on labeled and unlabeled data with different distributions

  • Regular Paper
Knowledge and Information Systems

Abstract

Developing methods for designing good classifiers from labeled samples whose distribution differs from that of the test samples is an important and challenging research issue in machine learning and its applications. This paper focuses on designing semi-supervised classifiers with high generalization ability by using unlabeled samples drawn from the same distribution as the test samples, and presents a semi-supervised learning method based on a hybrid discriminative and generative model. Although JESS-CM is one of the most successful semi-supervised classifier design frameworks based on a hybrid approach, it suffers from an overfitting problem in the task setting that we consider in this paper. We propose an objective function that uses both labeled and unlabeled samples for the discriminative training of hybrid classifiers, with the aim of mitigating this overfitting problem. We show the effect of the objective function through theoretical analysis and empirical evaluation. Our experimental results for text classification on four typical benchmark test collections confirmed that, in our task setting, the proposed method outperformed the JESS-CM framework in most cases. We also confirmed experimentally that the proposed method was useful for obtaining better performance when classifying data samples into known classes, which are included in the given labeled samples, or unknown classes, which are not.


Notes

  1. Although the JESS-CM framework was applied in the original papers to structured labeling tasks such as sequence labeling and dependency parsing, we review it here for multi-class, single-label problems so that the difference between the JESS-CM hybrid framework and our proposed method can be discussed simply.

  2. The original JESS-CM classifiers are constructed by using multiple generative models. Since the method for combining and training the discriminative function and generative models does not depend on the number of generative models, \(J\), we present the JESS-CM framework with \(J=1\) to simplify the discussion.

  3. In our experiments, we employed fixed initial values computed by using labeled and unlabeled samples, as described in Sect. 5.2.

  4. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz.

  5. http://www.cs.umass.edu/~mccallum/data/cora-classify.tar.gz.

  6. http://www.cs.umass.edu/~mccallum/data/sraa.tar.gz.

  7. http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz.

  8. The latest version of UniverSVM can be downloaded from http://mloss.org/software/view/19/.

  9. With our experimental settings, where the number of labeled samples was smaller than the number of unlabeled samples (e.g. \(N=500\) vs. \(M=2500\)), the number of vocabulary words appearing in the labeled document set, \(V_l\), was usually smaller than the number appearing in the unlabeled document set, \(V_u\). Therefore, \(r_l\) was larger than \(r_u\), as shown in Table 1. The difference between \(V_l\) and \(V_u\) also implied that \(V_l+V_u-V_b\) was close to \(V_u\). Therefore, \(r_a\) was close to \(r_u\).

References

  1. Agarwal A, Daumé III H (2009) Exponential family hybrid semi-supervised learning. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 974–979

  2. Ando RK, Zhang T (2005) A framework for learning predictive structures from multiple tasks and unlabeled data. J Mach Learn Res 6:1817–1853


  3. Bickel S, Brückner M, Scheffer T (2007) Discriminative learning for differing training and test distributions. In: Proceedings of the 24th international conference on machine learning (ICML 2007), pp 81–88

  4. Blitzer J, Foster D, Kakade S (2011) Domain adaptation with coupled subspaces. In: Proceedings of the 14th international conference on artificial intelligence and statistics (AISTATS 2011), pp 173–181

  5. Blitzer J, McDonald R, Pereira F (2006) Domain adaptation with structural correspondence learning. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 120–128

  6. Bouchard G (2007) Bias-variance tradeoff in hybrid generative-discriminative models. In: Proceedings of the sixth international conference on machine learning and applications (ICMLA’07), pp 124–129

  7. Bouchard G, Triggs B (2004) The tradeoff between generative and discriminative classifiers. In: Proceedings of the IASC international symposium on computational statistics, pp 721–728

  8. Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge


  9. Chen SF, Rosenfeld R (1999) A Gaussian prior for smoothing maximum entropy models. Carnegie Mellon University, Technical report

  10. Collobert R, Sinz F, Weston J, Bottou L (2006) Large scale transductive SVMs. J Mach Learn Res 7:1687–1712


  11. Dai W, Xue GR, Yang Q, Yu Y (2007) Transferring naive Bayes classifiers for text classification. In: Proceedings of the 22nd national conference on artificial intelligence (AAAI-07), pp 540–545

  12. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38


  13. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30


  14. Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42:143–175


  15. Druck G, McCallum A (2010) High-performance semi-supervised learning using discriminatively constrained generative models. In: Proceedings of the 27th international conference on machine learning (ICML 2010), pp 319–326

  16. Druck G, Pal C, Zhu X, McCallum A (2007) Semi-supervised classification with hybrid generative/discriminative methods. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07), pp 280–289

  17. Fujino A, Ueda N, Nagata M (2010) A robust semi-supervised classification method for transfer learning. In: Proceedings of the 19th ACM international conference on information and knowledge management (CIKM’10), pp 379–388

  18. Fujino A, Ueda N, Saito K (2008) Semi-supervised learning for a hybrid generative/discriminative classifier based on the maximum entropy principle. IEEE Trans Pattern Anal Mach Intell (TPAMI) 30(3): 424–437


  19. Grandvalet Y, Bengio Y (2005) Semi-supervised learning by entropy minimization. In: Saul LK, Weiss Y, Bottou L (eds) Advances in neural information processing systems 17. MIT Press, Cambridge, pp 529–536

  20. Jiang J (2007) A literature survey on domain adaptation of statistical classifiers. http://sifaka.cs.uiuc.edu/jiang4/domain_adaptation/survey/

  21. Jiang J, Zhai C (2007) Instance weighting for domain adaptation in NLP. In: Proceedings of the 45th annual meeting of the association of computational linguistics (ACL-07), pp 264–271

  22. Lasserre JA, Bishop CM, Minka TP (2006) Principled hybrids of generative and discriminative models. In: Proceedings of the 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), pp 87–94

  23. Liang P, Jordan MI (2008) An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In: Proceedings of the 25th international conference on machine learning (ICML 2008), pp 584–591

  24. Ling X, Dai W, Xue GR, Yang Q, Yu Y (2008) Spectral domain-transfer learning. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’08), pp 488–496

  25. Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math Program Ser B 45(3):503–528


  26. Nigam K, McCallum A, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39:103–134


  27. Pan SJ, Tsang IW, Kwok JT, Yang Q (2009) Domain adaptation via transfer component analysis. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1187–1192

  28. Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359


  29. Seeger M (2001) Learning with labeled and unlabeled data. University of Edinburgh, Technical report

  30. Shimodaira H (2000) Improving predictive inference under covariate shift by weighting the log-likelihood function. J Stat Plan Inference 90(2):227–244


  31. Sugiyama M, Müller KR (2005) Input-dependent estimation of generalization error under covariate shift. Stat Decis 23(4):249–279


  32. Suzuki J, Isozaki H (2008) Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In: Proceedings of the 46th annual meeting of the association of computational linguistics (ACL-08), pp 665–673

  33. Suzuki J, Isozaki H, Carreras X, Collins M (2009) An empirical study of semi-supervised structured conditional models for dependency parsing. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 551–560

  34. Vapnik V (1999) The nature of statistical learning theory, 2nd edn. Springer, New York


  35. Wang Z, Song Y, Zhang C (2009) Knowledge transfer on hybrid graph. In: Proceedings of the 21st international joint conference on artificial intelligence (IJCAI-09), pp 1291–1296

  36. Wu D, Lee WS, Ye N, Chieu HL (2009) Domain adaptive bootstrapping for named entity recognition. In: Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP 2009), pp 1523–1532

  37. Zadrozny B (2004) Learning and evaluating classifiers under sample selection bias. In: Proceedings of the 21st international conference on machine learning (ICML 2004), pp 114–121

  38. Zhu X (2005) Semi-supervised learning literature survey. Technical report, University of Wisconsin


Author information

Corresponding author

Correspondence to Akinori Fujino.

Appendix


1.1 Derivation of objective function for parameter estimation

We derive Eq. (14) from Eq. (12). By substituting Eq. (7) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for labeled samples, \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\), and substituting Eq. (13) for \(P(k|{\varvec{{x}}})\) in Eq. (9) for unlabeled samples, \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), we can transform Eq. (8) to

$$\begin{aligned} J_d(W)&= \sum _{n=1}^N \log P_d(y_n|{\varvec{{x}}}_n;W)\nonumber \\&+ \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;W,\Theta ,\beta ) \log \frac{P_d(k|{\varvec{{x}}}_m;W)}{P(k|{\varvec{{x}}}_m;W,\Theta ,\beta )} + \log p(W) \nonumber \\&= \sum _{n=1}^N \log P_d(y_n|{\varvec{{x}}}_n;W) + \sum _{m=1}^M \log \sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W)p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k)^{\beta } \nonumber \\&- \beta \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;W,\Theta ,\beta ) \log p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k) + \log p(W). \end{aligned}$$
(24)

By substituting Eqs. (7) and (13) for \(P(k|{\varvec{{x}}})\) in Eq. (11) for \(D_l = \{({\varvec{{x}}}_n,y_n)\}_{n=1}^N\) and \(D_u =\{{\varvec{{x}}}_m\}_{m=1}^M\), respectively, we can transform Eq. (10) to

$$\begin{aligned} J_g(\Theta )&= \sum _{n=1}^N \log p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n}) + \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;W,\Theta ,\beta ) \log p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k) + \log p(\Theta ).\nonumber \\ \end{aligned}$$
(25)

By substituting these equations for \(J_d(W)\) and \(J_g(\Theta )\) in Eq. (12), we can obtain Eq. (14).
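
As a worked step, the substitution can be sketched as follows. This assumes that Eq. (12), which is not reproduced in this appendix, combines the two objectives as \(J(\Psi ) = J_d(W) + \beta J_g(\Theta )\); that form is inferred from the terms appearing in Eq. (26) and should be checked against the main text. Under this assumption, the posterior-weighted term \(-\beta \sum _{m}\sum _{k} P(k|{\varvec{{x}}}_m;W,\Theta ,\beta ) \log p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k)\) of \(J_d(W)\) cancels against \(\beta \) times the corresponding posterior-weighted term of \(J_g(\Theta )\), leaving

$$\begin{aligned} J_d(W) + \beta J_g(\Theta )&= \sum _{n=1}^N \log P_d(y_n|{\varvec{{x}}}_n;W) + \beta \sum _{n=1}^N \log p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n}) \nonumber \\&+ \sum _{m=1}^M \log \sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W)p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k)^{\beta } + \log p(W) + \beta \log p(\Theta ), \end{aligned}$$

which matches, term by term, the quantities whose differences appear on the right-hand side of Eq. (26).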

1.2 Proof of inequality about \(Q\)-function

We prove the inequality, \(J(\Psi ) - J (\Psi ^{(t)}) \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)})\), described in Sect. 4.2. From Eq. (14), we can obtain the equation,

$$\begin{aligned}&J(\Psi )- J (\Psi ^{(t)}) \nonumber \\&\quad = \log \frac{p(W)}{p(W^{(t)})} + \beta \log \frac{p(\Theta )}{p(\Theta ^{(t)})} + \sum _{n=1}^N \log \frac{P_d(y_n|{\varvec{{x}}}_n;W)p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n})^{\beta }}{P_d(y_n|{\varvec{{x}}}_n;W^{(t)})p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n}^{(t)})^{\beta }} \nonumber \\&\quad + \sum _{m=1}^M \log \frac{\sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W)p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k)^{\beta }}{\sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W^{(t)})p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k^{(t)})^{\beta }} \nonumber \\&\quad = \log \frac{p(W)}{p(W^{(t)})} + \beta \log \frac{p(\Theta )}{p(\Theta ^{(t)})} + \sum _{n=1}^N \log \frac{P_d(y_n|{\varvec{{x}}}_n;W)}{P_d(y_n|{\varvec{{x}}}_n;W^{(t)})} \frac{p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n})^{\beta }}{p_g({\varvec{{x}}}_n,y_n;\varvec{\theta }_{y_n}^{(t)})^{\beta }} \nonumber \\&\quad + \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) \log \frac{P_d(k|{\varvec{{x}}}_m;W)}{P_d(k|{\varvec{{x}}}_m;W^{(t)})} \frac{p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k)^{\beta }}{p_g({\varvec{{x}}}_m,k;\varvec{\theta }_k^{(t)})^{\beta }} \frac{P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta )}{P(k|{\varvec{{x}}}_m;\Psi ,\beta )}.\nonumber \\ \end{aligned}$$
(26)

According to Eqs. (15)–(17), we can transform the above equation to

$$\begin{aligned}&J(\Psi ) - J (\Psi ^{(t)}) \nonumber \\&\quad = Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)}) + \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) \log \frac{P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta )}{P(k|{\varvec{{x}}}_m;\Psi ,\beta )}. \end{aligned}$$
(27)

Since \(\log b \le b - 1\) for any \(b > 0\), and hence \(\log (1/b) \ge 1 - b\), and since \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ,\beta ) = 1\) and \(\sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) = 1\),

$$\begin{aligned}&J(\Psi ) - J (\Psi ^{(t)}) \nonumber \\&\quad \ge Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)}) + \sum _{m=1}^M \sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) \left\{ 1- \frac{P(k|{\varvec{{x}}}_m;\Psi ,\beta )}{P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta )} \right\} \nonumber \\&\quad = Q(\Psi , \Psi ^{(t)}) - Q(\Psi ^{(t)}, \Psi ^{(t)}). \end{aligned}$$
(28)
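
The residual term in Eq. (27) has the form of a Kullback-Leibler divergence between the class posteriors at \(\Psi ^{(t)}\) and at \(\Psi \), which is why it can be dropped to obtain the inequality. The following is a minimal numerical sketch of that fact, not the authors' implementation. The helper names (`hybrid_posterior`, `kl`, `random_simplex`) and the randomly drawn values standing in for \(P_d\) and \(p_g\) are assumptions made purely for illustration; the posterior is assumed to be proportional to \(P_d(k|{\varvec{{x}}};W)\,p_g({\varvec{{x}}},k;\varvec{\theta }_k)^{\beta }\), as implied by Eq. (24).

```python
import math
import random


def hybrid_posterior(p_d, p_g, beta):
    """Class posterior proportional to P_d(k|x) * p_g(x,k)**beta (assumed form)."""
    scores = [d * (g ** beta) for d, g in zip(p_d, p_g)]
    z = sum(scores)
    return [s / z for s in scores]


def kl(p, q):
    """sum_k p_k * log(p_k / q_k): the residual term of Eq. (27) for one sample."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def random_simplex(rng, k):
    """Draw a strictly positive probability vector of length k."""
    raw = [rng.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


rng = random.Random(0)
K, beta = 4, 0.7  # hypothetical number of classes and combination weight
min_gap = float("inf")
for _ in range(1000):
    # Hypothetical "old" (t) and "new" parameter settings, represented directly
    # by random discriminative posteriors and positive generative joint values.
    post_old = hybrid_posterior(
        random_simplex(rng, K), [rng.random() + 1e-3 for _ in range(K)], beta
    )
    post_new = hybrid_posterior(
        random_simplex(rng, K), [rng.random() + 1e-3 for _ in range(K)], beta
    )
    min_gap = min(min_gap, kl(post_old, post_new))

# The smallest residual over all trials should not be negative.
print(min_gap >= 0.0)
```

Because the residual is nonnegative for every unlabeled sample, any \(\Psi \) that increases \(Q(\Psi , \Psi ^{(t)})\) also increases \(J(\Psi )\) by at least the same amount.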

1.3 Proof that \(g_d\) is Concave

If the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) shown in Eq. (16) is negative semidefinite, \(g_d (W,\Psi ^{(t)})\) is a concave function with respect to \(W\). We prove that the Hessian matrix of \(g_d (W,\Psi ^{(t)})\) is negative semidefinite when applying the MLR model and Gaussian prior described in Sect. 4.4.

Using the MLR model and Gaussian prior, \(P_d(y|{\varvec{{x}}};W) = \exp \left( {\varvec{{w}}}_y^T {\varvec{{x}}}\right)/\sum _{k=1}^K \exp \left( {\varvec{{w}}}_k^T {\varvec{{x}}}\right)\) and \(p(W) = \prod _{k=1}^K \exp \left(-{\varvec{{w}}}_k^T {\varvec{{w}}}_k / 2 \sigma ^2\right)\), the objective function, \(g_d (W,\Psi ^{(t)})\), shown in Eq. (16) is rewritten as

$$\begin{aligned} g_d (W;\Psi ^{(t)})&= - \sum _{k=1}^K \frac{{\varvec{{w}}}_k^T {\varvec{{w}}}_k}{2 \sigma ^2} + \sum _{n=1}^N \left\{ {\varvec{{w}}}_{y_n}^T {\varvec{{x}}}_n - \log \sum _{k=1}^K \exp ({\varvec{{w}}}_k^T {\varvec{{x}}}_n) \right\} \nonumber \\&+ \sum _{m=1}^M \left\{ \sum _{k=1}^K P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) {\varvec{{w}}}_k^T {\varvec{{x}}}_m - \log \sum _{k=1}^K \exp ({\varvec{{w}}}_k^T {\varvec{{x}}}_m) \right\} . \end{aligned}$$
(29)

To obtain the Hessian matrix \(\left[\partial ^2 g_d/\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^{\prime }}^T\right]_{k,k^{\prime }}\) of \(g_d\), we partially differentiate \(g_d\) with respect to \({\varvec{{w}}}_k\) such that

$$\begin{aligned} \frac{\partial g_d}{\partial {\varvec{{w}}}_k}&= - \frac{{\varvec{{w}}}_k}{\sigma ^2} + \sum _{n=1}^N \left\{ I_{y_n} (k) - P_d(k|{\varvec{{x}}}_n;W)\right\} {\varvec{{x}}}_n \nonumber \\&+ \sum _{m=1}^M \left\{ P(k|{\varvec{{x}}}_m;\Psi ^{(t)},\beta ) - P_d(k|{\varvec{{x}}}_m;W)\right\} {\varvec{{x}}}_m, \end{aligned}$$
(30)

where \(I_{y_n} (k)\) is an indicator function that equals 1 when \(k = y_n\) and 0 when \(k \ne y_n\). Then, we partially differentiate \(\partial g_d/\partial {\varvec{{w}}}_k\) with respect to \({\varvec{{w}}}_{k^{\prime }}\) such that

$$\begin{aligned} \frac{\partial ^{2} g_{d}}{\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^\prime }^{T}}&= - \frac{1}{\sigma ^2} I_k (k^{\prime }) {\varvec{{I}}}_V - \sum _{n=1}^N P_d(k|{\varvec{{x}}}_n;W) \left\{ I_{k} (k^\prime ) - P_d(k^\prime |{\varvec{{x}}}_n;W)\right\} {\varvec{{x}}}_n {\varvec{{x}}}_n^T \nonumber \\&- \sum _{m=1}^M P_d(k|{\varvec{{x}}}_m;W) \left\{ I_{k} (k^\prime ) - P_d(k^\prime |{\varvec{{x}}}_m;W)\right\} {\varvec{{x}}}_m {\varvec{{x}}}_{m}^{T}, \end{aligned}$$
(31)

where \({\varvec{{I}}}_V\) is the \((V \times V)\)-dimensional identity matrix, and \(V\) is consistent with the dimension of \({\varvec{{w}}}_k\). Then, for arbitrary \(VK\)-dimensional vector \({\varvec{{u}}}=({\varvec{{u}}}_{1}^{T},\ldots ,{\varvec{{u}}}_{k}^{T},\ldots ,{\varvec{{u}}}_{K}^{T})^T\), where \({\varvec{{u}}}_k = (u_{k1},\ldots ,u_{ki},\ldots ,u_{kV})^{T}\),

$$\begin{aligned}&{\varvec{{u}}}^T \left[ \frac{\partial ^{2} g_d}{\partial {\varvec{{w}}}_{k} \partial {\varvec{{w}}}_{k^\prime }^{T}} \right]_{k,k^\prime }{\varvec{{u}}}\nonumber \\&\quad = - \sum _{k=1}^K \frac{{\varvec{{u}}}_{k}^{T} {\varvec{{u}}}_{k}}{\sigma ^{2}} - \sum _{n=1}^N \sum _{k=1}^K P_d(k|{\varvec{{x}}}_n;W) {\varvec{{u}}}_{k}^{T} {\varvec{{x}}}_n \left\{ {\varvec{{x}}}_{n}^{T} {\varvec{{u}}}_k - \sum _{k^\prime =1}^{K} P_d (k^\prime |{\varvec{{x}}}_{n};W) {\varvec{{x}}}_{n}^{T} {\varvec{{u}}}_{k^\prime }\right\} \nonumber \\&\quad - \sum _{m=1}^M \sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W) {\varvec{{u}}}_k^T {\varvec{{x}}}_m \left\{ {\varvec{{x}}}_m^T {\varvec{{u}}}_k - \sum _{k^\prime =1}^K P_d (k^\prime |{\varvec{{x}}}_m;W) {\varvec{{x}}}_{m}^{T}{\varvec{{u}}}_{k^\prime }\right\} \nonumber \\&\quad = - \sum _{k=1}^K \frac{{\varvec{{u}}}_{k}^{T} {\varvec{{u}}}_k}{\sigma ^{2}} - \sum _{n=1}^N \sum _{k=1}^K P_d(k|{\varvec{{x}}}_n;W) \left\{ {\varvec{{x}}}_{n}^{T}{\varvec{{u}}}_k - \sum _{k^\prime =1}^K P_d (k^\prime |{\varvec{{x}}}_n;W) {\varvec{{x}}}_{n}^{T} {\varvec{{u}}}_{k^\prime }\right\} ^2 \nonumber \\&\quad - \sum _{m=1}^M \sum _{k=1}^K P_d(k|{\varvec{{x}}}_m;W) \left\{ {\varvec{{x}}}_{m}^{T} {\varvec{{u}}}_k - \sum _{k^\prime =1}^K P_d (k^\prime |{\varvec{{x}}}_m;W) {\varvec{{x}}}_{m}^{T} {\varvec{{u}}}_{k^\prime } \right\} ^2, \end{aligned}$$
(32)

because \(\sum _{k=1}^K P_d (k|{\varvec{{x}}};W) = 1\) and \(P_d (k|{\varvec{{x}}};W) \ge 0\). When \({\varvec{{u}}} \ne \mathbf{0}\), \({\varvec{{u}}}^{T} \left[\partial ^2 g_d /\partial {\varvec{{w}}}_k \partial {\varvec{{w}}}_{k^\prime }^T \right]_{k,k^\prime } {\varvec{{u}}} < 0\) for arbitrary \(W\), since the first term of Eq. (32) is strictly negative and the remaining terms are nonpositive. This shows that the Hessian matrix of \(g_d(W,\Psi ^{(t)})\) with respect to \(W\) is negative definite, and hence also negative semidefinite, so \(g_d\) is concave with respect to \(W\).
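
The identity in Eq. (32) can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it evaluates the quadratic form once by accumulating the \((k,k^\prime )\) blocks of Eq. (31) and once through the squared-deviation form of Eq. (32), for randomly drawn values. The function names, the toy dimensions, and the random data are all assumptions; labeled and unlabeled feature vectors are pooled into a single list `X` because both sums in Eq. (31) have the same form.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def mlr_posterior(W, x):
    """MLR posterior P_d(k|x;W) = exp(w_k^T x) / sum_k' exp(w_k'^T x)."""
    logits = [dot(w, x) for w in W]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def quad_form_blockwise(W, U, X, sigma2):
    """u^T H u accumulated over the (k, k') blocks of Eq. (31)."""
    K = len(W)
    total = -sum(ui * ui for u in U for ui in u) / sigma2
    for x in X:
        p = mlr_posterior(W, x)
        a = [dot(u, x) for u in U]  # a_k = u_k^T x
        for k in range(K):
            for kp in range(K):
                delta = 1.0 if k == kp else 0.0
                total -= p[k] * (delta - p[kp]) * a[k] * a[kp]
    return total


def quad_form_variance(W, U, X, sigma2):
    """u^T H u in the squared-deviation form of Eq. (32)."""
    total = -sum(ui * ui for u in U for ui in u) / sigma2
    for x in X:
        p = mlr_posterior(W, x)
        a = [dot(u, x) for u in U]
        mean = sum(pk * ak for pk, ak in zip(p, a))
        total -= sum(pk * (ak - mean) ** 2 for pk, ak in zip(p, a))
    return total


rng = random.Random(1)
K, V, n_samples, sigma2 = 3, 5, 20, 10.0  # hypothetical toy sizes
W = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
U = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
X = [[rng.random() for _ in range(V)] for _ in range(n_samples)]

q_block = quad_form_blockwise(W, U, X, sigma2)
q_var = quad_form_variance(W, U, X, sigma2)
print(q_block, q_var)
```

The two values should agree up to floating-point rounding, and both should be negative for any nonzero `U`, in line with the conclusion of this section.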


Cite this article

Fujino, A., Ueda, N. & Nagata, M. Adaptive semi-supervised learning on labeled and unlabeled data with different distributions. Knowl Inf Syst 37, 129–154 (2013). https://doi.org/10.1007/s10115-012-0576-8


  • DOI: https://doi.org/10.1007/s10115-012-0576-8
