1 Introduction

Supervised binary classification assumes that every training sample is labeled as either positive or negative. However, in many practical scenarios, collecting a large number of labeled samples from the two categories is costly, difficult, or even impossible. In contrast, unlabeled data are relatively cheap and abundant. Consequently, semi-supervised learning is used to learn from partially labeled data (Chapelle et al. 2006). In this paper, as a special case of semi-supervised learning, we consider Positive-Unlabeled (PU) learning, the problem of building a binary classifier from only positive and unlabeled samples (Denis et al. 2005; Li and Liu 2005). PU learning provides a powerful framework when negative labels are impossible or very expensive to obtain, and thus has frequently appeared in many real-world applications. Examples include document classification (Elkan and Noto 2008; Xiao et al. 2011), image classification (Zuluaga et al. 2011; Gong et al. 2018), gene identification (Yang et al. 2012, 2014), and novelty detection (Blanchard et al. 2010; Zhang et al. 2017).

Several PU learning algorithms have been developed over the last two decades. Liu et al. (2002) and Li and Liu (2003) considered a two-step learning scheme: in Step 1, assigning negative labels to some unlabeled observations believed to be negative, and in Step 2, learning a binary classifier from the existing positive samples and the negatively labeled samples of Step 1. Liu et al. (2003) pointed out that the two-step learning scheme is based on heuristics, and suggested fitting a biased support vector machine by regarding all the unlabeled observations as negative.

Scott and Blanchard (2009) and Blanchard et al. (2010) suggested a modification of supervised Neyman–Pearson classification, whose goal is to find a classifier minimizing the false positive rate while keeping the false negative rate low. To circumvent the lack of negative samples, they instead sought a classifier minimizing the marginal probability of being classified as positive while keeping the false negative rate low. However, solving the empirical version of this constrained optimization problem is challenging, and the authors did not present an explicit algorithm.

Recently, many PU learning algorithms based on the empirical risk minimization principle have been studied. Du Plessis et al. (2014) proposed the use of the ramp loss and provided an algorithm that requires solving a non-convex optimization problem. Du Plessis et al. (2015) formulated a convex optimization problem using the logistic loss or the double hinge loss. However, all the aforementioned approaches involve solving a non-linear programming problem, which incurs a massive computational burden from computing the large Gram matrix when the sample size is large. Kiryo et al. (2017) suggested a stochastic algorithm with a non-negative risk estimator for large-scale datasets. However, the algorithm requires several hyperparameters, and choosing the optimal ones may demand substantial trial runs of the algorithm (Oh et al. 2018), causing heavy computation costs.

In supervised binary classification, Sriperumbudur et al. (2012) proposed a computationally efficient algorithm that builds a closed-form binary discriminant function. The authors showed that their function estimator, obtained by evaluating the negative of the empirical integral probability metric (IPM), is the minimizer of the empirical risk under the specific loss defined in Sect. 3.1. They further showed that a closed form can be derived by restricting the hypothesis space to a closed unit ball in a reproducing kernel Hilbert space (RKHS).

In this paper, capitalizing on the properties established for the supervised method of Sriperumbudur et al. (2012), we extend it to the PU learning setting. In addition, we derive new theoretical results on excess risk bounds. We first define a weighted version of the IPM between two probability measures and call it the weighted integral probability metric (WIPM). We show that computing the negative of the WIPM between the unlabeled data distribution and the positive data distribution is equivalent to minimizing the hinge risk. Based on this finding, we propose a binary discriminant function estimator that computes the negative of the empirical WIPM, and then derive associated upper bounds on the estimation error and the excess risk. Under a mild condition, our upper bounds are sharper than the existing ones because they rely on Talagrand’s inequality rather than McDiarmid’s inequality (Kiryo et al. 2017). Moreover, we pay special attention to the case where the hypothesis space is a closed ball in an RKHS and propose a closed-form classifier. We show that the associated excess risk bound has an explicit form that converges to zero as the sample sizes increase. To the best of our knowledge, this is the first result to explicitly show the excess risk bound in PU learning.

In summary, our main contributions are:

  • We formally define WIPM and establish a link with the infimum of the hinge risk (Theorem 1). We derive an estimation error bound and show that it is sharper than existing results (Theorem 2 and Proposition 1).

  • The proposed algorithm produces a closed-form classifier when the underlying hypothesis space is a closed ball in RKHS (Proposition 2). Furthermore, we obtain a novel excess risk bound that converges to zero as sample sizes increase (Theorem 3).

  • Numerical experiments using both synthetic and real datasets show that our method is comparable to or better than existing PU learning algorithms in terms of accuracy, scalability, and robustness in the case of unknown class-priors.

2 Preliminaries

In this section, we describe the L-risk for binary classification and present its PU representation. We briefly review several PU learning algorithms based on the L-risk minimization principle. We first introduce problem settings and notations.

2.1 Problem settings of PU learning

Let X and Y be random variables for input data and class labels, respectively, whose range is the product space \({\mathcal {X}} \times \{ \pm \, 1 \} \subseteq {\mathbb {R}}^{d} \times \{\pm \, 1 \}\), where d is a positive integer. We denote the joint distribution of \((X, Y)\) by \(P_{X,Y}\) and the marginal distribution of X by \(P_{X}\). The distributions of positive and negative data are defined by the conditional distributions \({P}_{X \mid Y=1}\) and \({P}_{X\mid Y=-1}\), respectively. Let \(\pi _{+} := P_{X,Y}(Y=1)\) be the marginal probability of being positive and set \(\pi _{-}= 1-\pi _{+}\). We follow the two-samples-of-data scheme (Ward et al. 2009; Niu et al. 2016). That is, let \({\mathcal {X}}_{\mathrm{p}} =\{x_i ^{\mathrm{p}} \}_{i=1} ^{n_{\mathrm{p}}}\) and \({\mathcal {X}}_{\mathrm{u}} =\{x_i ^{\mathrm{u}} \}_{i=1} ^{n_{\mathrm{u}}}\) be observed sets of independently and identically distributed samples from the positive data distribution \({P}_{X \mid Y=1}\) and the marginal distribution \({P}_X\), respectively. Here, \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) are the numbers of positive and unlabeled data points, respectively. Note that the unlabeled data distribution is the marginal distribution.

Let \({\mathcal {U}}\) be a class of real-valued measurable functions defined on \({\mathcal {X}}\). A function \(f \in {\mathcal {U}}\), often called a hypothesis, can be understood as a binary discriminant function and we classify an input x with the sign of a discriminant function, \({\mathrm{sign}}(f(x))\). Define \({\mathcal {M}} = \{f: {\mathcal {X}} \rightarrow {\mathbb {R}} \mid ||{f} ||_{\infty } \le 1 \} \subseteq {\mathcal {U}}\), where \(||{f} ||_{\infty } = \sup _{x \in {\mathcal {X}}} | f(x) |\) is the supremum norm. We restrict our attention to a class \({\mathcal {F}} \subseteq {\mathcal {M}}\) and call \({\mathcal {F}}\) a hypothesis space. Throughout this paper, we assume that the hypothesis space is symmetric, i.e., \(f \in {\mathcal {F}}\) implies \(-f \in {\mathcal {F}}\). In PU learning, the main goal is to construct a classifier \({\mathrm{sign}}(f(x))\) only from the positive dataset \({\mathcal {X}}_{\mathrm{p}}\) and the unlabeled dataset \({\mathcal {X}}_{\mathrm{u}}\) with \(f \in {\mathcal {F}}\).
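The two-samples-of-data scheme above can be sketched numerically. In the snippet below, the Gaussian class-conditional distributions, the class-prior value, and the sample sizes are all illustrative assumptions, not part of the paper's setup; the only structural point is that the unlabeled sample is drawn from the marginal mixture \(P_X = \pi _{+} P_{X \mid Y =1}+\pi _{-} P_{X \mid Y =-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)

pi_plus = 0.4          # class-prior P(Y = +1); illustrative value
n_p, n_u = 100, 400    # numbers of positive and unlabeled samples

# Positive sample: drawn i.i.d. from P_{X|Y=1} (a Gaussian here by assumption).
X_p = rng.normal(loc=1.0, scale=1.0, size=(n_p, 2))

# Unlabeled sample: drawn i.i.d. from the marginal mixture
# P_X = pi_+ * P_{X|Y=1} + pi_- * P_{X|Y=-1}; the latent labels are discarded.
is_pos = rng.random(n_u) < pi_plus
X_u = np.where(is_pos[:, None],
               rng.normal(1.0, 1.0, size=(n_u, 2)),
               rng.normal(-1.0, 1.0, size=(n_u, 2)))
```

Only `X_p` and `X_u` are available to a PU learner; the indicator `is_pos` exists solely inside the generator.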

In this paper, the quantity \(\pi _{+}\), often called the class-prior, is assumed to be known as in the literature (Kiryo et al. 2017; Kato et al. 2019) to focus on theoretical and practical benefits of our proposed algorithm. We examine the performance when \(\pi _{+}\) is unknown in Experiment 3 of Sects. 6.1 and 6.2.

2.2 L-risk minimization in PU learning

In supervised binary classification, the L-risk is defined by

$$\begin{aligned} R_L(f)&:= \int _{{\mathcal {X}} \times \{\pm \, 1\}} L(y, f(x)) dP_{X,Y}(x,y) \nonumber \\&= \pi _{+} \int _{{\mathcal {X}}} L(1, f(x)) dP_{X \mid Y=1}(x) + \pi _{-} \int _{{\mathcal {X}}} L(-\,1, f(x)) dP_{X \mid Y=-\,1}(x), \end{aligned}$$
(1)

for a loss function \(L : \{\pm \, 1\} \times {\mathbb {R}} \rightarrow {\mathbb {R}}\) (Steinwart and Christmann 2008, Section 2.1). If a loss function L(y, t) can be represented as a function of the margin yt, the product of a label y and a score t, for all \(y \in \{ \pm \, 1 \}\) and \(t \in {\mathbb {R}}\), we denote the corresponding margin-based loss function by \(\ell (yt) := L(y, t)\).

Under the PU learning framework, however, the right-hand side of Eq. (1) cannot be directly estimated due to the lack of negatively labeled observations. To circumvent this problem, many studies in the field of PU learning exploited the relationship \(P_X = \pi _{+} P_{X \mid Y =1}+\pi _{-} P_{X \mid Y =-1}\) and replaced \(P_{X \mid Y =-1}\) in Eq. (1) with \((P_X - \pi _{+} P_{X \mid Y =1})/\pi _{-}\) (Du Plessis et al. 2014; Sakai et al. 2017). That is, the L-risk can be alternatively expressed as:

$$\begin{aligned} R_L(f) = \int _{{\mathcal {X}}} L(-1, f(x)) dP_X(x) + \pi _{+} \int _{{\mathcal {X}}} L(1, f(x))- L(-1, f(x)) dP_{X \mid Y=1}(x). \end{aligned}$$
(2)

Now the right-hand side of Eq. (2) can be empirically estimated by the positive dataset \({\mathcal {X}}_{\mathrm{p}}\) and the unlabeled dataset \({\mathcal {X}}_{\mathrm{u}}\). However, the L-risk \(R_L (f)\) is not convex with respect to f in general, and minimizing an empirical estimator for \(R_L (f)\) is often formulated as a complicated non-convex optimization problem.
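As an illustration, the right-hand side of Eq. (2) can be estimated by replacing each integral with an empirical mean over the corresponding sample. The helper below is a minimal sketch under this substitution, not an algorithm from the paper; the toy inputs are arbitrary.

```python
import numpy as np

def empirical_pu_risk(f, L, X_p, X_u, pi_plus):
    """Empirical estimate of Eq. (2): the L-risk written using only
    positive and unlabeled data.
    f: discriminant function, L(y, t): loss, pi_plus: class-prior."""
    f_u, f_p = f(X_u), f(X_p)
    unlabeled_term = np.mean(L(-1, f_u))
    positive_term = pi_plus * np.mean(L(1, f_p) - L(-1, f_p))
    return unlabeled_term + positive_term

# Toy example with the hinge loss L(y, t) = max(0, 1 - y t).
hinge = lambda y, t: np.maximum(0.0, 1.0 - y * t)
X_p = np.array([1.0, 2.0])
X_u = np.array([-1.0, 0.0, 1.0])
risk = empirical_pu_risk(lambda x: x, hinge, X_p, X_u, pi_plus=0.5)  # -0.25
```

Note that on finite samples this estimate can be negative, the phenomenon that motivates the non-negative risk estimator of Kiryo et al. (2017).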

There have been several approaches to resolving the computational difficulty by modifying loss functions. Du Plessis et al. (2014) proposed to use non-convex loss functions satisfying the symmetric condition, \(L(1, f(x)) + L(-\,1, f(x))=1\). They proposed to optimize the empirical risk based on the ramp loss \(\ell _{\mathrm {ramp}}(yt)=0.5 \max ( 0, \min (2, 1-yt))\) via the concave-convex procedure (Collobert et al. 2006). Du Plessis et al. (2015) converted the problem to convex optimization through the linear-odd condition, \(L(1, f(x)) - L(-\,1, f(x)) = -\,f(x)\). They showed that the logistic loss \(\ell _{\mathrm {log}}(yt) = \log (1+\exp (-yt))\) and the double hinge loss \(\ell _{\mathrm {dh}}(yt) = \max (0, \max (-\,yt, (1-yt)/2))\) satisfy the linear-odd condition. However, all the aforementioned methods use a weighted sum of \(n_{\mathrm{p}} + n_{\mathrm{u}}\) predefined basis functions as a binary discriminant function, which requires computing the \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) Gram matrix. Hence, these algorithms are not scalable and can be intractable when \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) are large (Sansone et al. 2019). Our first goal in this paper is to overcome this computational problem by providing a computationally efficient method.
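The symmetric and linear-odd conditions above can be checked numerically for the three margin losses. This is a quick verification sketch, not code from the paper; for a margin loss, \(L(1,t)=\ell (t)\) and \(L(-1,t)=\ell (-t)\).

```python
import numpy as np

def ramp(m):      return 0.5 * np.maximum(0.0, np.minimum(2.0, 1.0 - m))
def logistic(m):  return np.log1p(np.exp(-m))
def dhinge(m):    return np.maximum(0.0, np.maximum(-m, (1.0 - m) / 2.0))

t = np.linspace(-3.0, 3.0, 1001)

# Symmetric condition for the ramp loss: L(1, t) + L(-1, t) = 1.
assert np.allclose(ramp(t) + ramp(-t), 1.0)

# Linear-odd condition: L(1, t) - L(-1, t) = -t,
# satisfied by the logistic and double hinge losses.
assert np.allclose(logistic(t) - logistic(-t), -t)
assert np.allclose(dhinge(t) - dhinge(-t), -t)
```

Under the linear-odd condition, the bracketed integrand in Eq. (2) collapses to \(-f(x)\), which is why these losses yield convex PU objectives.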

3 Weighted integral probability metric and L-risk

In this section, we formally define WIPM, a key tool for constructing the proposed algorithm, and build a link with the L-risk in Theorem 1 below. Based on the link, we propose a new binary discriminant function estimator and present its theoretical properties in Theorem 2. We first introduce the earlier work by Sriperumbudur et al. (2012) that provided a closed-form classifier in supervised binary classification.

3.1 Relation between IPM and L-risk in supervised binary classification

Müller (1997) introduced an IPM for any two probability measures P and Q defined on \({\mathcal {X}}\) and a class \({\mathcal {F}}\) of bounded measurable functions, given by

$$\begin{aligned} {\mathrm{IPM}}({P}, {Q}; {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - \int _{{\mathcal {X}}} f(x) d{Q}(x) \right| . \end{aligned}$$

IPM has been studied as either a metric between two probability measures (Sriperumbudur et al. 2010a; Arjovsky et al. 2017; Tolstikhin et al. 2018) or a hypothesis testing tool (Gretton et al. 2012).

Under the supervised binary classification setting, Sriperumbudur et al. (2012) showed that calculating the IPM between \(P_{X \mid Y=1}\) and \(P_{X \mid Y=-\,1}\) is negatively related to minimizing the risk with a specific loss function, i.e., \({\mathrm{IPM}}(P_{X\mid Y=1}, P_{X\mid Y=-\,1}; {\mathcal {F}})=-\,\inf _{f \in {\mathcal {F}}} R_{L_{\mathrm {c}}}(f)\), where \(L_{\mathrm {c}}(1, t)= -\,t/\pi _{+}\) and \(L_{\mathrm {c}}(-\,1, t)= t/\pi _{-}\) for all \(t \in {\mathbb {R}}\). They further showed that a discriminant function minimizing the \(L_{\mathrm {c}}\)-risk can be obtained analytically when \({\mathcal {F}}\) is a closed unit ball in an RKHS. This result cannot be directly extended to PU learning due to the absence of negatively labeled observations. In the next subsection, we define a generalized version of the IPM and extend the previous results for supervised binary classification to PU learning.

3.2 Extension to WIPM and L-risk in PU learning

Let \({\mathcal {F}}\) be a given class of bounded measurable functions and let \({\tilde{w}}: {\mathcal {X}} \rightarrow {\mathbb {R}}\) be a weight function such that \(||{{\tilde{w}}} ||_{\infty } < \infty \). We define the WIPM between two probability measures P and Q with a function class \({\mathcal {F}}\) and a weight function \({\tilde{w}}\) by

$$\begin{aligned} {\mathrm{WIPM}}({P}, {Q}; {\tilde{w}}, {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - \int _{{\mathcal {X}}} {\tilde{w}}(x) f(x) d{Q}(x) \right| . \end{aligned}$$
(3)

Note that WIPM reduces to IPM if \({\tilde{w}}(x)=1\) for all \(x \in {\mathcal {X}}\). Other special cases of Eq. (3) have been discussed in many applications. In the covariate shift problem, Huang et al. (2007) and Gretton et al. (2009) proposed to minimize WIPM with respect to \({\tilde{w}}\) when \({\mathcal {F}}\) is the unit ball in an RKHS and P and Q are the empirical distributions of test and training data, respectively. In unsupervised domain adaptation, Yan et al. (2017) regarded P and Q as the empirical distributions of target and source data, respectively, in which case \({\tilde{w}}\) is a ratio of two class-prior distributions.

We pay special attention to the case where \({\tilde{w}}(x)\) is a constant \(w \in {\mathbb {R}}\) for every input value, and denote the corresponding WIPM by \({\mathrm{WIPM}}({P}, {Q}; w, {\mathcal {F}})\),

$$\begin{aligned} {\mathrm{WIPM}}({P}, {Q}; w, {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - w \int _{{\mathcal {X}}} f(x) d{Q}(x) \right| . \end{aligned}$$

In the following theorem, we establish a link between \({\mathrm{WIPM}}(P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}})\) and the infimum of the \(\ell _{\mathrm {h}}\)-risk over \({\mathcal {F}}\) for the hinge loss \(\ell _{\mathrm {h}}(yt) = \max (0, 1-yt)\).

Theorem 1

(Relationship between \(\ell _{\mathrm {h}}\)-risk and WIPM) Let \({\mathcal {F}}\) be a symmetric hypothesis space in \({\mathcal {M}}\) and \(\ell _{\mathrm {h}}(yt) = \max (0, 1-yt)\) be the hinge loss. Then, we have

$$\begin{aligned} \inf _{f \in {\mathcal {F}}} R_{\ell _{\mathrm {h}}} (f) = 1 - {\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}). \end{aligned}$$

Moreover, if \(g_{{\mathcal {F}}}\) satisfies

$$\begin{aligned} {\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}) = \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_X(x) - 2\pi _{+} \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_{X \mid Y=1}(x), \end{aligned}$$

then \(\inf _{f \in {\mathcal {F}}} R_{\ell _{\mathrm {h}}}(f) = R_{\ell _{\mathrm {h}}} (-\,g_{{\mathcal {F}}})\).

Theorem 1 shows that, up to an additive constant, the infimum of the \(\ell _{\mathrm {h}}\)-risk over a hypothesis space \({\mathcal {F}}\) equals the negative WIPM between the unlabeled data distribution \(P_X\) and the positive data distribution \(P_{X \mid Y=1}\) with the same hypothesis space \({\mathcal {F}}\) and the weight \(2\pi _{+}\). Furthermore, by negating the WIPM optimizer \(g_{{\mathcal {F}}}\), we obtain the minimizer of the \(\ell _{\mathrm {h}}\)-risk over the hypothesis space \({\mathcal {F}}\). Here, we define a WIPM optimizer \(g_{{\mathcal {F}}}\) as a function that attains the supremum, i.e., \({\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}) = \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_X(x) - 2\pi _{+} \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_{X \mid Y=1}(x)\), and we set \(f_{{\mathcal {F}}} = -\,g_{{\mathcal {F}}}\) for later notational convenience. Sriperumbudur et al. (2012) derived a result similar to Theorem 1 by showing \({\mathrm{IPM}}(P_{X\mid Y=1}, P_{X\mid Y=-1}; {\mathcal {F}}) = -\,\inf _{f \in {\mathcal {F}}} R_{L_{\mathrm {c}}}(f)\) in supervised binary classification. However, as mentioned in Sect. 3.1, their method is only applicable to supervised binary classification settings.
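Theorem 1 can be verified numerically in a discrete toy setting where both sides of the identity are computable exactly over a finite symmetric hypothesis class. All distributions and grid values below are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Toy setting on X = {0, 1}; all numbers are illustrative.
pi_plus = 0.3
p1  = np.array([0.2, 0.8])                   # P_{X|Y=+1}
pm1 = np.array([0.8, 0.2])                   # P_{X|Y=-1}
px  = pi_plus * p1 + (1 - pi_plus) * pm1     # marginal P_X

hinge = lambda m: np.maximum(0.0, 1.0 - m)

# A finite symmetric hypothesis class inside M (sup-norm <= 1):
# each f is the vector (f(0), f(1)) on a symmetric grid.
grid = np.array(list(product(np.linspace(-1.0, 1.0, 9), repeat=2)))

def hinge_risk(f):
    # Supervised form of Eq. (1) with the hinge loss:
    # pi_+ E_{P_{X|1}} l_h(f) + pi_- E_{P_{X|-1}} l_h(-f).
    return pi_plus * (p1 @ hinge(f)) + (1 - pi_plus) * (pm1 @ hinge(-f))

inf_risk = min(hinge_risk(f) for f in grid)
wipm = np.abs(grid @ px - 2 * pi_plus * (grid @ p1)).max()
# Theorem 1: inf_F R_h(f) = 1 - WIPM(P_X, P_{X|Y=1}; 2*pi_+, F).
```

Since every f in the class satisfies \(|f| \le 1\), the hinge risk is affine in f and the identity holds exactly, not just approximately.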

3.3 Theoretical properties of empirical WIPM optimizer

We denote the empirical distributions of \({P}_{X \mid Y=1}\) and \({P}_{X}\) by \({P}_{X \mid Y=1, n_{\mathrm{p}}}\) and \({P}_{X, n_{\mathrm{u}}}\), respectively. Let \({P}_{X \mid Y=1, n_{\mathrm{p}}} = n_{\mathrm{p}} ^{-1} \sum _{i=1} ^{n_{\mathrm{p}}} \delta _{x_i ^{\mathrm{p}} }\) and \({P}_{X, n_{\mathrm{u}}} = n_{\mathrm{u}} ^{-1} \sum _{i=1} ^{n_{\mathrm{u}}} \delta _{x_{i} ^{\mathrm{u}} }\), where \(\delta (\cdot )\) defined on \({\mathcal {X}}\) is the Dirac delta function and \(\delta _x (\cdot ) := \delta (\cdot -x)\) for \(x \in {\mathcal {X}}\). The empirical Rademacher complexity of \({\mathcal {F}}\) given a set \(S= \{z_1, \dots , z_{m}\} \) is defined by \({\mathfrak {R}}_{S}({\mathcal {F}} ) := {\mathbb {E}}_{\sigma } \left( \frac{1}{m} \sup _{f \in {\mathcal {F}}} \left| \sum _{i=1} ^{m} \sigma _i f(z_i) \right| \right) \). Here, \(\{\sigma _i\}_{i=1} ^{m}\) is a set of independent Rademacher random variables taking 1 or \(-\,1\) with probability 0.5 each, and \({\mathbb {E}}_{\sigma }(\cdot )\) is the expectation operator over the Rademacher random variables (Bartlett and Mendelson 2002). Denote the maximum by \(a \vee b := \max (a,b)\) and the minimum by \( a \wedge b := \min (a,b)\). For a probability measure Q defined on \({\mathcal {X}}\), denote the expectation of a discriminant function f by \({\mathbb {E}}_{Q}(f) := \int _{{\mathcal {X}}} f(x) d{Q}(x)\) and the variance by \({\mathrm {Var}}_{{Q}} (f) := {\mathbb {E}}_{Q}(f^2) - ({\mathbb {E}}_{Q}(f))^2\).
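The empirical Rademacher complexity above has no closed form for general classes, but for a finite class it can be approximated by Monte Carlo over the sign draws. The sketch below is purely illustrative; `F_values[j, i]` holds the value of the j-th function at the i-th point.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class: E_sigma (1/m) sup_f |sum_i sigma_i f(z_i)|.
    F_values: array of shape (n_functions, m)."""
    rng = np.random.default_rng(seed)
    m = F_values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.abs(F_values @ sigma).max() / m
    return total / n_draws
```

For functions bounded by 1 in sup-norm, the estimate lies in [0, 1], and it shrinks as m grows, consistent with the complexity terms in the bounds that follow.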

The empirical estimator for \({\mathrm{WIPM}}({P}_{X}, {P}_{X \mid Y=1}; w, {\mathcal {F}})\) is given by plugging in the empirical distributions,

$$\begin{aligned}&{\mathrm{WIPM}}({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, {\mathcal {F}}) = \sup _{f \in {\mathcal {F}}} \left| \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} f( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} f( x_i ^{\mathrm{p}}) \right| , \end{aligned}$$

and we define an empirical WIPM optimizer \({\hat{g}}_{{\mathcal {F}}} \in {\mathcal {F}}\) that satisfies the following equation,

$$\begin{aligned} {\mathrm{WIPM}}({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, {\mathcal {F}}) = \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} {\hat{g}}_{{\mathcal {F}}} ( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} {\hat{g}}_{{\mathcal {F}}} ( x_i ^{\mathrm{p}}). \end{aligned}$$
(4)

We set \({\hat{f}}_{{\mathcal {F}}}= -{\hat{g}}_{{\mathcal {F}}}\) for notational convenience as in Sect. 3.2.
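For intuition, Eq. (4) can be evaluated exactly when the hypothesis class is a small finite symmetric family; by symmetry, the absolute value in the supremum can be dropped after also trying each negated function. This toy sketch is illustrative only and is not the paper's RKHS-based algorithm of Sect. 4.

```python
import numpy as np

def empirical_wipm(fs, X_u, X_p, w):
    """Empirical WIPM over a finite symmetric hypothesis class.
    fs: list of callables; symmetry is enforced by also trying -f.
    Returns the supremum value and an attaining function g-hat (Eq. (4))."""
    best_val, best_g = -np.inf, None
    for f in fs:
        for g in (f, lambda x, f=f: -f(x)):
            val = np.mean(g(X_u)) - w * np.mean(g(X_p))
            if val > best_val:
                best_val, best_g = val, g
    return best_val, best_g

# Toy example: one bounded function together with its negation.
val, g_hat = empirical_wipm([lambda x: np.clip(x, -1.0, 1.0)],
                            X_u=np.array([-1.0, 1.0]),
                            X_p=np.array([1.0]),
                            w=0.8)
f_hat = lambda x: -g_hat(x)   # the discriminant estimator is the negation
```

Here the supremum (0.8) is attained by the negated function, so `f_hat` recovers the original one, mirroring the relation \({\hat{f}}_{{\mathcal {F}}}= -{\hat{g}}_{{\mathcal {F}}}\).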

We analyze the estimation error \(R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)\) in the following theorem. To begin, let \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (w)=w/\sqrt{n_{\mathrm{p}}} + 1/\sqrt{n_{\mathrm{u}}}\) and \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (w) = 2(w/n_{\mathrm{p}} + 1/n_{\mathrm{u}})\).

Theorem 2

(Estimation error bound for general function space) Let \({\hat{g}}_{{\mathcal {F}}}\) be an empirical WIPM optimizer defined in Eq. (4) and set \({\hat{f}}_{{\mathcal {F}}} = -\,{\hat{g}}_{{\mathcal {F}}}\). Let \({\mathcal {F}}\) be a symmetric hypothesis space such that \(||{f} ||_{\infty } \le \nu \le 1\), \({\mathrm {Var}}_{{P}_{X \mid Y=1}} (f) \le \sigma _{X \mid Y=1} ^2\), and \({\mathrm {Var}}_{{P}_{X}} (f) \le \sigma _{X} ^2\). Denote \(\rho ^2 = \sigma _{X \mid Y=1} ^2 \vee \sigma _{X} ^2\). Then, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)&\le C_{\alpha } ({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2 \pi _{+} {\mathbb {E}}_{P_{X \mid Y =1 } ^{n_{\mathrm{p}}} }( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \nonumber \\&\quad +\, C_{\tau , \rho ^2}^{(1)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}), \end{aligned}$$
(5)

where \(C_{\alpha }=4(1+\alpha )\), \(C_{\tau , \rho ^2}^{(1)} =2\sqrt{ 2 \tau \rho ^2 }\), \(C_{\tau , \nu , \alpha }^{(2)} = 2\tau \nu \left( \frac{2}{3} + \frac{1}{\alpha } \right) \).

Due to Talagrand’s inequality, Theorem 2 provides a sharper bound than the existing result based on McDiarmid’s inequality. Specifically, Kiryo et al. (2017, Theorem 4) utilized McDiarmid’s inequality and showed that for \(\tau >0\) and some \(\varDelta >0\) the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{\mathrm {h}}}({\hat{f}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)&\le 8 ( {\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2\pi _{+} {\mathbb {E}}_{P_{X \mid Y =1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \nonumber \\&\quad +\, \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) (1 + \nu ) \sqrt{2\tau } + \varDelta . \end{aligned}$$
(6)

The following proposition shows that the proposed upper bound (5) is sharper than the upper bound (6) under a certain condition.

Proposition 1

With the notations defined in Theorem 2, suppose that the following holds,

$$\begin{aligned} \frac{1+\nu }{2} - \frac{ 5 \sqrt{2 \tau } \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)}(2\pi _{+}) \nu }{ 6 \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+})} \ge \rho . \end{aligned}$$
(7)

Then, the proposed upper bound (5) is sharper than the previous result (6) proposed by Kiryo et al. (2017).

It is noteworthy that the second term on the left-hand side of (7) converges to zero as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase because \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) = O_{ {P}_{X \mid Y=1}, {P}_{X}} ( (n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1/2} )\) and \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)}(2\pi _{+}) = O_{ {P}_{X \mid Y=1}, {P}_{X}} ( (n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1} )\). Since \((1+\nu )/2 \ge \nu \ge \rho \), the condition (7) is quite reasonable if the upper bounds on the variances, \(\sigma _{X} ^2\) and \(\sigma _{X \mid Y=1} ^2\), are sufficiently small.

In binary classification, one ultimate goal is to find a classifier minimizing the misclassification error, or equivalently, the excess risk. Bartlett et al. (2006) showed that there is an invertible function \(\psi : [-\,1,1] \rightarrow [0,\infty )\) such that the excess risk \(R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)\) is bounded above by \(\psi ^{-1} (R_{\ell } ({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell }(f) )\) if the margin-based loss \(\ell \) is classification-calibrated. In particular, Zhang (2004) showed that the excess risk is bounded above by the excess \(\ell _{\mathrm {h}}\)-risk, i.e., \(R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f) \le R_{\ell _{\mathrm {h}}} ({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f)\). This implies that an excess risk bound can be obtained by analyzing the excess \(\ell _{\mathrm {h}}\)-risk bound with Theorem 2. The following corollary provides the excess risk bound.

Corollary 1

(Excess risk bound for general function space) With the notations defined in Theorem 2, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)&\le \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f) -\inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f) \\&\quad + \, C_{\alpha } ({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2 \pi _{+} {\mathbb {E}}_{P_{X \mid Y =1 } ^{n_{\mathrm{p}}} }( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \\&\quad +\, C_{\tau , \rho ^2}^{(1)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}). \end{aligned}$$

4 WIPM optimizer with reproducing kernel Hilbert space

In this section, we provide a computationally efficient PU learning algorithm that builds an analytic classifier when the hypothesis space is a closed ball in an RKHS. In addition, unlike the excess risk bound in Corollary 1, we explicitly derive a bound that converges to zero as the sample sizes \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase.

4.1 An analytic classifier via WMMD optimizer

We assume that \({\mathcal {X}} \subseteq [0,1]^{d}\) is compact. Let \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\) be a reproducing kernel defined on \({\mathcal {X}}\) and \({\mathcal {H}}_k\) be the associated RKHS with the inner product \(\langle \cdot , \cdot \rangle _{{\mathcal {H}}_k} : {\mathcal {H}}_k \times {\mathcal {H}}_k \rightarrow {\mathbb {R}}\). We denote the induced norm by \(||{ \cdot } ||_{{\mathcal {H}}_k}\). Denote the closed ball in the RKHS \({\mathcal {H}}_k\) with radius \(r>0\) by \({\mathcal {H}}_{k,r} = \{f: ||{f} ||_{{\mathcal {H}}_k} \le r \}\). We define the weighted maximum mean discrepancy (WMMD) between two probability measures P and Q with a weight w and a closed ball \({\mathcal {H}}_{k,r}\) by \({\mathrm{WMMD}}_k ({P}, {Q}; w, r) := {\mathrm{WIPM}} ({P}, {Q};w, {\mathcal {H}}_{k,r})\). The name WMMD comes from the maximum mean discrepancy (MMD), a popular example of the IPM whose function space is the unit ball \({{\mathcal {H}}}_{k,1}\), i.e., \(\mathrm{MMD}_k ({P}, {Q}) := {\mathrm{IPM}} ({P}, {Q};{\mathcal {H}}_{k,1})\) (Sriperumbudur et al. 2010a, b). As defined in Eq. (4), let \({\hat{g}}_{{\mathcal {H}}_{k,r}} \in {\mathcal {H}}_{k,r}\) be the empirical WMMD optimizer such that

$$\begin{aligned} {\mathrm{WMMD}}_k ({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, r) = \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} {\hat{g}}_{{\mathcal {H}}_{k,r}} ( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} {\hat{g}}_{{\mathcal {H}}_{k,r}} ( x_i ^{\mathrm{p}}). \end{aligned}$$

In addition, we set \({\hat{f}}_{{\mathcal {H}}_{k,r}} = -\,{\hat{g}}_{{\mathcal {H}}_{k,r}}\), which yields the classification rule \({\mathrm{sign}}({\hat{f}}_{{\mathcal {H}}_{k,r}}(z))\). In the following proposition, we show that this classification rule has an analytic expression by exploiting the reproducing property \(f(x) = \langle f, k(\cdot , x) \rangle _{{\mathcal {H}}_k}\) and the Cauchy-Schwarz inequality.

Proposition 2

Let \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}} \) be a bounded reproducing kernel. Then, the classification rule has a closed-form expression given by

$$\begin{aligned} {\mathrm{sign}} ({\hat{f}}_{{\mathcal {H}}_{k,r}}(z) ) = {\left\{ \begin{array}{ll} +\,1 &{}\quad {\mathrm{if}} \quad (2\pi _{+})^{-1} < {\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z), \\ -\,1 &{}\quad otherwise, \end{array}\right. } \end{aligned}$$
(8)

where

$$\begin{aligned} {\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z) = \frac{ n_{\mathrm{p}} ^{-1} \sum _{i=1 } ^{n_{\mathrm{p}}} k(z, x_i ^{\mathrm{p}})}{ n_{\mathrm{u}} ^{-1} \sum _{i=1 } ^{n_{\mathrm{u}}} k(z, x_i ^{\mathrm{u}})}. \end{aligned}$$

We call the classifier defined in Eq. (8) the WMMD classifier and the score \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z)\) the WMMD score for z. One strength of the WMMD classifier is that the classification rule has a closed-form expression, resulting in computational efficiency. Furthermore, the WMMD score \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }\) is independent of the class-prior \(\pi _{+}\), and thus we can obtain the score function without prior knowledge of the class-prior.
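The closed form of Proposition 2 can be implemented in a few lines. Below is a minimal sketch of the WMMD score and classifier using the Gaussian kernel \(k(x,y)=\exp (-||x-y||_2^2/2)\) that appears later in Theorem 3; the toy data in the test are arbitrary.

```python
import numpy as np

def gaussian_kernel(a, b):
    """k(x, y) = exp(-||x - y||^2 / 2) for rows of a (n, d) and b (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / 2.0)

def wmmd_score(z, X_p, X_u):
    """WMMD score: ratio of the mean kernel evaluation against the
    positive sample to that against the unlabeled sample (Proposition 2)."""
    return gaussian_kernel(z, X_p).mean(axis=1) / gaussian_kernel(z, X_u).mean(axis=1)

def wmmd_classify(z, X_p, X_u, pi_plus):
    """Closed-form WMMD classifier of Eq. (8):
    predict +1 iff the score exceeds 1 / (2 * pi_plus)."""
    return np.where(wmmd_score(z, X_p, X_u) > 1.0 / (2.0 * pi_plus), 1, -1)
```

As noted above, only the threshold depends on the class-prior: `wmmd_score` can be computed once, and different values of \(\pi _{+}\) merely shift the decision cutoff.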

4.2 Explicit excess risk bound of WMMD classifier

Since the empirical WMMD optimizer \({\hat{g}}_{{\mathcal {H}}_{k,r}}\) is a special case of the empirical WIPM optimizer, Corollary 1 yields an excess risk bound. However, without knowing the convergence rates of the Rademacher complexities, \({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) )\) and \({\mathbb {E}}_{P_{X \mid Y=1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) )\), and the approximation error, the consistency of the classifier remains unclear. In this subsection, we establish an explicit excess risk bound that vanishes as the sample sizes increase. We first derive an explicit estimation error bound in the following lemma.

Lemma 1

(Explicit estimation error bound) With the notations defined in Theorem 2, assume that a reproducing kernel k defined on a compact space \({\mathcal {X}}\) is bounded. Let \(r_1^{-1}=\sup _{x \in {\mathcal {X}}} \sqrt{k(x,x)}\). Then, we have \({\mathcal {H}}_{k, r_1} \subseteq {\mathcal {M}}\). Moreover, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} {R_{\ell _{\mathrm {h}}} ({\hat{f}}_{{\mathcal {H}}_{k,r_1}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}} (f)} \le ( C_{\alpha } + C_{\tau , \rho ^2}^{(1)} ) \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}). \end{aligned}$$

While the bound in Theorem 2 is expressed in terms of \({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) )\) and \({\mathbb {E}}_{P_{X \mid Y=1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}( {\mathcal {F}}) )\), these quantities are evaluated in terms of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) in the upper bound of Lemma 1, giving an explicit estimation error bound with an \(O((n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1/2})\) convergence rate. The key idea is to use the reproducing property \(f(x) = \langle f, k(\cdot , x) \rangle _{{\mathcal {H}}_k}\) and the Cauchy-Schwarz inequality to obtain an upper bound for the Rademacher complexity.

In the following lemma, we elaborate on the approximation error bound. To begin, for any \(0< \beta \le 1\), let \(\beta {\mathcal {M}} := \{ \beta f : f \in {\mathcal {M}} \}\). Set \(f_1^* (x) = {\mathrm{sign}} ( P(Y=1 \mid X=x) - \frac{1}{2} )\).

Lemma 2

(Approximation error bound over uniformly bounded hypothesis space) With the notations defined in Lemma 1, we have

$$\begin{aligned} \inf _{f \in {\mathcal {H}}_{k,r_1} } {R_{\ell _{\mathrm {h}}} ( f ) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}} (f)} \le \beta \inf _{g \in {\mathcal {H}}_{k,{r_1 /\beta }}} ||{ g- f_1^*} ||_{L_2(P_X)}, \end{aligned}$$

for any \(0< \beta \le 1\).

When \(\beta =1\), Lemma 2 implies that the approximation error \( \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}} ( f ) - \inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}} (f) \) is bounded above by \( \inf _{g \in {\mathcal {H}}_{k,{r_1}}} ||{ g- f_1^*} ||_{L_2(P_X)}\) because \(\inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f) = \inf _{f \in {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f)\) (Lin 2002). Hence, a naive substitution into Corollary 1 would give a sub-optimal bound because \(\inf _{g \in {\mathcal {H}}_{k,{r_1}}} ||{ g- f_1^*} ||_{L_2(P_X)}\) is non-zero in general. In the following theorem, we rigorously establish an explicit excess risk bound that vanishes as \(n_{\mathrm{p}}\) and \( n_{\mathrm{u}}\) increase. The proof and the conditions (C1) and (C2) are given in “Appendix A”.

Theorem 3

(Explicit excess risk bound) With the notations defined in Theorem 2 and in Lemma 1, assume that the Gaussian kernel \(k(x,y) = \exp (-||{x-y} ||_2 ^2 / 2)\) is used. Then, \(r_1 =1\). Furthermore, under the conditions (C1) and (C2), for all \(0<s<1\), \(\alpha >0\), and \(\tau >0\), the following holds with probability at least \(1-e^{-\tau }\):

$$\begin{aligned}&R_{\ell _{01}}({\hat{f}}_{{\mathcal {H}}_{k,1}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f) \\&\quad \le ( C_{\alpha } + C_{\tau , \rho ^2}^{(1)} ) \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) ^{1-s} + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) ^{-s} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}) \\&\qquad + C_d ||{\frac{dP_{X}}{d\lambda }} ||_{L_2 (\lambda )} ||{f_1 ^*} ||_{1/4,2} \{-\, s \ln \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) \}^{-1/16}, \end{aligned}$$

where

$$\begin{aligned} C_{d}=\left\{ \left( \frac{\sqrt{2}\pi }{\log 2}\sqrt{d}\right) ^{1/8}+64\sqrt{d}\left( \frac{8}{\pi }\right) ^{d}\right\} . \end{aligned}$$

Compared to Corollary 1, the excess risk bound in Theorem 3 has an explicit form and thus obviates the approximation error term. In addition, this bound converges to zero as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase, with convergence rate \(O( \{\ln (n_{\mathrm{p}} \wedge n_{\mathrm{u}})\}^{-1/16} )\). To derive the excess risk bound, we first note that the misclassification error is determined by the sign of the discriminant function alone, i.e., \(\inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)= \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f)\) for any \(0< \beta \le 1\). Next, we show that \(R_{\ell _{01}}(g) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f) \le \{ R_{\ell _{\mathrm {h}}} (g) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \}/ \beta \) by modifying an argument of Bartlett et al. (2006). We then bound \(\{ R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {H}}_{k,r_1}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) \} /\beta \) using Lemma 1, and \( \{ \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \} /\beta \) using Lemma 2 together with a result of Smale and Zhou (2003, Proposition 1.1), both in terms of \(\beta \). A carefully chosen \(\beta \) then yields the explicit excess risk bound.
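The chain of reductions just described can be summarized in a single decomposition. For any \(0 < \beta \le 1\), writing \({\hat{f}} = {\hat{f}}_{{\mathcal {H}}_{k,r_1}}\),

$$\begin{aligned} R_{\ell _{01}}({\hat{f}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)&= R_{\ell _{01}}({\hat{f}}) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f) \\&\le \frac{1}{\beta } \Big \{ R_{\ell _{\mathrm {h}}}({\hat{f}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) \Big \} + \frac{1}{\beta } \Big \{ \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \Big \}, \end{aligned}$$

where the first term on the right-hand side is controlled by Lemma 1 and the second by Lemma 2 together with Smale and Zhou (2003, Proposition 1.1); optimizing over \(\beta \) gives Theorem 3.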

Niu et al. (2016) provided an excess risk bound expressed as a function of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\). However, their bound includes combined terms of the approximation error and the Rademacher complexity, as in Corollary 1. To the best of our knowledge, we are the first in PU learning to derive an excess risk bound with an explicit convergence rate as a function of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\).

5 Related work

Excess risk bound in the noisy label literature PU learning can be regarded as a special case of classification with asymmetric label noise, and many studies in that literature have shown consistency results similar to Theorem 3 (Natarajan et al. 2013). Patrini et al. (2016) derived an explicit estimation error bound when \({\mathcal {F}}\) is a set of linear hypotheses, and Blanchard et al. (2016) showed consistency of the excess risk bound when the hypothesis space is an RKHS with a universal kernel. While these two studies assumed a one-sample scheme, our bound is based on the two-sample scheme. Therefore, our excess risk bound is expressed in terms of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\), giving a new consistency theory.

Closed-form classifier Blanchard et al. (2010) suggested a score function similar to the WMMD score, using different bandwidth hyperparameters for the denominator and the numerator. With these differences, our method enjoys theoretical justification while their score function does not. Du Plessis et al. (2015) derived a closed-form classifier based on the squared loss. They estimated \(P(Y=1 \mid X)- P(Y=-\,1 \mid X)\) and showed consistency of the estimation error bound in the two-sample scheme. However, their classifier is not scalable because it requires computing the inverse of an \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) matrix.

6 Numerical experiments

In this section, we empirically analyze the proposed algorithm to demonstrate its practical efficacy using synthetic and real datasets. A PyTorch implementation of the experiments is available at https://github.com/eraser347/WMMD_PU.

6.1 Synthetic data analysis

We first visualize the effect of increasing the sample sizes \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) on the discriminant ability of the proposed algorithm (Experiment 1). Then we compare the performance of (i) the logistic loss \(\ell _{\mathrm {log}}\), denoted by LOG, and (ii) the double hinge loss \(\ell _{\mathrm {dh}}\), denoted by DH, both proposed by Du Plessis et al. (2015), (iii) the non-negative risk estimator method, denoted by NNPU, proposed by Kiryo et al. (2017), (iv) the threshold adjustment method, denoted by tADJ, proposed by Elkan and Noto (2008), and (v) the proposed algorithm, denoted by WMMD (Experiments 2, 3, and 4).

Fig. 1
figure 1

The illustration of the decision boundaries of the WMMD classifier on the two_moons dataset as the sizes of the positive and unlabeled samples increase. The true means of the positive and negative data distributions are plotted as blue and red lines, respectively. The gray ‘\(+\)’ points and the gray ‘−’ points refer to the unlabeled positive and unlabeled negative training data, respectively (Color figure online)

Experiment 1 In this experiment, we used the two_moons dataset whose underlying distributions are

$$\begin{aligned} X|Y=y, U&\sim N\left( \begin{bmatrix}2(1+y) - 4y\cos (\pi U) \\ (1+y)-4y\sin (\pi U) \end{bmatrix},\begin{bmatrix}0.4^{2}&0\\ 0&0.4^{2} \end{bmatrix}\right) , \end{aligned}$$

where U is a uniform random variable on \([0,1]\) and \(N(\mu ,\varSigma )\) is the normal distribution with mean \(\mu \) and covariance \(\varSigma \). We used the ‘make_moons’ function in the Python module ‘sklearn.datasets’ (Pedregosa et al. 2011) to generate the datasets.
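The sampling scheme above can be sketched with the same scikit-learn utility. The helper below is illustrative rather than the authors' code: it draws the positive set from \(P_{X \mid Y=1}\) and the unlabeled set from the marginal \(P_X\) with class-prior \(\pi _{+}\); note that sklearn's `make_moons` means differ from the display above by an affine transformation.

```python
import numpy as np
from sklearn.datasets import make_moons


def sample_pu_two_moons(n_p, n_u, pi_plus=0.5, noise=0.4, seed=0):
    """Draw a PU sample from a two-moons distribution (illustrative).

    Returns X_p ~ P_{X|Y=1} (n_p points) and X_u ~ P_X with
    class-prior pi_plus (n_u points, labels discarded).
    """
    rng = np.random.RandomState(seed)

    # Positive sample: generate extra points and keep those labeled 1.
    X, y = make_moons(n_samples=4 * n_p, noise=noise, random_state=rng)
    X_p = X[y == 1][:n_p]

    # Unlabeled sample: mix positives and negatives with prior pi_plus.
    n_pos = rng.binomial(n_u, pi_plus)
    X2, y2 = make_moons(n_samples=4 * n_u, noise=noise, random_state=rng)
    X_u = np.vstack([X2[y2 == 1][:n_pos], X2[y2 == 0][:n_u - n_pos]])
    return X_p, X_u
```

For instance, `sample_pu_two_moons(5, 50)` matches the smallest setting shown in Fig. 1.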

Figure 1 illustrates the decision boundaries of WMMD using the two_moons dataset. The first row displays the case where the unlabeled sample size is small, \(n_{\mathrm{u}}=50\), and the second row displays the case where the unlabeled sample size is large, \(n_{\mathrm{u}}=400\). The first and second columns display the case where the positive sample sizes are \(n_{\mathrm{p}}=5\) and \(n_{\mathrm{p}}=10\), respectively. The class-prior is fixed to \(\pi _{+}=0.5\), and we assumed that the class-prior is known. We visualize the true mean function of the positive and negative data distributions with blue and red lines, respectively. The positive data are represented by blue diamond points, and the unlabeled data are represented by gray points. The decision boundaries of the WMMD classifier tend to correctly separate the two clusters as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase.

In Experiments 2, 3, and 4, we evaluate: (i) the accuracy and area under the receiver operating characteristic curve (AUC) as \(n_{\mathrm{u}}\) and \(\pi _{+}\) change when the class-prior is known (Experiment 2) and unknown (Experiment 3); (ii) the elapsed training time (Experiment 4). In these experiments, we set up the underlying joint distribution as follows:

$$\begin{aligned} X\mid Y=y \sim N\left( y\frac{{\varvec{1}}_{2}}{\sqrt{2}}, I_{2}\right) , Y \sim 2 \times {\mathrm {Bern}} (\pi _{+}) -1, \end{aligned}$$
(9)

where \({\mathrm {Bern}}(p)\) is the Bernoulli distribution with mean p, \({\varvec{1}}_{2} = (1,1)^T\) is the 2-dimensional vector of ones, and \(I_{2}\) is the identity matrix of size 2.
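A minimal sketch of sampling from the joint distribution in Eq. (9) (the function name is ours):

```python
import numpy as np


def sample_gaussian_pu(n, pi_plus=0.5, seed=0):
    """Sample (X, Y) from Eq. (9): Y ~ 2 Bern(pi_plus) - 1 and
    X | Y = y ~ N(y * 1_2 / sqrt(2), I_2)."""
    rng = np.random.RandomState(seed)
    y = 2 * rng.binomial(1, pi_plus, size=n) - 1
    mu = y[:, None] * (np.ones(2) / np.sqrt(2))  # class-dependent mean
    X = mu + rng.randn(n, 2)                     # identity covariance
    return X, y
```

The positive sample for the experiments keeps the rows with y == 1, and the unlabeled sample is a fresh draw with the labels discarded.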

Experiment 2 In this experiment, we compare the accuracy and AUC of the five PU learning algorithms when the true class-prior \(\pi _{+}\) is known. Figure 2a, c show the accuracy and AUC for various \(n_{\mathrm{u}}\). The training sample size for the positive data is \(n_{\mathrm{p}}=100\) and the class-prior is \(\pi _{+}=0.5\). The unlabeled sample size varies from 40 to 500 in increments of 20. Training and test data are randomly generated 100 times. For reference, we add the 1 − Bayes risk (the accuracy of the Bayes classifier) for each unlabeled sample size. In terms of accuracy, the proposed WMMD tends to approach the 1 − Bayes risk as \(n_{\mathrm{u}}\) increases. Compared with the other PU learning algorithms, WMMD achieves higher accuracy for every \(n_{\mathrm{u}}\) and comparable or better AUC.
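Accuracy and AUC, as reported throughout Experiments 2 and 3, can be computed from any real-valued score function; the scikit-learn sketch below is illustrative (the `evaluate` helper and the zero cutoff stand in for the class-prior-dependent cutoff used in the paper):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluate(score, X_test, y_test, threshold=0.0):
    """Accuracy and AUC of a scoring rule on labeled test data.

    `score` returns larger values for points more likely to be
    positive; labels in y_test lie in {-1, +1}. Note that AUC
    depends only on the ranking induced by `score`, not on the
    cutoff -- the basis for the robustness argument in Experiment 3.
    """
    s = score(X_test)
    y_hat = np.where(s > threshold, 1, -1)
    return accuracy_score(y_test, y_hat), roc_auc_score(y_test, s)
```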

Figure 2b, d show a comparison of accuracy and AUC as \(\pi _{+}\) changes. The training sample sizes for the positive and unlabeled data are \(n_{\mathrm{p}}=100\) and \(n_{\mathrm{u}}=400\), respectively. The class-prior \(\pi _{+}\) varies from 0.05 to 0.95 in increments of 0.05. The test sample size is \(10^3\). Training and test data are repeatedly generated 100 times with different random seeds. In terms of accuracy, the proposed WMMD performs comparably with LOG and NNPU, showing advantages over DH and tADJ. When the true class-prior is less than or equal to 0.8, WMMD achieves higher AUC than all methods except tADJ. tADJ achieves the highest AUC because \(P(Y=1 \mid X=x)\) is proportional to \(P( \{ x \text { is from the positive dataset} \} \mid X=x)\). This empirically shows that WMMD has a discriminant ability comparable to the other algorithms over a wide range of class-priors.

Fig. 2
figure 2

The comparison of the accuracy and AUC of the five PU learning algorithms as each of \(n_{\text {u}}\) and \(\pi _{+}\) changes. The dashed curve represents the 1 − Bayes risk. The curve and the shaded region represent the average and the standard error, respectively, based on 100 replications

Fig. 3
figure 3

The comparison of the accuracy and AUC of the five PU learning algorithms as each of \(n_{\text {u}}\) and \(\pi _{+}\) changes under the situation where \({\pi }_{+}\) is unknown. The dashed curve represents the 1 − Bayes risk. The curve and the shaded region represent the average and the standard error, respectively, based on 100 replications. LOG, DH, and NNPU use the estimate of the class-prior from the ‘KM1’ method

Experiment 3 The main goal of this experiment is to show the robustness of the proposed classifier when the class-prior \({\pi }_{+}\) is unknown. In the PU learning literature, \(\pi _{+}\) has frequently been assumed to be known (Du Plessis et al. 2015; Niu et al. 2016; Kiryo et al. 2017; Kato et al. 2019). However, this assumption can be strong in real-world applications, and an accurate estimate of \(\pi _{+}\) is necessary to correctly execute existing PU learning algorithms. In this experiment, we compare the accuracy and AUC when the class-prior \({\pi }_{+}\) is unknown. For the WMMD classifier, we used a density-based method for class-prior estimation, which is obtained as a byproduct of the proposed algorithm; a description is given in “Appendix A”. The results of LOG, DH, and NNPU are given for completeness using the ‘KM1’ method of Ramaswamy et al. (2016). We take these estimates as true values and repeat the comparative numerical experiments of Experiment 2.

Since the objective functions of the LOG, DH, and NNPU algorithms depend on the estimate \({\hat{\pi }}_{+}\), we anticipate that both the accuracy and AUC rely on the quality of the estimation. In contrast, the tADJ algorithm does not depend on the class-prior, so its performance is not affected. Likewise, since the proposed score function does not depend on the class-prior \(\pi _{+}\), which is used only to determine a cutoff, the AUC of the proposed algorithm is less affected by the estimation of \(\pi _{+}\).

Figure 3a, c compare the accuracy and AUC as a function of \(n_{\mathrm{u}}\). In terms of accuracy, WMMD performs worse than LOG, DH, and NNPU, while its AUC is higher. Though tADJ shows poor accuracy over a wide range, it achieves high AUC comparable to WMMD. As anticipated, WMMD is more robust than LOG, DH, and NNPU in AUC, possibly because our score function \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }\) does not depend on \(\pi _{+}\). A similar trend can be found in Figure 3b, d. We note that the ‘KM1’ method is not scalable and thus may not be usable for large-scale datasets.

Experiment 4 In this experiment, we compare the elapsed training time, including hyperparameter optimization, of the five PU learning algorithms. The data are generated from the distributions described in Eq. (9), and we set \(n_{\mathrm{p}}=100\), \(n_{\mathrm{u}}=400\), and \(\pi _{+} = 0.5\). The elapsed time is measured on 20 Intel Xeon E5-2630 v4 @ 2.20 GHz CPU processors.

Table 1 compares the elapsed training time and its ratio relative to that of WMMD. WMMD takes the shortest time among the five methods. In particular, the training time for WMMD is at least 300 times shorter than that of the LOG and DH methods. This is because the WMMD classifier has an analytic form while the LOG and DH methods require solving a non-linear programming problem.

6.2 Real data analysis

We demonstrate the practical utility of the proposed algorithm using eight real binary classification datasets from the LIBSVM repository (Chang and Lin 2011). Since some observations in the raw datasets are not completely recorded, we removed them and constructed datasets of fully recorded observations. Next, to investigate the effect of varying \(\pi _{+}\), we artificially reconstructed \({\mathcal {X}}_{\mathrm{p}}\) and \({\mathcal {X}}_{\mathrm{u}}\) through random sampling from the fully recorded datasets. For the three datasets australian_scale, breast-cancer_scale, and skin_nonskin, we reconstructed the data so that the resulting class-prior \(\pi _{+}\) ranges from 0.15 to 0.79, and we append the suffix 2 to those datasets. We randomly resampled data 100 times for the seven small datasets and 10 times for the four big datasets: skin_nonskin, skin_nonskin2, epsilon_normalized, and HIGGS. Table 2 summarizes statistics for the eleven real datasets. We conduct two comparative numerical experiments, for the cases where \(\pi _{+}\) is known and unknown.

Table 1 A summary of elapsed training time and its ratio for the five PU learning algorithms based on 100 replications
Table 2 A summary of the eleven binary classification datasets

Table 3 shows the average and the standard error of the accuracy and AUC when the class-prior \(\pi _{+}\) is known. LOG and DH fail to compute the \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) Gram matrix due to out-of-memory errors under the 12 GB GPU memory limit. WMMD achieves accuracy and AUC comparable to or better than LOG, DH, and tADJ on most datasets. Compared to NNPU, WMMD performs comparably on the small datasets. However, NNPU achieves higher accuracy on skin_nonskin, epsilon_normalized, and HIGGS; the neural network used in NNPU fits the complicated, high-dimensional structure of these data well and thus shows high accuracy.

Table 3 Accuracy and AUC comparison using the real datasets when the class-prior \(\pi _{+}\) is known

Table 4 compares the average and the standard error of the accuracy and AUC when the class-prior \(\pi _{+}\) is unknown. As in Experiment 3 of Sect. 6.1, we estimate \(\pi _{+}\) using the ‘KM1’ method for LOG, DH, and NNPU, and using the density-based method for WMMD. The LOG, DH, and NNPU algorithms are implemented only on the seven small-scale datasets because the method of Ramaswamy et al. (2016) is not feasible for the large-scale datasets (Bekker and Davis 2018). Overall, WMMD shows performance comparable to or better than the other PU learning algorithms on most datasets. Compared to Table 3, WMMD and tADJ are robust to an unknown \(\pi _{+}\) in terms of AUC, because they do not require an estimate of \(\pi _{+}\) to construct their score functions. In contrast, the other methods require an estimate \({\hat{\pi }}_{+}\), and we observe a substantial drop in accuracy and AUC when the ‘KM1’ estimate is used.

Table 4 Accuracy and AUC comparison using the real datasets when the class-prior \(\pi _{+}\) is unknown

7 Concluding remarks

Existing methods use different objective functions and hypothesis spaces and, as a consequence, different optimization algorithms. Hence, there is no reason to expect one method to outperform the others uniformly across all scenarios. A particular method may excel in a particular scenario; for example, NNPU, proposed by Kiryo et al. (2017), would perform better in complicated data settings because of the expressive power of neural networks. However, the proposed method has a clear computational advantage due to its closed form, as well as theoretical strength in the form of an explicit excess risk bound. Further, the proposed method works reasonably well whether \(\pi _{+}\) is known or unknown. In this regard, we believe the proposed method can serve as a principled and easy-to-compute baseline algorithm for PU learning.