1 Introduction

Supervised binary classification assumes that every training sample is labeled as either positive or negative. However, in many practical scenarios, collecting a large number of labeled samples from the two categories is costly, difficult, or even impossible. In contrast, unlabeled data are relatively cheap and abundant. Consequently, semi-supervised learning is used to learn from partially labeled data (Chapelle et al. 2006). In this paper, as a special case of semi-supervised learning, we consider Positive-Unlabeled (PU) learning, the problem of building a binary classifier from only positive and unlabeled samples (Denis et al. 2005; Li and Liu 2005). PU learning provides a powerful framework when negative labels are impossible or very expensive to obtain, and thus has frequently appeared in many real-world applications. Examples include document classification (Elkan and Noto 2008; Xiao et al. 2011), image classification (Zuluaga et al. 2011; Gong et al. 2018), gene identification (Yang et al. 2012, 2014), and novelty detection (Blanchard et al. 2010; Zhang et al. 2017).

Several PU learning algorithms have been developed over the last two decades. Liu et al. (2002) and Li and Liu (2003) considered a two-step learning scheme: in Step 1, assigning negative labels to some unlabeled observations believed to be negative, and in Step 2, learning a binary classifier from the existing positive samples and the negatively labeled samples of Step 1. Liu et al. (2003) pointed out that the two-step learning scheme is based on heuristics, and suggested fitting a biased support vector machine by regarding all the unlabeled observations as negative.

Scott and Blanchard (2009) and Blanchard et al. (2010) suggested a modification of supervised Neyman–Pearson classification, whose goal is to find a classifier minimizing the false positive rate while keeping the false negative rate low. To circumvent the lack of negative samples, they instead sought a classifier minimizing the marginal probability of being classified as positive while keeping the false negative rate low. However, solving the empirical version of this constrained optimization problem is challenging, and the authors did not present an explicit algorithm.

Recently, many PU learning algorithms based on the empirical risk minimization principle have been studied. Du Plessis et al. (2014) proposed the use of the ramp loss and provided an algorithm that requires solving a non-convex optimization problem. Du Plessis et al. (2015) formulated a convex optimization problem using the logistic loss or the double hinge loss. However, all the aforementioned approaches involve solving a non-linear programming problem, which incurs a massive computational burden from computing the large Gram matrix when the sample size is large. Kiryo et al. (2017) suggested a stochastic algorithm with a non-negative risk estimator for large-scale datasets. However, the algorithm requires several hyperparameters, and choosing the optimal ones may demand substantial trial runs of the algorithm (Oh et al. 2018), causing heavy computation costs.

In supervised binary classification, Sriperumbudur et al. (2012) proposed a computationally efficient algorithm that builds a closed-form binary discriminant function. The authors showed that their function estimator, obtained by evaluating the negative of the empirical integral probability metric (IPM), is the minimizer of the empirical risk under the specific loss defined in Sect. 3.1. They further showed that a closed form can be derived by restricting the hypothesis space to a closed unit ball in a reproducing kernel Hilbert space (RKHS).

In this paper, capitalizing on the properties established for the supervised method of Sriperumbudur et al. (2012), we extend it to the PU learning setting. In addition, we derive new theoretical results on excess risk bounds. We first define a weighted version of the IPM between two probability measures and call it the weighted integral probability metric (WIPM). We show that computing the negative of the WIPM between the unlabeled data distribution and the positive data distribution is equivalent to minimizing the hinge risk. Based on this finding, we propose a binary discriminant function estimator that computes the negative of the empirical WIPM, and then derive associated upper bounds on the estimation error and the excess risk. Under a mild condition, our upper bounds are sharper than the existing ones because they rely on Talagrand’s inequality rather than McDiarmid’s inequality (Kiryo et al. 2017). Moreover, we pay special attention to the case where the hypothesis space is a closed ball in an RKHS and propose a closed-form classifier. We show that the associated excess risk bound has an explicit form that converges to zero as the sample sizes increase. To the best of our knowledge, this is the first result to explicitly show the excess risk bound in PU learning.

In summary, our main contributions are:

  • We formally define WIPM and establish a link with the infimum of the hinge risk (Theorem 1). We derive an estimation error bound and show that it is sharper than existing results (Theorem 2 and Proposition 1).

  • The proposed algorithm produces a closed-form classifier when the underlying hypothesis space is a closed ball in RKHS (Proposition 2). Furthermore, we obtain a novel excess risk bound that converges to zero as sample sizes increase (Theorem 3).

  • Numerical experiments using both synthetic and real datasets show that our method is comparable to or better than existing PU learning algorithms in terms of accuracy, scalability, and robustness in the case of unknown class-priors.

2 Preliminaries

In this section, we describe the L-risk for binary classification and present its PU representation. We briefly review several PU learning algorithms based on the L-risk minimization principle. We first introduce problem settings and notations.

2.1 Problem settings of PU learning

Let X and Y be random variables for input data and class labels, respectively, whose range is the product space \({\mathcal {X}} \times \{ \pm \, 1 \} \subseteq {\mathbb {R}}^{d} \times \{\pm \, 1 \}\), where d is a positive integer. We denote the joint distribution of \((X, Y)\) by \(P_{X,Y}\) and the marginal distribution of X by \(P_{X}\). The distributions of positive and negative data are defined by the conditional distributions \({P}_{X \mid Y=1}\) and \({P}_{X\mid Y=-1}\), respectively. Let \(\pi _{+} := P_{X,Y}(Y=1)\) be the marginal probability of being positive and set \(\pi _{-}= 1-\pi _{+}\). We follow the two-samples-of-data scheme (Ward et al. 2009; Niu et al. 2016). That is, let \({\mathcal {X}}_{\mathrm{p}} =\{x_i ^{\mathrm{p}} \}_{i=1} ^{n_{\mathrm{p}}}\) and \({\mathcal {X}}_{\mathrm{u}} =\{x_i ^{\mathrm{u}} \}_{i=1} ^{n_{\mathrm{u}}}\) be observed sets of independently and identically distributed samples from the positive data distribution \({P}_{X \mid Y=1}\) and the marginal distribution \({P}_X\), respectively. Here, \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) are the numbers of positive and unlabeled data points, respectively. Note that the unlabeled data distribution is the marginal distribution.

Let \({\mathcal {U}}\) be a class of real-valued measurable functions defined on \({\mathcal {X}}\). A function \(f \in {\mathcal {U}}\), often called a hypothesis, can be understood as a binary discriminant function and we classify an input x with the sign of a discriminant function, \({\mathrm{sign}}(f(x))\). Define \({\mathcal {M}} = \{f: {\mathcal {X}} \rightarrow {\mathbb {R}} \mid ||{f} ||_{\infty } \le 1 \} \subseteq {\mathcal {U}}\), where \(||{f} ||_{\infty } = \sup _{x \in {\mathcal {X}}} | f(x) |\) is the supremum norm. We restrict our attention to a class \({\mathcal {F}} \subseteq {\mathcal {M}}\) and call \({\mathcal {F}}\) a hypothesis space. Throughout this paper, we assume that the hypothesis space is symmetric, i.e., \(f \in {\mathcal {F}}\) implies \(-f \in {\mathcal {F}}\). In PU learning, the main goal is to construct a classifier \({\mathrm{sign}}(f(x))\) only from the positive dataset \({\mathcal {X}}_{\mathrm{p}}\) and the unlabeled dataset \({\mathcal {X}}_{\mathrm{u}}\) with \(f \in {\mathcal {F}}\).
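The two-samples-of-data scheme above can be sketched numerically. In the snippet below, the Gaussian class-conditional distributions, the class-prior value, and the sample sizes are all illustrative assumptions, not part of the paper's setup; the only structural point is that the unlabeled sample is drawn from the marginal mixture \(P_X = \pi _{+} P_{X \mid Y =1}+\pi _{-} P_{X \mid Y =-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)

pi_plus = 0.4          # class-prior P(Y = +1); illustrative value
n_p, n_u = 100, 400    # numbers of positive and unlabeled samples

# Positive sample: drawn i.i.d. from P_{X|Y=1} (a Gaussian here by assumption).
X_p = rng.normal(loc=1.0, scale=1.0, size=(n_p, 2))

# Unlabeled sample: drawn i.i.d. from the marginal mixture
# P_X = pi_+ * P_{X|Y=1} + pi_- * P_{X|Y=-1}; the latent labels are discarded.
is_pos = rng.random(n_u) < pi_plus
X_u = np.where(is_pos[:, None],
               rng.normal(1.0, 1.0, size=(n_u, 2)),
               rng.normal(-1.0, 1.0, size=(n_u, 2)))
```

Only `X_p` and `X_u` are available to a PU learner; the indicator `is_pos` exists solely inside the generator.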

In this paper, the quantity \(\pi _{+}\), often called the class-prior, is assumed to be known as in the literature (Kiryo et al. 2017; Kato et al. 2019) to focus on theoretical and practical benefits of our proposed algorithm. We examine the performance when \(\pi _{+}\) is unknown in Experiment 3 of Sects. 6.1 and 6.2.

2.2 L-risk minimization in PU learning

In supervised binary classification, the L-risk is defined by

$$\begin{aligned} R_L(f)&:= \int _{{\mathcal {X}} \times \{\pm \, 1\}} L(y, f(x)) dP_{X,Y}(x,y) \nonumber \\&= \pi _{+} \int _{{\mathcal {X}}} L(1, f(x)) dP_{X \mid Y=1}(x) + \pi _{-} \int _{{\mathcal {X}}} L(-\,1, f(x)) dP_{X \mid Y=-\,1}(x), \end{aligned}$$
(1)

for a loss function \(L : \{\pm \, 1\} \times {\mathbb {R}} \rightarrow {\mathbb {R}}\) (Steinwart and Christmann 2008, Section 2.1). If a loss function L(y, t) can be represented as a function of the margin yt, the product of a label y and a score t, for all \(y \in \{ \pm \, 1 \}\) and \(t \in {\mathbb {R}}\), we denote the corresponding margin-based loss function by \(\ell (yt) := L(y, t)\).

Under the PU learning framework, however, the right-hand side of Eq. (1) cannot be directly estimated due to the lack of negatively labeled observations. To circumvent this problem, many studies in the field of PU learning exploited the relationship \(P_X = \pi _{+} P_{X \mid Y =1}+\pi _{-} P_{X \mid Y =-1}\) and replaced \(P_{X \mid Y =-1}\) in Eq. (1) with \((P_X - \pi _{+} P_{X \mid Y =1})/\pi _{-}\) (Du Plessis et al. 2014; Sakai et al. 2017). That is, the L-risk can be alternatively expressed as:

$$\begin{aligned} R_L(f) = \int _{{\mathcal {X}}} L(-1, f(x)) dP_X(x) + \pi _{+} \int _{{\mathcal {X}}} L(1, f(x))- L(-1, f(x)) dP_{X \mid Y=1}(x). \end{aligned}$$
(2)

Now the right-hand side of Eq. (2) can be empirically estimated by the positive dataset \({\mathcal {X}}_{\mathrm{p}}\) and the unlabeled dataset \({\mathcal {X}}_{\mathrm{u}}\). However, the L-risk \(R_L (f)\) is not convex with respect to f in general, and minimizing an empirical estimator for \(R_L (f)\) is often formulated as a complicated non-convex optimization problem.
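As an illustration, the right-hand side of Eq. (2) can be estimated by replacing each integral with an empirical mean over the corresponding sample. The helper below is a minimal sketch under this substitution, not an algorithm from the paper; the toy inputs are arbitrary.

```python
import numpy as np

def empirical_pu_risk(f, L, X_p, X_u, pi_plus):
    """Empirical estimate of Eq. (2): the L-risk written using only
    positive and unlabeled data.
    f: discriminant function, L(y, t): loss, pi_plus: class-prior."""
    f_u, f_p = f(X_u), f(X_p)
    unlabeled_term = np.mean(L(-1, f_u))
    positive_term = pi_plus * np.mean(L(1, f_p) - L(-1, f_p))
    return unlabeled_term + positive_term

# Toy example with the hinge loss L(y, t) = max(0, 1 - y t).
hinge = lambda y, t: np.maximum(0.0, 1.0 - y * t)
X_p = np.array([1.0, 2.0])
X_u = np.array([-1.0, 0.0, 1.0])
risk = empirical_pu_risk(lambda x: x, hinge, X_p, X_u, pi_plus=0.5)  # -0.25
```

Note that on finite samples this estimate can be negative, the phenomenon that motivates the non-negative risk estimator of Kiryo et al. (2017).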

There have been several approaches to resolving the computational difficulty by modifying loss functions. Du Plessis et al. (2014) proposed to use non-convex loss functions satisfying the symmetric condition, \(L(1, f(x)) + L(-\,1, f(x))=1\). They proposed to optimize the empirical risk based on the ramp loss \(\ell _{\mathrm {ramp}}(yt)=0.5 \max ( 0, \min (2, 1-yt))\) via the concave-convex procedure (Collobert et al. 2006). Du Plessis et al. (2015) converted the problem to convex optimization through the linear-odd condition, \(L(1, f(x)) - L(-\,1, f(x)) = -\,f(x)\). They showed that the logistic loss \(\ell _{\mathrm {log}}(yt) = \log (1+\exp (-yt))\) and the double hinge loss \(\ell _{\mathrm {dh}}(yt) = \max (0, \max (-\,yt, (1-yt)/2))\) satisfy the linear-odd condition. However, all the aforementioned methods use a weighted sum of \(n_{\mathrm{p}} + n_{\mathrm{u}}\) predefined basis functions as a binary discriminant function, which requires computing the \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) Gram matrix. Hence, these algorithms are not scalable and can be intractable when \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) are large (Sansone et al. 2019). Our first goal in this paper is to overcome this computational problem by providing a computationally efficient method.
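The symmetric and linear-odd conditions above can be checked numerically for the three margin losses. This is a quick verification sketch, not code from the paper; for a margin loss, \(L(1,t)=\ell (t)\) and \(L(-1,t)=\ell (-t)\).

```python
import numpy as np

def ramp(m):      return 0.5 * np.maximum(0.0, np.minimum(2.0, 1.0 - m))
def logistic(m):  return np.log1p(np.exp(-m))
def dhinge(m):    return np.maximum(0.0, np.maximum(-m, (1.0 - m) / 2.0))

t = np.linspace(-3.0, 3.0, 1001)

# Symmetric condition for the ramp loss: L(1, t) + L(-1, t) = 1.
assert np.allclose(ramp(t) + ramp(-t), 1.0)

# Linear-odd condition: L(1, t) - L(-1, t) = -t,
# satisfied by the logistic and double hinge losses.
assert np.allclose(logistic(t) - logistic(-t), -t)
assert np.allclose(dhinge(t) - dhinge(-t), -t)
```

Under the linear-odd condition, the bracketed integrand in Eq. (2) collapses to \(-f(x)\), which is why these losses yield convex PU objectives.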

3 Weighted integral probability metric and L-risk

In this section, we formally define WIPM, a key tool for constructing the proposed algorithm, and build a link with the L-risk in Theorem 1 below. Based on the link, we propose a new binary discriminant function estimator and present its theoretical properties in Theorem 2. We first introduce the earlier work by Sriperumbudur et al. (2012) that provided a closed-form classifier in supervised binary classification.

3.1 Relation between IPM and L-risk in supervised binary classification

Müller (1997) introduced an IPM for any two probability measures P and Q defined on \({\mathcal {X}}\) and a class \({\mathcal {F}}\) of bounded measurable functions, given by

$$\begin{aligned} {\mathrm{IPM}}({P}, {Q}; {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - \int _{{\mathcal {X}}} f(x) d{Q}(x) \right| . \end{aligned}$$

IPM has been studied as either a metric between two probability measures (Sriperumbudur et al. 2010a; Arjovsky et al. 2017; Tolstikhin et al. 2018) or a hypothesis testing tool (Gretton et al. 2012).

Under the supervised binary classification setting, Sriperumbudur et al. (2012) showed that calculating the IPM between \(P_{X \mid Y=1}\) and \(P_{X \mid Y=-\,1}\) is negatively related to minimizing the risk with a specific loss function, i.e., \({\mathrm{IPM}}(P_{X\mid Y=1}, P_{X\mid Y=-\,1}; {\mathcal {F}})=-\,\inf _{f \in {\mathcal {F}}} R_{L_{\mathrm {c}}}(f)\), where \(L_{\mathrm {c}}(1, t)= -\,t/\pi _{+}\) and \(L_{\mathrm {c}}(-\,1, t)= t/\pi _{-}\) for all \(t \in {\mathbb {R}}\). They further showed that a discriminant function minimizing the \(L_{\mathrm {c}}\)-risk can be obtained analytically when \({\mathcal {F}}\) is a closed unit ball in an RKHS. This result cannot be directly extended to PU learning due to the absence of negatively labeled observations. In the next subsection, we define a generalized version of the IPM and extend the previous results for supervised binary classification to PU learning.

3.2 Extension to WIPM and L-risk in PU learning

Let \({\mathcal {F}}\) be a given class of bounded measurable functions and let \({\tilde{w}}: {\mathcal {X}} \rightarrow {\mathbb {R}}\) be a weight function such that \(||{{\tilde{w}}} ||_{\infty } < \infty \). We define the WIPM between two probability measures P and Q with a function class \({\mathcal {F}}\) and a weight function \({\tilde{w}}\) by

$$\begin{aligned} {\mathrm{WIPM}}({P}, {Q}; {\tilde{w}}, {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - \int _{{\mathcal {X}}} {\tilde{w}}(x) f(x) d{Q}(x) \right| . \end{aligned}$$
(3)

Note that WIPM reduces to IPM if \({\tilde{w}}(x)=1\) for all \(x \in {\mathcal {X}}\). Other special cases of Eq. (3) have been discussed in many applications. In the covariate shift problem, Huang et al. (2007) and Gretton et al. (2009) proposed to minimize WIPM with respect to \({\tilde{w}}\) when \({\mathcal {F}}\) is the unit ball in an RKHS and P and Q are the empirical distributions of test and training data, respectively. In unsupervised domain adaptation, Yan et al. (2017) regarded P and Q as the empirical distributions of target and source data, respectively, in which case \({\tilde{w}}\) is a ratio of two class-prior distributions.

We pay special attention to the case where \({\tilde{w}}(x)\) is a constant \(w \in {\mathbb {R}}\) for every input value, and denote the corresponding WIPM by \({\mathrm{WIPM}}({P}, {Q}; w, {\mathcal {F}})\),

$$\begin{aligned} {\mathrm{WIPM}}({P}, {Q}; w, {\mathcal {F}}) := \sup _{f \in {\mathcal {F}}} \left| \int _{{\mathcal {X}}} f(x) d{P} (x) - w \int _{{\mathcal {X}}} f(x) d{Q}(x) \right| . \end{aligned}$$

In the following theorem, we establish a link between \({\mathrm{WIPM}}(P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}})\) and the infimum of the \(\ell _{\mathrm {h}}\)-risk over \({\mathcal {F}}\) for the hinge loss \(\ell _{\mathrm {h}}(yt) = \max (0, 1-yt)\).

Theorem 1

(Relationship between \(\ell _{\mathrm {h}}\)-risk and WIPM) Let \({\mathcal {F}}\) be a symmetric hypothesis space in \({\mathcal {M}}\) and \(\ell _{\mathrm {h}}(yt) = \max (0, 1-yt)\) be the hinge loss. Then, we have

$$\begin{aligned} \inf _{f \in {\mathcal {F}}} R_{\ell _{\mathrm {h}}} (f) = 1 - {\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}). \end{aligned}$$

Moreover, if \(g_{{\mathcal {F}}}\) satisfies

$$\begin{aligned} {\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}) = \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_X(x) - 2\pi _{+} \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_{X \mid Y=1}(x), \end{aligned}$$

then \(\inf _{f \in {\mathcal {F}}} R_{\ell _{\mathrm {h}}}(f) = R_{\ell _{\mathrm {h}}} (-\,g_{{\mathcal {F}}})\).

Theorem 1 shows that, up to an additive constant, the infimum of the \(\ell _{\mathrm {h}}\)-risk over a hypothesis space \({\mathcal {F}}\) equals the negative WIPM between the unlabeled data distribution \(P_X\) and the positive data distribution \(P_{X \mid Y=1}\) with the same hypothesis space \({\mathcal {F}}\) and the weight \(2\pi _{+}\). Furthermore, by negating the WIPM optimizer \(g_{{\mathcal {F}}}\), we obtain the minimizer of the \(\ell _{\mathrm {h}}\)-risk over the hypothesis space \({\mathcal {F}}\). Here, we define a WIPM optimizer \(g_{{\mathcal {F}}}\) as a function that attains the supremum, i.e., \({\mathrm{WIPM}} (P_{X}, P_{X \mid Y = 1} ; 2\pi _{+}, {\mathcal {F}}) = \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_X(x) - 2\pi _{+} \int _{{\mathcal {X}}} g_{{\mathcal {F}}}(x) dP_{X \mid Y=1}(x)\), and we set \(f_{{\mathcal {F}}} = -\,g_{{\mathcal {F}}}\) for later notational convenience. Sriperumbudur et al. (2012) derived a result similar to Theorem 1 by showing \({\mathrm{IPM}}(P_{X\mid Y=1}, P_{X\mid Y=-1}; {\mathcal {F}}) = -\,\inf _{f \in {\mathcal {F}}} R_{L_{\mathrm {c}}}(f)\) in supervised binary classification. However, as mentioned in Sect. 3.1, their method is only applicable to supervised binary classification settings.
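Theorem 1 can be verified numerically in a discrete toy setting where both sides of the identity are computable exactly over a finite symmetric hypothesis class. All distributions and grid values below are illustrative assumptions.

```python
import numpy as np
from itertools import product

# Toy setting on X = {0, 1}; all numbers are illustrative.
pi_plus = 0.3
p1  = np.array([0.2, 0.8])                   # P_{X|Y=+1}
pm1 = np.array([0.8, 0.2])                   # P_{X|Y=-1}
px  = pi_plus * p1 + (1 - pi_plus) * pm1     # marginal P_X

hinge = lambda m: np.maximum(0.0, 1.0 - m)

# A finite symmetric hypothesis class inside M (sup-norm <= 1):
# each f is the vector (f(0), f(1)) on a symmetric grid.
grid = np.array(list(product(np.linspace(-1.0, 1.0, 9), repeat=2)))

def hinge_risk(f):
    # Supervised form of Eq. (1) with the hinge loss:
    # pi_+ E_{P_{X|1}} l_h(f) + pi_- E_{P_{X|-1}} l_h(-f).
    return pi_plus * (p1 @ hinge(f)) + (1 - pi_plus) * (pm1 @ hinge(-f))

inf_risk = min(hinge_risk(f) for f in grid)
wipm = np.abs(grid @ px - 2 * pi_plus * (grid @ p1)).max()
# Theorem 1: inf_F R_h(f) = 1 - WIPM(P_X, P_{X|Y=1}; 2*pi_+, F).
```

Since every f in the class satisfies \(|f| \le 1\), the hinge risk is affine in f and the identity holds exactly, not just approximately.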

3.3 Theoretical properties of empirical WIPM optimizer

We denote the empirical distributions of \({P}_{X \mid Y=1}\) and \({P}_{X}\) by \({P}_{X \mid Y=1, n_{\mathrm{p}}}\) and \({P}_{X, n_{\mathrm{u}}}\), respectively. Let \({P}_{X \mid Y=1, n_{\mathrm{p}}} = n_{\mathrm{p}} ^{-1} \sum _{i=1} ^{n_{\mathrm{p}}} \delta _{x_i ^{\mathrm{p}} }\) and \({P}_{X, n_{\mathrm{u}}} = n_{\mathrm{u}} ^{-1} \sum _{i=1} ^{n_{\mathrm{u}}} \delta _{x_{i} ^{\mathrm{u}} }\), where \(\delta (\cdot )\) defined on \({\mathcal {X}}\) is the Dirac delta function and \(\delta _x (\cdot ) := \delta (\cdot -x)\) for \(x \in {\mathcal {X}}\). The empirical Rademacher complexity of \({\mathcal {F}}\) given a set \(S= \{z_1, \dots , z_{m}\} \) is defined by \({\mathfrak {R}}_{S}({\mathcal {F}} ) := {\mathbb {E}}_{\sigma } \left( \frac{1}{m} \sup _{f \in {\mathcal {F}}} \left| \sum _{i=1} ^{m} \sigma _i f(z_i) \right| \right) \). Here, \(\{\sigma _i\}_{i=1} ^{m}\) is a set of independent Rademacher random variables taking 1 or \(-\,1\) with probability 0.5 each, and \({\mathbb {E}}_{\sigma }(\cdot )\) is the expectation operator over the Rademacher random variables (Bartlett and Mendelson 2002). Denote the maximum by \(a \vee b := \max (a,b)\) and the minimum by \( a \wedge b := \min (a,b)\). For a probability measure Q defined on \({\mathcal {X}}\), denote the expectation of a discriminant function f by \({\mathbb {E}}_{Q}(f) := \int _{{\mathcal {X}}} f(x) d{Q}(x)\) and the variance by \({\mathrm {Var}}_{{Q}} (f) := {\mathbb {E}}_{Q}(f^2) - ({\mathbb {E}}_{Q}(f))^2\).
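The empirical Rademacher complexity above has no closed form for general classes, but for a finite class it can be approximated by Monte Carlo over the sign draws. The sketch below is purely illustrative; `F_values[j, i]` holds the value of the j-th function at the i-th point.

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite function class: E_sigma (1/m) sup_f |sum_i sigma_i f(z_i)|.
    F_values: array of shape (n_functions, m)."""
    rng = np.random.default_rng(seed)
    m = F_values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.abs(F_values @ sigma).max() / m
    return total / n_draws
```

For functions bounded by 1 in sup-norm, the estimate lies in [0, 1], and it shrinks as m grows, consistent with the complexity terms in the bounds that follow.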

The empirical estimator for \({\mathrm{WIPM}}({P}_{X}, {P}_{X \mid Y=1}; w, {\mathcal {F}})\) is given by plugging in the empirical distributions,

$$\begin{aligned}&{\mathrm{WIPM}}({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, {\mathcal {F}}) = \sup _{f \in {\mathcal {F}}} \left| \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} f( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} f( x_i ^{\mathrm{p}}) \right| , \end{aligned}$$

and we define an empirical WIPM optimizer \({\hat{g}}_{{\mathcal {F}}} \in {\mathcal {F}}\) that satisfies the following equation,

$$\begin{aligned} {\mathrm{WIPM}}({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, {\mathcal {F}}) = \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} {\hat{g}}_{{\mathcal {F}}} ( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} {\hat{g}}_{{\mathcal {F}}} ( x_i ^{\mathrm{p}}). \end{aligned}$$
(4)

We set \({\hat{f}}_{{\mathcal {F}}}= -{\hat{g}}_{{\mathcal {F}}}\) for notational convenience as in Sect. 3.2.
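For intuition, Eq. (4) can be evaluated exactly when the hypothesis class is a small finite symmetric family; by symmetry, the absolute value in the supremum can be dropped after also trying each negated function. This toy sketch is illustrative only and is not the paper's RKHS-based algorithm of Sect. 4.

```python
import numpy as np

def empirical_wipm(fs, X_u, X_p, w):
    """Empirical WIPM over a finite symmetric hypothesis class.
    fs: list of callables; symmetry is enforced by also trying -f.
    Returns the supremum value and an attaining function g-hat (Eq. (4))."""
    best_val, best_g = -np.inf, None
    for f in fs:
        for g in (f, lambda x, f=f: -f(x)):
            val = np.mean(g(X_u)) - w * np.mean(g(X_p))
            if val > best_val:
                best_val, best_g = val, g
    return best_val, best_g

# Toy example: one bounded function together with its negation.
val, g_hat = empirical_wipm([lambda x: np.clip(x, -1.0, 1.0)],
                            X_u=np.array([-1.0, 1.0]),
                            X_p=np.array([1.0]),
                            w=0.8)
f_hat = lambda x: -g_hat(x)   # the discriminant estimator is the negation
```

Here the supremum (0.8) is attained by the negated function, so `f_hat` recovers the original one, mirroring the relation \({\hat{f}}_{{\mathcal {F}}}= -{\hat{g}}_{{\mathcal {F}}}\).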

We analyze the estimation error \(R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)\) in the following theorem. To begin, let \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (w)=w/\sqrt{n_{\mathrm{p}}} + 1/\sqrt{n_{\mathrm{u}}}\) and \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (w) = 2(w/n_{\mathrm{p}} + 1/n_{\mathrm{u}})\).

Theorem 2

(Estimation error bound for general function space) Let \({\hat{g}}_{{\mathcal {F}}}\) be an empirical WIPM optimizer defined in Eq. (4) and set \({\hat{f}}_{{\mathcal {F}}} = -\,{\hat{g}}_{{\mathcal {F}}}\). Let \({\mathcal {F}}\) be a symmetric hypothesis space such that \(||{f} ||_{\infty } \le \nu \le 1\), \({\mathrm {Var}}_{{P}_{X \mid Y=1}} (f) \le \sigma _{X \mid Y=1} ^2\), and \({\mathrm {Var}}_{{P}_{X}} (f) \le \sigma _{X} ^2\). Denote \(\rho ^2 = \sigma _{X \mid Y=1} ^2 \vee \sigma _{X} ^2\). Then, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)&\le C_{\alpha } ({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2 \pi _{+} {\mathbb {E}}_{P_{X \mid Y =1 } ^{n_{\mathrm{p}}} }( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \nonumber \\&\quad +\, C_{\tau , \rho ^2}^{(1)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}), \end{aligned}$$
(5)

where \(C_{\alpha }=4(1+\alpha )\), \(C_{\tau , \rho ^2}^{(1)} =2\sqrt{ 2 \tau \rho ^2 }\), \(C_{\tau , \nu , \alpha }^{(2)} = 2\tau \nu \left( \frac{2}{3} + \frac{1}{\alpha } \right) \).

Due to Talagrand’s inequality, Theorem 2 provides a sharper bound than the existing result based on McDiarmid’s inequality. Specifically, Kiryo et al. (2017, Theorem 4) utilized McDiarmid’s inequality and showed that for \(\tau >0\) and some \(\varDelta >0\) the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{\mathrm {h}}}({\hat{f}}) - \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f)&\le 8 ( {\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2\pi _{+} {\mathbb {E}}_{P_{X \mid Y =1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \nonumber \\&\quad +\, \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) (1 + \nu ) \sqrt{2\tau } + \varDelta . \end{aligned}$$
(6)

The following proposition shows that the proposed upper bound (5) is sharper than the upper bound (6) under a certain condition.

Proposition 1

With the notations defined in Theorem 2, suppose that the following holds,

$$\begin{aligned} \frac{1+\nu }{2} - \frac{ 5 \sqrt{2 \tau } \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)}(2\pi _{+}) \nu }{ 6 \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+})} \ge \rho . \end{aligned}$$
(7)

Then, the proposed upper bound (5) is sharper than the previous result (6) proposed by Kiryo et al. (2017).

It is noteworthy that the second term on the left-hand side of (7) converges to zero as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase because \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) = O_{ {P}_{X \mid Y=1}, {P}_{X}} ( (n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1/2} )\) and \(\chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)}(2\pi _{+}) = O_{ {P}_{X \mid Y=1}, {P}_{X}} ( (n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1} )\). Since \((1+\nu )/2 \ge \nu \ge \rho \), the condition (7) is quite reasonable if the upper bounds on the variances, \(\sigma _{X} ^2\) and \(\sigma _{X \mid Y=1} ^2\), are sufficiently small.

In binary classification, one ultimate goal is to find a classifier minimizing the misclassification error, or equivalently, the excess risk. Bartlett et al. (2006) showed that there is an invertible function \(\psi : [-\,1,1] \rightarrow [0,\infty )\) such that the excess risk \(R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)\) is bounded above by \(\psi ^{-1} (R_{\ell } ({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell }(f) )\) if the margin-based loss \(\ell \) is classification-calibrated. In particular, Zhang (2004) showed that the excess risk is bounded above by the excess \(\ell _{\mathrm {h}}\)-risk, i.e., \(R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f) \le R_{\ell _{\mathrm {h}}} ({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f)\). This implies that an excess risk bound can be obtained by analyzing the excess \(\ell _{\mathrm {h}}\)-risk bound with Theorem 2. The following corollary provides the excess risk bound.

Corollary 1

(Excess risk bound for general function space) With the notations defined in Theorem 2, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} R_{\ell _{01}}({\hat{f}}_{{\mathcal {F}}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)&\le \inf _{f \in {\mathcal {F}} } R_{\ell _{\mathrm {h}}} (f) -\inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f) \\&\quad + \, C_{\alpha } ({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) ) + 2 \pi _{+} {\mathbb {E}}_{P_{X \mid Y =1 } ^{n_{\mathrm{p}}} }( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) ) ) \\&\quad +\, C_{\tau , \rho ^2}^{(1)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}). \end{aligned}$$

4 WIPM optimizer with reproducing kernel Hilbert space

In this section, we provide a computationally efficient PU learning algorithm that builds an analytic classifier when the hypothesis space is a closed ball in an RKHS. In addition, unlike the excess risk bound in Corollary 1, we explicitly derive a bound that converges to zero as the sample sizes \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase.

4.1 An analytic classifier via WMMD optimizer

We assume that \({\mathcal {X}} \subseteq [0,1]^{d}\) is compact. Let \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}}\) be a reproducing kernel defined on \({\mathcal {X}}\) and \({\mathcal {H}}_k\) be the associated RKHS with the inner product \(\langle \cdot , \cdot \rangle _{{\mathcal {H}}_k} : {\mathcal {H}}_k \times {\mathcal {H}}_k \rightarrow {\mathbb {R}}\). We denote the induced norm by \(||{ \cdot } ||_{{\mathcal {H}}_k}\). Denote the closed ball in the RKHS \({\mathcal {H}}_k\) with radius \(r>0\) by \({\mathcal {H}}_{k,r} = \{f: ||{f} ||_{{\mathcal {H}}_k} \le r \}\). We define the weighted maximum mean discrepancy (WMMD) between two probability measures P and Q with a weight w and a closed ball \({\mathcal {H}}_{k,r}\) by \({\mathrm{WMMD}}_k ({P}, {Q}; w, r) := {\mathrm{WIPM}} ({P}, {Q};w, {\mathcal {H}}_{k,r})\). The name WMMD comes from the maximum mean discrepancy (MMD), a popular example of the IPM whose function space is the unit ball \({{\mathcal {H}}}_{k,1}\), i.e., \(\mathrm{MMD}_k ({P}, {Q}) := {\mathrm{IPM}} ({P}, {Q};{\mathcal {H}}_{k,1})\) (Sriperumbudur et al. 2010a, b). As defined in Eq. (4), let \({\hat{g}}_{{\mathcal {H}}_{k,r}} \in {\mathcal {H}}_{k,r}\) be the empirical WMMD optimizer such that

$$\begin{aligned} {\mathrm{WMMD}}_k ({P}_{X, n_{\mathrm{u}}}, {P}_{X \mid Y=1, n_{\mathrm{p}}}; w, r) = \frac{1}{n_{\mathrm{u}}} \sum _{i=1} ^{n_{\mathrm{u}}} {\hat{g}}_{{\mathcal {H}}_{k,r}} ( x_i ^{\mathrm{u}}) - \frac{w}{n_{\mathrm{p}}} \sum _{i=1} ^{n_{\mathrm{p}}} {\hat{g}}_{{\mathcal {H}}_{k,r}} ( x_i ^{\mathrm{p}}). \end{aligned}$$

In addition, we set \({\hat{f}}_{{\mathcal {H}}_{k,r}} = -\,{\hat{g}}_{{\mathcal {H}}_{k,r}}\), which yields the classification rule \({\mathrm{sign}}({\hat{f}}_{{\mathcal {H}}_{k,r}}(z))\). In the following proposition, we show that this classification rule has an analytic expression by exploiting the reproducing property \(f(x) = \langle f, k(\cdot , x) \rangle _{{\mathcal {H}}_k}\) and the Cauchy-Schwarz inequality.

Proposition 2

Let \(k: {\mathcal {X}} \times {\mathcal {X}} \rightarrow {\mathbb {R}} \) be a bounded reproducing kernel. Then, the classification rule has a closed-form expression given by

$$\begin{aligned} {\mathrm{sign}} ({\hat{f}}_{{\mathcal {H}}_{k,r}}(z) ) = {\left\{ \begin{array}{ll} +\,1 &{}\quad {\mathrm{if}} \quad (2\pi _{+})^{-1} < {\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z), \\ -\,1 &{}\quad otherwise, \end{array}\right. } \end{aligned}$$
(8)

where

$$\begin{aligned} {\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z) = \frac{ n_{\mathrm{p}} ^{-1} \sum _{i=1 } ^{n_{\mathrm{p}}} k(z, x_i ^{\mathrm{p}})}{ n_{\mathrm{u}} ^{-1} \sum _{i=1 } ^{n_{\mathrm{u}}} k(z, x_i ^{\mathrm{u}})}. \end{aligned}$$

We call the classifier defined in Eq. (8) the WMMD classifier and the score \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }(z)\) the WMMD score for z. One strength of the WMMD classifier is that the classification rule has a closed-form expression, resulting in computational efficiency. Furthermore, the WMMD score \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }\) is independent of the class-prior \(\pi _{+}\), and thus we can obtain the score function without prior knowledge of the class-prior.
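The closed form of Proposition 2 can be implemented in a few lines. Below is a minimal sketch of the WMMD score and classifier using the Gaussian kernel \(k(x,y)=\exp (-||x-y||_2^2/2)\) that appears later in Theorem 3; the toy data in the test are arbitrary.

```python
import numpy as np

def gaussian_kernel(a, b):
    """k(x, y) = exp(-||x - y||^2 / 2) for rows of a (n, d) and b (m, d)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / 2.0)

def wmmd_score(z, X_p, X_u):
    """WMMD score: ratio of the mean kernel evaluation against the
    positive sample to that against the unlabeled sample (Proposition 2)."""
    return gaussian_kernel(z, X_p).mean(axis=1) / gaussian_kernel(z, X_u).mean(axis=1)

def wmmd_classify(z, X_p, X_u, pi_plus):
    """Closed-form WMMD classifier of Eq. (8):
    predict +1 iff the score exceeds 1 / (2 * pi_plus)."""
    return np.where(wmmd_score(z, X_p, X_u) > 1.0 / (2.0 * pi_plus), 1, -1)
```

As noted above, only the threshold depends on the class-prior: `wmmd_score` can be computed once, and different values of \(\pi _{+}\) merely shift the decision cutoff.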

4.2 Explicit excess risk bound of WMMD classifier

Since the empirical WMMD optimizer \({\hat{g}}_{{\mathcal {H}}_{k,r}}\) is a special case of the empirical WIPM optimizer, Corollary 1 yields an excess risk bound. However, without knowing the convergence rates of the Rademacher complexities, \({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) )\) and \({\mathbb {E}}_{P_{X \mid Y=1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}({\mathcal {F}}) )\), and the approximation error, the consistency of the classifier remains unclear. In this subsection, we establish an explicit excess risk bound that vanishes as the sample sizes increase. We first derive an explicit estimation error bound in the following lemma.

Lemma 1

(Explicit estimation error bound) With the notations defined in Theorem 2, assume that a reproducing kernel k defined on a compact space \({\mathcal {X}}\) is bounded. Let \(r_1^{-1}=\sup _{x \in {\mathcal {X}}} \sqrt{k(x,x)}\). Then, we have \({\mathcal {H}}_{k, r_1} \subseteq {\mathcal {M}}\). Moreover, for all \(\alpha , \tau >0\), the following holds with probability at least \(1-e^{-\tau }\),

$$\begin{aligned} {R_{\ell _{\mathrm {h}}} ({\hat{f}}_{{\mathcal {H}}_{k,r_1}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}} (f)} \le ( C_{\alpha } + C_{\tau , \rho ^2}^{(1)} ) \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}). \end{aligned}$$

While the bound in Theorem 2 is expressed in terms of \({\mathbb {E}}_{P_{X} ^{n_{\mathrm{u}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{u}}}({\mathcal {F}}) )\) and \({\mathbb {E}}_{P_{X \mid Y=1} ^{n_{\mathrm{p}}} } ( {\mathfrak {R}}_{{\mathcal {X}}_{\mathrm{p}}}( {\mathcal {F}}) )\), these quantities are evaluated in terms of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) in the upper bound of Lemma 1, giving an explicit estimation error bound with an \(O((n_{\mathrm{p}} \wedge n_{\mathrm{u}})^{-1/2})\) convergence rate. The key idea is to use the reproducing property \(f(x) = \langle f, k(\cdot , x) \rangle _{{\mathcal {H}}_k}\) and the Cauchy-Schwarz inequality to obtain an upper bound for the Rademacher complexity.

In the following lemma, we elaborate on the approximation error bound. To begin, for any \(0< \beta \le 1\), let \(\beta {\mathcal {M}} := \{ \beta f : f \in {\mathcal {M}} \}\). Set \(f_1^* (x) = {\mathrm{sign}} ( P(Y=1 \mid X=x) - \frac{1}{2} )\).

Lemma 2

(Approximation error bound over uniformly bounded hypothesis space) With the notations defined in Lemma 1, we have

$$\begin{aligned} \inf _{f \in {\mathcal {H}}_{k,r_1} } {R_{\ell _{\mathrm {h}}} ( f ) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}} (f)} \le \beta \inf _{g \in {\mathcal {H}}_{k,{r_1 /\beta }}} ||{ g- f_1^*} ||_{L_2(P_X)}, \end{aligned}$$

for any \(0< \beta \le 1\).

When \(\beta =1\), Lemma 2 implies that the approximation error \( \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}} ( f ) - \inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}} (f) \) is bounded above by \( \inf _{g \in {\mathcal {H}}_{k,{r_1}}} ||{ g- f_1^*} ||_{L_2(P_X)}\) because \(\inf _{f \in {\mathcal {U}} } R_{\ell _{\mathrm {h}}}(f) = \inf _{f \in {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f)\) (Lin 2002). Hence, a naive substitution into Corollary 1 would give a sub-optimal bound because \(\inf _{g \in {\mathcal {H}}_{k,{r_1}}} ||{ g- f_1^*} ||_{L_2(P_X)}\) is non-zero in general. In the following theorem, we rigorously establish an explicit excess risk bound that vanishes as \(n_{\mathrm{p}}\) and \( n_{\mathrm{u}}\) increase. The proof and the conditions (C1) and (C2) are given in “Appendix A”.

Theorem 3

(Explicit excess risk bound) With the notations defined in Theorem 2 and in Lemma 1, assume that the Gaussian kernel \(k(x,y) = \exp (-||{x-y} ||_2 ^2 / 2)\) is used. Then, \(r_1 =1\). Furthermore, under the conditions (C1) and (C2), for all \(0<s<1\), \(\alpha >0\), and \(\tau >0\), the following holds with probability at least \(1-e^{-\tau }\):

$$\begin{aligned}&R_{\ell _{01}}({\hat{f}}_{{\mathcal {H}}_{k,1}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f) \\&\quad \le ( C_{\alpha } + C_{\tau , \rho ^2}^{(1)} ) \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) ^{1-s} + C_{\tau , \nu , \alpha }^{(2)} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) ^{-s} \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(2)} (2\pi _{+}) \\&\qquad + C_d ||{\frac{dP_{X}}{d\lambda }} ||_{L_2 (\lambda )} ||{f_1 ^*} ||_{1/4,2} \{-\, s \ln \chi _{n_{\mathrm{p}}, n_{\mathrm{u}}} ^{(1)} (2\pi _{+}) \}^{-1/16}, \end{aligned}$$

where

$$\begin{aligned} C_{d}=\left\{ \left( \frac{\sqrt{2}\pi }{\log 2}\sqrt{d}\right) ^{1/8}+64\sqrt{d}\left( \frac{8}{\pi }\right) ^{d}\right\} . \end{aligned}$$

Compared to Corollary 1, the excess risk bound in Theorem 3 has an explicit form and thus obviates the approximation error term. In addition, this bound converges to zero as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase, with convergence rate \(O( \{\ln (n_{\mathrm{p}} \wedge n_{\mathrm{u}})\}^{-1/16} )\). To derive the excess risk bound, we first note that the misclassification error is determined by the sign of the discriminant function alone, i.e., \(\inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)= \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f)\) for any \(0< \beta \le 1\). Next, we show that \(R_{\ell _{01}}(g) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f) \le \{ R_{\ell _{\mathrm {h}}} (g) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \}/ \beta \) by modifying an argument of Bartlett et al. (2006). We then bound \(\{ R_{\ell _{\mathrm {h}}}({\hat{f}}_{{\mathcal {H}}_{k,r_1}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) \} /\beta \) using Lemma 1, and \( \{ \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \} /\beta \) using Lemma 2 together with a result of Smale and Zhou (2003, Proposition 1.1), both in terms of \(\beta \). A carefully chosen \(\beta \) then yields the explicit excess risk bound.
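The chain of reductions just described can be summarized in a single decomposition. For any \(0 < \beta \le 1\), writing \({\hat{f}} = {\hat{f}}_{{\mathcal {H}}_{k,r_1}}\),

$$\begin{aligned} R_{\ell _{01}}({\hat{f}}) - \inf _{f \in {\mathcal {U}} } R_{\ell _{01}}(f)&= R_{\ell _{01}}({\hat{f}}) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{01}}(f) \\&\le \frac{1}{\beta } \Big \{ R_{\ell _{\mathrm {h}}}({\hat{f}}) - \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) \Big \} + \frac{1}{\beta } \Big \{ \inf _{f \in {\mathcal {H}}_{k,r_1} } R_{\ell _{\mathrm {h}}}(f) - \inf _{f \in \beta {\mathcal {M}} } R_{\ell _{\mathrm {h}}}(f) \Big \}, \end{aligned}$$

where the first term on the right-hand side is controlled by Lemma 1 and the second by Lemma 2 together with Smale and Zhou (2003, Proposition 1.1); optimizing over \(\beta \) gives Theorem 3.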

Niu et al. (2016) provided an excess risk bound expressed as a function of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\). However, their bound includes combined terms of the approximation error and the Rademacher complexity, as in Corollary 1. To the best of our knowledge, we are the first in PU learning to derive an excess risk bound with an explicit convergence rate as a function of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\).

5 Related work

Excess risk bound in the noisy label literature PU learning can be regarded as a special case of classification with asymmetric label noise, and many studies in that literature have shown consistency results similar to Theorem 3 (Natarajan et al. 2013). Patrini et al. (2016) derived an explicit estimation error bound when \({\mathcal {F}}\) is a set of linear hypotheses, and Blanchard et al. (2016) showed consistency of the excess risk bound when the hypothesis space is an RKHS with a universal kernel. While these two studies assumed a one-sample scheme, our bound is based on the two-sample scheme. Therefore, our excess risk bound is expressed in terms of \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\), giving a new consistency theory.

Closed-form classifier Blanchard et al. (2010) suggested a score function similar to the WMMD score, using different bandwidth hyperparameters for the denominator and the numerator. With these differences, our method enjoys theoretical justification while their score function does not. Du Plessis et al. (2015) derived a closed-form classifier based on the squared loss. They estimated \(P(Y=1 \mid X)- P(Y=-\,1 \mid X)\) and showed consistency of the estimation error bound in the two-sample scheme. However, their classifier is not scalable because it requires computing the inverse of an \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) matrix.

6 Numerical experiments

In this section, we empirically analyze the proposed algorithm to demonstrate its practical efficacy using synthetic and real datasets. A PyTorch implementation of the experiments is available at https://github.com/eraser347/WMMD_PU.

6.1 Synthetic data analysis

We first visualize the effect of increasing the sample sizes \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) on the discriminant ability of the proposed algorithm (Experiment 1). Then we compare the performance of (i) the logistic loss \(\ell _{\mathrm {log}}\), denoted by LOG, and (ii) the double hinge loss \(\ell _{\mathrm {dh}}\), denoted by DH, both proposed by Du Plessis et al. (2015), (iii) the non-negative risk estimator method, denoted by NNPU, proposed by Kiryo et al. (2017), (iv) the threshold adjustment method, denoted by tADJ, proposed by Elkan and Noto (2008), and (v) the proposed algorithm, denoted by WMMD (Experiments 2, 3, and 4).

Fig. 1
figure 1

The illustration of the decision boundaries of the WMMD classifier on the two_moons dataset as the sizes of the positive and unlabeled samples increase. The true means of the positive and negative data distributions are plotted as blue and red lines, respectively. The gray ‘\(+\)’ points and the gray ‘−’ points refer to the unlabeled positive and unlabeled negative training data, respectively (Color figure online)

Experiment 1 In this experiment, we used the two_moons dataset whose underlying distributions are

$$\begin{aligned} X|Y=y, U&\sim N\left( \begin{bmatrix}2(1+y) - 4y\cos (\pi U) \\ (1+y)-4y\sin (\pi U) \end{bmatrix},\begin{bmatrix}0.4^{2}&0\\ 0&0.4^{2} \end{bmatrix}\right) , \end{aligned}$$

where U is a uniform random variable on \([0,1]\) and \(N(\mu ,\varSigma )\) is the normal distribution with mean \(\mu \) and covariance \(\varSigma \). We used the ‘make_moons’ function in the Python module ‘sklearn.datasets’ (Pedregosa et al. 2011) to generate the datasets.
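The sampling scheme above can be sketched with the same scikit-learn utility. The helper below is illustrative rather than the authors' code: it draws the positive set from \(P_{X \mid Y=1}\) and the unlabeled set from the marginal \(P_X\) with class-prior \(\pi _{+}\); note that sklearn's `make_moons` means differ from the display above by an affine transformation.

```python
import numpy as np
from sklearn.datasets import make_moons


def sample_pu_two_moons(n_p, n_u, pi_plus=0.5, noise=0.4, seed=0):
    """Draw a PU sample from a two-moons distribution (illustrative).

    Returns X_p ~ P_{X|Y=1} (n_p points) and X_u ~ P_X with
    class-prior pi_plus (n_u points, labels discarded).
    """
    rng = np.random.RandomState(seed)

    # Positive sample: generate extra points and keep those labeled 1.
    X, y = make_moons(n_samples=4 * n_p, noise=noise, random_state=rng)
    X_p = X[y == 1][:n_p]

    # Unlabeled sample: mix positives and negatives with prior pi_plus.
    n_pos = rng.binomial(n_u, pi_plus)
    X2, y2 = make_moons(n_samples=4 * n_u, noise=noise, random_state=rng)
    X_u = np.vstack([X2[y2 == 1][:n_pos], X2[y2 == 0][:n_u - n_pos]])
    return X_p, X_u
```

For instance, `sample_pu_two_moons(5, 50)` matches the smallest setting shown in Fig. 1.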

Figure 1 illustrates the decision boundaries of WMMD using the two_moons dataset. The first row displays the case where the unlabeled sample size is small, \(n_{\mathrm{u}}=50\), and the second row displays the case where the unlabeled sample size is large, \(n_{\mathrm{u}}=400\). The first and second columns display the case where the positive sample sizes are \(n_{\mathrm{p}}=5\) and \(n_{\mathrm{p}}=10\), respectively. The class-prior is fixed to \(\pi _{+}=0.5\), and we assumed that the class-prior is known. We visualize the true mean function of the positive and negative data distributions with blue and red lines, respectively. The positive data are represented by blue diamond points, and the unlabeled data are represented by gray points. The decision boundaries of the WMMD classifier tend to correctly separate the two clusters as \(n_{\mathrm{p}}\) and \(n_{\mathrm{u}}\) increase.

In Experiments 2, 3, and 4, we evaluate: (i) the accuracy and area under the receiver operating characteristic curve (AUC) as \(n_{\mathrm{u}}\) and \(\pi _{+}\) change when the class-prior is known (Experiment 2) and unknown (Experiment 3); (ii) the elapsed training time (Experiment 4). In these experiments, we set up the underlying joint distribution as follows:

$$\begin{aligned} X\mid Y=y \sim N\left( y\frac{{\varvec{1}}_{2}}{\sqrt{2}}, I_{2}\right) , Y \sim 2 \times {\mathrm {Bern}} (\pi _{+}) -1, \end{aligned}$$
(9)

where \({\mathrm {Bern}}(p)\) is the Bernoulli distribution with mean p, \({\varvec{1}}_{2} = (1,1)^T\) is the 2-dimensional vector of ones, and \(I_{2}\) is the identity matrix of size 2.
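A minimal sketch of sampling from the joint distribution in Eq. (9) (the function name is ours):

```python
import numpy as np


def sample_gaussian_pu(n, pi_plus=0.5, seed=0):
    """Sample (X, Y) from Eq. (9): Y ~ 2 Bern(pi_plus) - 1 and
    X | Y = y ~ N(y * 1_2 / sqrt(2), I_2)."""
    rng = np.random.RandomState(seed)
    y = 2 * rng.binomial(1, pi_plus, size=n) - 1
    mu = y[:, None] * (np.ones(2) / np.sqrt(2))  # class-dependent mean
    X = mu + rng.randn(n, 2)                     # identity covariance
    return X, y
```

The positive sample for the experiments keeps the rows with y == 1, and the unlabeled sample is a fresh draw with the labels discarded.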

Experiment 2 In this experiment, we compare the accuracy and AUC of the five PU learning algorithms when the true class-prior \(\pi _{+}\) is known. Figure 2a, c show the accuracy and AUC for various \(n_{\mathrm{u}}\). The training sample size for the positive data is \(n_{\mathrm{p}}=100\) and the class-prior is \(\pi _{+}=0.5\). The unlabeled sample size varies from 40 to 500 in increments of 20. Training and test data are randomly generated 100 times. For reference, we add the 1 − Bayes risk (the accuracy of the Bayes classifier) for each unlabeled sample size. In terms of accuracy, the proposed WMMD tends to approach the 1 − Bayes risk as \(n_{\mathrm{u}}\) increases. Compared with the other PU learning algorithms, WMMD achieves higher accuracy for every \(n_{\mathrm{u}}\) and comparable or better AUC.
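Accuracy and AUC, as reported throughout Experiments 2 and 3, can be computed from any real-valued score function; the scikit-learn sketch below is illustrative (the `evaluate` helper and the zero cutoff stand in for the class-prior-dependent cutoff used in the paper):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def evaluate(score, X_test, y_test, threshold=0.0):
    """Accuracy and AUC of a scoring rule on labeled test data.

    `score` returns larger values for points more likely to be
    positive; labels in y_test lie in {-1, +1}. Note that AUC
    depends only on the ranking induced by `score`, not on the
    cutoff -- the basis for the robustness argument in Experiment 3.
    """
    s = score(X_test)
    y_hat = np.where(s > threshold, 1, -1)
    return accuracy_score(y_test, y_hat), roc_auc_score(y_test, s)
```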

Figure 2b, d show a comparison of accuracy and AUC as \(\pi _{+}\) changes. The training sample sizes for the positive and unlabeled data are \(n_{\mathrm{p}}=100\) and \(n_{\mathrm{u}}=400\), respectively. The class-prior \(\pi _{+}\) varies from 0.05 to 0.95 in increments of 0.05. The test sample size is \(10^3\). Training and test data are repeatedly generated 100 times with different random seeds. In terms of accuracy, the proposed WMMD performs comparably with LOG and NNPU, showing advantages over DH and tADJ. When the true class-prior is less than or equal to 0.8, WMMD achieves higher AUC than all methods except tADJ. tADJ achieves the highest AUC because \(P(Y=1 \mid X=x)\) is proportional to \(P( \{ x \text { is from the positive dataset} \} \mid X=x)\). This empirically shows that WMMD has a discriminant ability comparable to the other algorithms over a wide range of class-priors.

Fig. 2
figure 2

The comparison of the accuracy and AUC of the five PU learning algorithms as each of \(n_{\text {u}}\) and \(\pi _{+}\) changes. The dashed curve represents the 1 − Bayes risk. The curve and the shaded region represent the average and the standard error, respectively, based on 100 replications

Fig. 3
figure 3

The comparison of the accuracy and AUC of the five PU learning algorithms as each of \(n_{\text {u}}\) and \(\pi _{+}\) changes under the situation where \({\pi }_{+}\) is unknown. The dashed curve represents the 1 − Bayes risk. The curve and the shaded region represent the average and the standard error, respectively, based on 100 replications. LOG, DH, and NNPU use the estimate of the class-prior from the ‘KM1’ method

Experiment 3 The main goal of this experiment is to show the robustness of the proposed classifier when the class-prior \({\pi }_{+}\) is unknown. In the PU learning literature, \(\pi _{+}\) has frequently been assumed to be known (Du Plessis et al. 2015; Niu et al. 2016; Kiryo et al. 2017; Kato et al. 2019). However, this assumption can be strong in real-world applications, and an accurate estimate of \(\pi _{+}\) is necessary to correctly execute existing PU learning algorithms. In this experiment, we compare the accuracy and AUC when the class-prior \({\pi }_{+}\) is unknown. For the WMMD classifier, we used a density-based method for class-prior estimation, which is obtained as a byproduct of the proposed algorithm; a description is given in “Appendix A”. The results of LOG, DH, and NNPU are given for completeness using the ‘KM1’ method of Ramaswamy et al. (2016). We take these estimates as true values and repeat the comparative numerical experiments of Experiment 2.

Since the objective functions of the LOG, DH, and NNPU algorithms depend on the estimate \({\hat{\pi }}_{+}\), we anticipate that both the accuracy and AUC rely on the quality of the estimation. In contrast, the tADJ algorithm does not depend on the class-prior, so its performance is not affected. Likewise, since the proposed score function does not depend on the class-prior \(\pi _{+}\), which is used only to determine a cutoff, the AUC of the proposed algorithm is less affected by the estimation of \(\pi _{+}\).

Figure 3a, c compare the accuracy and AUC as a function of \(n_{\mathrm{u}}\). In terms of accuracy, WMMD performs worse than LOG, DH, and NNPU, while its AUC is higher. Though tADJ shows poor accuracy over a wide range, it achieves high AUC comparable to WMMD. As anticipated, WMMD is more robust than LOG, DH, and NNPU in AUC, possibly because our score function \({\hat{\lambda }}_{n_{\mathrm{p}},n_{\mathrm{u}} }\) does not depend on \(\pi _{+}\). A similar trend can be found in Figure 3b, d. We note that the ‘KM1’ method is not scalable and thus may not be usable for large-scale datasets.

Experiment 4 In this experiment, we compare the elapsed training time, including hyperparameter optimization, of the five PU learning algorithms. The data are generated from the distributions described in Eq. (9), and we set \(n_{\mathrm{p}}=100\), \(n_{\mathrm{u}}=400\), and \(\pi _{+} = 0.5\). The elapsed time is measured on 20 Intel Xeon E5-2630 v4 @ 2.20 GHz CPU processors.

Table 1 compares the elapsed training time and its ratio relative to that of WMMD. WMMD takes the shortest time among the five methods. In particular, the training time for WMMD is at least 300 times shorter than that of the LOG and DH methods. This is because the WMMD classifier has an analytic form while the LOG and DH methods require solving a non-linear programming problem.

6.2 Real data analysis

We demonstrate the practical utility of the proposed algorithm using eight real binary classification datasets from the LIBSVM repository (Chang and Lin 2011). Since some observations in the raw datasets are not completely recorded, we removed them and constructed datasets of fully recorded observations. Next, to investigate the effect of varying \(\pi _{+}\), we artificially reconstructed \({\mathcal {X}}_{\mathrm{p}}\) and \({\mathcal {X}}_{\mathrm{u}}\) through random sampling from the fully recorded datasets. For the three datasets australian_scale, breast-cancer_scale, and skin_nonskin, we reconstructed the data so that the resulting class-prior \(\pi _{+}\) ranges from 0.15 to 0.79, and we append the suffix 2 to those datasets. We randomly resampled data 100 times for the seven small datasets and 10 times for the four big datasets: skin_nonskin, skin_nonskin2, epsilon_normalized, and HIGGS. Table 2 summarizes statistics for the eleven real datasets. We conduct two comparative numerical experiments, for the cases where \(\pi _{+}\) is known and unknown.

Table 1 A summary of elapsed training time and its ratio for the five PU learning algorithms based on 100 replications
Table 2 A summary of the eleven binary classification datasets

Table 3 shows the average and the standard error of the accuracy and AUC when the class-prior \(\pi _{+}\) is known. LOG and DH fail to compute the \((n_{\mathrm{p}}+n_{\mathrm{u}}) \times (n_{\mathrm{p}}+n_{\mathrm{u}})\) Gram matrix due to out-of-memory errors under the 12 GB GPU memory limit. WMMD achieves accuracy and AUC comparable to or better than LOG, DH, and tADJ on most datasets. Compared to NNPU, WMMD performs comparably on the small datasets. However, NNPU achieves higher accuracy on skin_nonskin, epsilon_normalized, and HIGGS; the neural network used in NNPU fits the complicated, high-dimensional structure of these data well and thus shows high accuracy.

Table 3 Accuracy and AUC comparison using the real datasets when the class-prior \(\pi _{+}\) is known

Table 4 compares the average and the standard error of the accuracy and AUC when the class-prior \(\pi _{+}\) is unknown. As in Experiment 3 of Sect. 6.1, we estimate \(\pi _{+}\) using the ‘KM1’ method for LOG, DH, and NNPU, and using the density-based method for WMMD. The LOG, DH, and NNPU algorithms are implemented only on the seven small-scale datasets because the method of Ramaswamy et al. (2016) is not feasible for the large-scale datasets (Bekker and Davis 2018). Overall, WMMD shows performance comparable to or better than the other PU learning algorithms on most datasets. Compared to Table 3, WMMD and tADJ are robust to an unknown \(\pi _{+}\) in terms of AUC, because they do not require an estimate of \(\pi _{+}\) to construct their score functions. In contrast, the other methods require an estimate \({\hat{\pi }}_{+}\), and we observe a substantial drop in accuracy and AUC when the ‘KM1’ estimate is used.

Table 4 Accuracy and AUC comparison using the real datasets when the class-prior \(\pi _{+}\) is unknown

7 Concluding remarks

Existing methods use different objective functions and hypothesis spaces and, as a consequence, different optimization algorithms. Hence, there is no reason to expect one method to outperform the others uniformly across all scenarios. A particular method may excel in a particular scenario; for example, NNPU, proposed by Kiryo et al. (2017), would perform better in complicated data settings because of the expressive power of neural networks. However, the proposed method has a clear computational advantage due to its closed form, as well as theoretical strength in the form of an explicit excess risk bound. Further, the proposed method works reasonably well whether \(\pi _{+}\) is known or unknown. In this regard, we believe the proposed method can serve as a principled and easy-to-compute baseline algorithm for PU learning.