1 Introduction

The “best” treatment for the current patient must be learned from the treatment(s) of previous patients. However, no two patients are ever exactly alike, so the learning process must involve learning the ways in which the current patient is similar to previous patients—i.e., has the same or similar features—and which of those features are relevant to the treatment(s) under consideration. This already complicated learning process is further complicated because the history of previous patients records only the outcomes actually experienced from treatments actually received—not the outcomes that would have been experienced from alternative treatments—the counterfactuals. And this learning process is complicated still further because the treatments received by previous patients were (typically) chosen according to some protocol that might or might not be known but was almost surely not random—so the observed data is biased.

The same complications arise in many other settings. Which mode of advertisement would be most effective for a given product? Which materials would best promote learning/performance for a given student? Which investment strategy would yield higher returns or lower risk in a particular macroeconomic environment? As in the medical setting, choosing the “best” policy in these settings (and in others too numerous to mention) requires learning which features of each context are relevant for the decision/action at hand and learning about the consequences of decisions/actions not taken in previous contexts—the counterfactuals; such learning is especially complicated because the observed data may be biased (because it was created by an existing—perhaps less effective—policy) and because each observed instance and action may be informed by a vast array of features. (Counterfactuals are seldom seen in observed data. One possible way to obtain counterfactual information would be to conduct controlled experiments—but in many contexts, experimentation will be impractical or even impossible. Absent controlled experiments, counterfactuals must be inferred.)

This paper proposes a novel approach to addressing such problems. We construct an algorithm that learns a nonlinear policy to recommend an action for each (new) instance. During the training phase, our algorithm learns the action-dependent relevant features and then uses a feedforward neural network to optimize a nonlinear stochastic policy the output of which is a probability distribution over the actions given the relevant features. When we apply the trained algorithm to a new instance, we choose the action which has the highest probability. In the settings mentioned above our algorithm constructs: (in the medical context) a personalized plan of patient treatment; (in the advertising context) a product-specific plan of advertisement; (in the educational context) a student-specific plan of instruction; (in the financial context) a situationally-specific investment strategy. We use actual data to demonstrate that our algorithm is significantly superior to existing state-of-the-art algorithms. We emphasize that our methods and the algorithms we develop are widely applicable to an enormous range of settings, from healthcare to advertisement to education to finance to recommender systems to smart cities. (See Athey and Imbens (2015), Hoiles and van der Schaar (2016) and Bottou et al. (2013), for just a few examples.)

As we have noted, our methods and algorithms apply in many settings, each of which comes with specific features, actions and rewards. In the medical context, typical features are items available in the electronic health record (laboratory tests, previous diagnoses, demographic information, etc.), typical actions are choices of treatments (perhaps including no treatment at all), and typical rewards are recovery rates or 5-year survival rates. In the advertising context, typical features are the characteristics of a particular website and user, typical actions are the placements of an advertisement on a webpage, and typical rewards are click-rates. In the educational context, typical features are previous coursework and grades, typical actions are materials presented or subsequent courses taken, and typical rewards are final grades or graduation rates. In the financial context, typical features are aspects of the macroeconomic environment (interest rates, stock market information, etc.), typical actions are the timing of particular investment choices, and typical rewards are returns on investment.

For a simple but striking example from the medical context, consider the problem of choosing the best treatment for a patient with kidney stones. Such patients are usually classified by the size of the stones: small or large; the most common treatments are Open Surgery and Percutaneous Nephrolithotomy. Table 1 summarizes the results. Note that Open Surgery performs better than Percutaneous Nephrolithotomy for patients with small stones and for patients with large stones but Percutaneous Nephrolithotomy performs better overall. Of course this would be impossible if the subpopulations that received the two treatments were identical—but they were not. And in fact we do not know the policy that created these subpopulations by assigning patients to treatments. We do know that patients are distinguished by a vast array of features in addition to the size of stones—age, gender, weight, kidney function tests, etc.—but we do not know which of these features is relevant. And of course we know the result of the treatment actually received by each patient—but we do not know what the result of the alternative treatment would have been (the counterfactual).

Table 1 Success rates of two treatments for kidney stones (Bottou et al. 2013)

Three more points should be emphasized. Although Table 1 shows only two actions, in fact there are a number of other possible actions for kidney stones: they could be treated using any of a number of different medications, they could be treated by ultrasound, or they could not be treated at all. This is important for several reasons. The first is that a number of existing methods assume that there are only two actions (corresponding to treat or not-treat); but as this example illustrates, in many contexts (and in the medical context in particular), it is typically the case that there are many actions, not just two—and, as the papers themselves note, these methods simply do not work when there are more than two actions; see Johansson et al. (2016). The second is that the features that are relevant for predicting the success of a particular action typically depend on the action: different features will be found to be relevant for different actions. (The treatment of breast cancer, as discussed in Yoon et al. (2017), illustrates this point well. The issue is not simply whether or not to apply a regime of chemotherapy, but which regime of chemotherapy to apply. Indeed, there are at least six widely used regimes of chemotherapy to treat breast cancer, and the features that are relevant for predicting success of a given regime are different for different regimes.) The third is that we go much further than the existing literature by allowing for nonlinear policies. To do this, we use a feedforward neural network, rather than relying on familiar algorithms such as POEM (Swaminathan and Joachims 2015a). To determine the best treatment, the bias in creating the populations, the features that are relevant for each action and the policy must all be learned. Our methods are adequate to this task.

The remainder of the paper is organized as follows. In Sect. 2, we describe some related work and highlight the differences with respect to our work. In Sect. 3, we describe the observational data on which our algorithm operates. In Sect. 4, we begin with an informal overview, then give the formal description of our algorithm (including substantial discussion). Section 5 gives the pseudo-code for the algorithm. Some extensions are discussed in Sect. 6. In Sect. 7, we demonstrate the performance of our algorithm on a variety of real datasets. Section 8 concludes. Proofs are in the Appendix.

2 Related work

From a conceptual point of view, the paper most closely related to ours—at least among recent papers—is perhaps Johansson et al. (2016), which treats a similar problem: learning relevance in an environment in which the counterfactuals are missing, data is biased and each instance may have many features. The approach taken there is somewhat different from ours in that, rather than identifying the relevant features, they transfer the features to a new representation space. [This process is referred to as domain adaptation (Johansson et al. 2016).] A more important difference from our work is that it assumes that there are only two actions: treat and don’t treat. As we have discussed in the Introduction, the assumption of two actions is unrealistic; in most situations there will be many (possible) actions. That paper states explicitly that the approach taken there does not work when there are more than two actions and offers the multi-action setting as an obvious but difficult challenge. One might think of our work as “solving” this challenge—but we stress that the “solution” is not at all a routine extension. Moreover, in addition to this obvious challenge, there is a more subtle—but equally difficult—challenge: when there are more than two actions, it will typically be the case that some features will be relevant for some actions and not for others, and—as discussed in the Introduction—it will be crucial to learn which features are relevant for which actions.

From a technical point of view, our work is perhaps most closely related to Swaminathan and Joachims (2015a) in that we use similar methods (IPS estimates and empirical Bernstein inequalities) to learn counterfactuals. However, that work does not treat observational data in which the bias is unknown and does not learn/identify relevant features. Another similar work on policy optimization from observational data is Strehl et al. (2010).

The work in Wager and Athey (2015) treats the related (but somewhat different) problem of estimating individual treatment effects. The approach there is through causal forests as developed by Athey and Imbens (2015), which are variations on the more familiar random forests. However, the emphasis in this work is on asymptotic estimates, and in the many situations for which the number of (possibly) relevant features is large the datasets will typically not be large enough that asymptotic estimates will be of more than limited interest. There are many other works focusing on estimating treatment effects; some include Tian et al. (2012), Alaa and van der Schaar (2017), Shalit et al. (2016).

More broadly, our work is related to methods for feature selection and counterfactual inference. The literature on feature selection can be roughly divided into categories according to the extent of supervision: supervised feature selection (Song et al. 2012; Weston et al. 2003), unsupervised feature selection (Dy and Brodley 2004; He et al. 2005) and semi-supervised feature selection (Xu et al. 2010). However, our work does not fall into any of these categories; instead we need to select features that are informative in determining the rewards of each action. This problem was addressed in Tekin and van der Schaar (2014) but in an online Contextual Multi-Armed Bandit (CMAB) setting in which experimentation is used to learn relevant features. In the present paper, we treat the logged CMAB setting in which experimentation is impossible and relevant features must be learned from the existing logged data. As we have already noted, there are many circumstances in which experimentation is impossible. The difference between the settings is important—and the logged setting is much more difficult—because in the online setting it is typically possible to observe counterfactuals, while in the current logged setting it is typically not possible to observe counterfactuals, and because in the online setting the decision-maker controls the observations so whatever bias there is in the data is known.

With respect to learning, feature selection methods can be divided into three categories—filter models, wrapper models, and embedded models (Tang et al. 2014). Our method is most similar to filter techniques in which features are ranked according to a selected criterion such as a Fisher score (Duda et al. 2012), correlation based scores (Song et al. 2012), mutual information based scores (Koller and Sahami 1996; Yu and Liu 2003; Peng et al. 2005), Hilbert–Schmidt Independence Criterion (HSIC) (Song et al. 2012) and Relief and its variants (Kira and Rendell 1992; Robnik-Šikonja and Kononenko 2003) etc., and the features having the highest ranks are labeled as relevant. However, these existing methods are developed for classification problems and they cannot easily handle datasets in which the rewards of actions not taken are missing.

The literature on counterfactual inference can be categorized into three groups: direct, inverse propensity re-weighting and doubly robust methods. The direct methods compute counterfactuals by learning a function mapping from feature-action pairs to rewards (Prentice 1976; Wager and Athey 2015). The inverse propensity re-weighting methods compute unbiased estimates by weighting the instances by their inverse propensity scores (Swaminathan and Joachims 2015a; Joachims and Swaminathan 2016). The doubly robust methods compute the counterfactuals by combining direct and inverse propensity score re-weighting methods to compute more robust estimates (Dudík et al. 2011; Jiang and Li 2016). With respect to this categorization, our techniques might be viewed as falling into the doubly robust category.

Our work can be seen as building on and extending the work of Swaminathan and Joachims (2015a, b), which learn linear stochastic policies. We go much further by learning a non-linear stochastic policy. Our work can also be seen as an off-line variant of the on-line REINFORCE algorithm (Williams 1992).

We should also note two papers that were written after the current paper was originally submitted. The work of Joachims et al. (2018) extends the earlier work of Swaminathan and Joachims (2015a, b) to non-linear policies. Our own (preliminary) work (Atan et al. 2018) proposes a different approach for learning a representation function and a policy. Unlike the present paper, our more recent work uses a loss function that embodies both a policy loss (similar to, but slightly different than, the policy loss used in the present paper) and a domain loss (which quantifies the divergence between the logging policy and the uniform policy under the representation function). The advantage of these changes is that they make it possible to learn the representation function and the policy in an end-to-end fashion.

3 Data

We consider logged contextual bandit data: that is, data for which we know the features of each instance, the action taken and the reward realized in that instance—but not the reward that would have been realized had a different action been taken. We assume that the data has been logged according to some policy which we may not know, but which is not necessarily random and so the data is biased. Each data point consists of a feature vector, an action and a reward. A feature is a vector \((x_1, \ldots , x_d)\) where each \(x_i \in {\mathcal {X}}_i\) is the value of feature type i. The space of all feature types is \({\mathcal {F}} = \{1, \ldots , d\}\), the space of all features is \({\mathcal {X}} = \varPi _{i=1}^d {\mathcal {X}}_i\) and the set of actions is \({\mathcal {A}}\). We assume that the sets of feature types and actions are finite; we write \(b_i = |{\mathcal {X}}_i| \) for the cardinality of \({\mathcal {X}}_i\) and \({\mathcal {A}} = \{1,2, \ldots , k \}\) for the set of actions. For \({\varvec{x}} \in {\mathcal {X}}\) and \({\mathcal {S}} \subset {\mathcal {F}}\) we write \({\varvec{x}}_{\mathcal {S}}\) for the restriction of \({\varvec{x}}\) to \({\mathcal {S}}\); i.e. for the vector of feature types whose indices lie in \({\mathcal {S}}\). It will be convenient to abuse notation and view \({\varvec{x}}_{{\mathcal {S}}}\) either as a vector of length \(|{{\mathcal {S}}}|\) or as a vector of length \(d = |{{\mathcal {F}}}|\) which is 0 for feature types not in \({\mathcal {S}}\). A reward is a real number; we normalize so that rewards lie in the interval [0, 1]. In some cases, the reward will be either 1 or 0 (success or failure; good or bad outcome); in other cases the reward may be interpreted as the probability of a success or failure (good or bad outcome).

We are given a data set

$$\begin{aligned} {\mathcal {D}}^n = \{({\varvec{X}}_1, A_1, R^{\text {obs}}_1), \ldots , ({\varvec{X}}_n, A_n, R^{\text {obs}}_n) \} \end{aligned}$$

We assume that the jth instance/data point \(({\varvec{X}}_j, A_j, R^{\text {obs}}_j)\) is generated according to the following process:

  1. The instance is described by a feature vector \({\varvec{X}}_j\) that arrives according to the fixed but unknown distribution \(\Pr ({\mathcal {X}})\); \({\varvec{X}}_j \sim \Pr ({\mathcal {X}})\).

  2. The action taken was determined by a policy that draws actions at random according to a (possibly unknown) probability distribution \(p_0({\mathcal {A}} | {\varvec{X}}_j)\) on the action space \({\mathcal {A}}\). (Note that the distribution of actions taken depends on the vector of features.)

  3. Only the reward of the action actually performed is recorded into the dataset, i.e., \(R_j^{\text {obs}} \equiv R_j(A_j)\).

  4. For every action a, either taken or not taken, the reward \(R_j(a) \sim \varPhi _a(\cdot | {\varvec{X}}_j)\) that would have been realized had a actually been taken is generated by a random draw from an unknown family \(\{ \varPhi _a( \cdot | {\varvec{x}})\}_{ {\varvec{x}} \in {\mathcal {X}}, a \in {\mathcal {A}}}\) of reward distributions with support \(\left[ 0,1\right] \).

The logging policy corresponds to the choices made by the existing decision-making procedure and so will typically create a biased distribution on the space of feature-action pairs.
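To make this data-generating process concrete, the following is a minimal Python sketch of steps 1-4 (our own illustration rather than part of the paper; the particular feature distribution, logging policy and reward distributions below are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_logged_data(n, k, feature_dist, logging_policy, reward_dist):
    """Generate logged bandit data following steps 1-4 above: features arrive i.i.d.,
    the (possibly unknown) logging policy draws an action, potential rewards exist for
    every action, but only the reward of the action actually taken is recorded."""
    data = []
    for _ in range(n):
        x = feature_dist(rng)                  # step 1: X_j ~ Pr(X)
        p = logging_policy(x)                  # step 2: p_0(. | X_j), a distribution over k actions
        a = rng.choice(k, p=p)
        r_all = reward_dist(x, rng)            # step 4: potential rewards R_j(a), a in A
        data.append((x, a, r_all[a]))          # step 3: only R_j(A_j) is logged
    return data

# Hypothetical example: 5 binary features, 3 actions, a feature-dependent softmax logging policy.
d, k = 5, 3
theta = rng.normal(size=(k, d))
def feature_dist(rng):
    return rng.integers(0, 2, size=d).astype(float)
def logging_policy(x):
    z = theta @ x
    e = np.exp(z - z.max())
    return e / e.sum()
def reward_dist(x, rng):
    return rng.binomial(1, 0.2 + 0.6 * x[:k])  # arbitrary feature-dependent Bernoulli rewards
D_n = generate_logged_data(1000, k, feature_dist, logging_policy, reward_dist)
```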

We make two natural assumptions about the rewards and the logging policy; taken together they enable us to generate unbiased estimates of the variables of interest. The first assumption guarantees that there is enough information in the data-generating process so that counterfactual information can be inferred from what is actually observed.

Assumption 1

(Common support) \(p_0(a | {\varvec{x}}) > 0\) for all action-feature pairs \((a, {\varvec{x}})\).

The second assumption is that the logging policy depends only on the observed features—and not on the observed rewards.

Assumption 2

(Unconfoundedness) For each feature vector \({\varvec{X}}\), the rewards of actions \(\{ R(a) \}_{a \in {\mathcal {A}}}\) are statistically independent of the action actually taken; i.e., \(\{ R(a) \}_{a \in {\mathcal {A}}} \perp A \,|\, {\varvec{X}}\).

These assumptions are universal in the counterfactual inference literature—see Johansson et al. (2016), Athey and Imbens (2015) for instance—although they can be criticized on the grounds that their validity cannot be determined on the basis of what is actually observed.

4 The algorithm

It seems useful to begin with a brief overview; more details and formalities follow below. Our algorithm consists of a training phase and an execution phase; the training phase consists of three steps.

  A. In the first step of the training phase, the algorithm either inputs the true propensity scores (if they are known) or uses the logged data to estimate propensity scores (when the true propensity scores are not known); this (partly) corrects the bias in the logged data.

  B. In the second step of the training phase, the algorithm uses the known or estimated propensity scores to compute, for each action and each feature, an estimate of relevance for that feature with respect to that action. The algorithm then retains the more relevant features—those for which the estimate is above a threshold—and discards the less relevant features—those for which the estimate is below the threshold. (For reasons that will be discussed below, the threshold used depends on both the action and the feature type.)

  C. In the third step of the training phase, the algorithm uses the known or estimated propensity scores and the features identified as relevant, and trains a feedforward neural network model to learn a non-linear stochastic policy that minimizes the “corrected” cross entropy loss.

In the execution phase, the algorithm is presented with a new instance and uses the policy derived in the training phase to recommend an action for this new instance on the basis of the relevant features of that instance.

Not surprisingly, the setting in which the propensity scores are known is simpler than the setting in which the propensity scores must be estimated. In the latter case, in addition to the complication of the estimation itself, we shall need to be careful about estimated propensity scores that are “too small”—this will require a correction—and our error estimates will be weaker. Because clarity of exposition seems more important than compactness, we therefore present first the algorithm for the case in which true propensity scores are known and then circle back to present the necessary modifications for the case in which true propensity scores are not known but must be estimated.

4.1 True propensities

We begin with the setting in which propensities of the randomized algorithm are actually tracked and available in the dataset. This is often the case in the advertising context, for example. In this case, for each j, set \(p_{0,j} = p_0(A_j|X_j)\), and write \({{\varvec{P}}}_0 = [{p}_{0,j}]_{j=1}^n\); this is the vector of true propensities.

4.2 Relevance

It might seem natural to define the set \({\mathcal {S}}\) of feature types to be irrelevant (for a particular action) if the distribution of rewards (for that action) is independent of the features in \({\mathcal {S}}\), and to define the set \({\mathcal {S}}\) to be relevant otherwise. In theoretical terms, this definition has much to recommend it. In operational terms, however, this definition is not of much use. That is because finding irrelevant sets of feature types would require many observations (to determine the entire distribution of rewards) and intractable calculations (to examine all sets of feature types). Moreover, this notion of irrelevance will often be too strong because our interest will often be only in maximizing expected rewards (or more generally some statistical function of rewards), as it would be in the medical context if the reward is five-year survival rate, or in the advertising or financial settings, if the reward is expected revenue or profit and the advertiser or firm is risk-neutral.

Given these objections, we take an alternative approach. We define a measure of how relevant a particular feature type is for the expected reward of a particular action, learn/estimate this measure from observed data, retain features for which this measure is above some endogenously derived threshold (the most relevant features) and discard other features (the least relevant features). Of course, this approach has drawbacks. Most obviously, it might happen that two feature types are individually not very relevant but are jointly quite relevant. (We leave this issue for future work.) However, as we show empirically, this approach has the virtue that it works: the algorithm we develop on the basis of this approach is demonstrably superior to existing algorithms.

4.2.1 True relevance

To begin formalizing our measure of relevance, fix an action a, a feature vector x and a feature type i. Define expected rewards and marginal expected rewards as follows:

$$\begin{aligned}&{\bar{r}}(a, {\varvec{x}}) = {\mathbb {E}}\left[ R(a) | {\varvec{X}} = {\varvec{x}}\right] \nonumber \\&{\bar{r}}(a, {\varvec{x}}_i) ={\mathbb {E}}_{{\varvec{X}}_{-i}} [{\bar{r}}(a, {\varvec{X}}) \bigg | {\varvec{X}}_{i} = {\varvec{x}}_{i}] \nonumber \\&{\bar{r}}(a) = {\mathbb {E}}_{{\varvec{X}}} \left[ {\bar{r}}(a, {\varvec{X}})\right] \end{aligned}$$
(1)

We define the true relevance of feature type i for action a by

$$\begin{aligned} g(a, i) = {\mathbb {E}}\left[ \ell \left( {\bar{r}}(a, X_i) - {\bar{r}}(a)\right) \right] , \end{aligned}$$
(2)

where the expectation is taken with respect to the arrival probability distribution of \(X_i\) and \(\ell (\cdot )\) denotes the loss function. (Keep in mind that the true arrival probability distribution of \(X_i\) is unknown and must be estimated from the data.) Our results hold for an arbitrary loss function, assuming only that it is strictly monotonic and Lipschitz; i.e. there is a constant B such that \(\left| \ell (r) - \ell (r')\right| \le B |r - r'|\). These conditions are satisfied by a large class of loss functions including \(l_1\) and \(l_2\) losses. The relevance measure g expresses the expected difference (measured by \(\ell \)) between the expected reward of a given action conditioned on feature type i and the unconditioned expected reward; \(g(a,i) = 0\) exactly when feature type i does not affect the expected reward of action a.
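As a simple (hypothetical) illustration of the definition, suppose feature type i is binary with each value equally likely, and suppose

$$\begin{aligned} {\bar{r}}(a, x_i = 0) = 0.3, \qquad {\bar{r}}(a, x_i = 1) = 0.7, \qquad {\bar{r}}(a) = 0.5 . \end{aligned}$$

Then, taking \(\ell \) to be the \(l_1\) loss,

$$\begin{aligned} g(a,i) = \tfrac{1}{2}\left| 0.3 - 0.5\right| + \tfrac{1}{2}\left| 0.7 - 0.5\right| = 0.2 , \end{aligned}$$

whereas a feature type that leaves the conditional expected reward unchanged would give \(g(a,i) = 0\).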

We refer to g as true relevance because it is computed using the true arrival distribution—but the true arrival distribution is unknown. Hence, even when the true propensities are known, relevance must be estimated from observed data. This is the next task.

4.2.2 Estimated relevance

We now derive estimates of relevance based on observed data (continuing to assume that true propensities are known). To do so, we first need to estimate \({\bar{r}}(a)\) and \({\bar{r}}(a, x_i)\) for \(x_i \in {\mathcal {X}}_i\), \(i \in {\mathcal {F}}\) and \(a \in {\mathcal {A}}\) from available observational data. An obvious way to do this is through classical supervised-learning estimators, in particular the sample mean estimators for \({\bar{r}}(a)\) and \({\bar{r}}(a, x_i)\). However, using straightforward sample mean estimation would be wrong because the logging policy introduces a bias into the observations. Following the idea of Inverse Propensity Scores (Rosenbaum and Rubin 1983), we correct this bias by using Importance Sampling.

Write N(a), \(N(x_i)\), \(N(a, x_i)\) for the number of observations (in the given data set) with action a, with feature \(x_i\), and with the pair consisting of action a and feature \(x_i\), respectively. We can rewrite our previous definitions as:

$$\begin{aligned}&{\bar{r}}(a, x_i) = {\mathbb {E}}_{({\varvec{X}}, A, R^{\text {obs}}) \sim p_0} \left[ \frac{{\mathbb {I}}(A = a)R^{\text {obs}} }{p_0(A | {\varvec{X}})} \bigg | X_i = x_i \right] \nonumber \\&{\bar{r}}(a) = {\mathbb {E}}_{({\varvec{X}}, A, R^{\text {obs}}) \sim p_0} \left[ \frac{{\mathbb {I}}(A = a) R^{\text {obs}}}{p_0(A | {\varvec{X}})} \right] \end{aligned}$$
(3)

where \({\mathbb {I}}(\cdot )\) is the indicator function. (Note that we are taking expectations with respect to the true propensities.)

Let \({\mathcal {J}}(x_i)\) denote the set of indices for which feature type i takes the value \(x_i\), i.e., \({\mathcal {J}}(x_i) = \{ j \in \{1,2, \ldots ,n\} : X_{i,j} = x_i \}\). The Importance Sampling approach provides unbiased estimates of \({\bar{r}}(a)\) and \({\bar{r}}(a, x_i)\) as

$$\begin{aligned}&{\widehat{R}}(a, x_i ; {\varvec{P}}_0) = \frac{1}{N(x_i)}\sum _{j \in {\mathcal {J}}(x_i)} \frac{ {\mathbb {I}}(A_j = a) R^{\text {obs}}_j}{p_{0,j}}, \nonumber \\&{\widehat{R}}(a ; {\varvec{P}}_0) = \frac{1}{n}\sum _{j=1}^{n} \frac{{\mathbb {I}}(A_j = a) R^{\text {obs}}_j }{p_{0,j}} , \end{aligned}$$
(4)

(We include the propensities \({\varvec{P}}_0\) in the notation as a reminder that these estimators are using the true propensity scores.)

We now define the estimated relevance of feature type i for action a as

$$\begin{aligned} {\widehat{G}}(a, i; {\varvec{P}}_0) = \frac{1}{n}\sum _{x_i \in {\mathcal {X}}_i} N(x_i) \ell \left( {\widehat{R}}(a, x_i ; {\varvec{P}}_0) - {\widehat{R}}(a ; {\varvec{P}}_0) \right) . \end{aligned}$$
(5)

(Note that we have abused terminology/notation by suppressing reference to the particular sample that was observed.)
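For concreteness, the estimators in Eqs. (4) and (5) can be computed as in the following sketch (our own illustration, assuming the true propensities are available; the array names are ours, and the \(l_1\) loss stands in for the generic \(\ell \)):

```python
import numpy as np

def estimated_relevance(X, A, R_obs, P0, a, i, loss=np.abs):
    """Minimal sketch of Eqs. (4)-(5). X is an (n, d) array of discrete features,
    A the logged actions, R_obs the observed rewards and P0 the propensities
    p_0(A_j | X_j) of the logged actions."""
    n = len(A)
    U = (A == a) * R_obs / P0                 # importance-weighted reward samples
    R_hat_a = U.mean()                        # \hat{R}(a; P_0)
    G_hat = 0.0
    for x_i in np.unique(X[:, i]):            # loop over observed values of feature type i
        J = (X[:, i] == x_i)                  # indices j with X_{i,j} = x_i
        R_hat_a_xi = U[J].mean()              # \hat{R}(a, x_i; P_0)
        G_hat += (J.sum() / n) * loss(R_hat_a_xi - R_hat_a)
    return G_hat
```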

4.2.3 Thresholds

By definition, \({\widehat{G}}\) is an estimate of relevance so the obvious way to select relevant features is to set a threshold \(\tau \), identify a feature i as relevant for action a exactly when \({\widehat{G}}(a, i; {\varvec{P}}_0) > \tau \), retain the features that are relevant according to this criterion and discard other features.

However, this approach is a bit too naive for (at least) two reasons. The first is that our empirical estimate of relevance \({\widehat{G}}\) may in fact be far from the true relevance g. The second is that some features may be highly (positively or negatively) correlated with the remaining features, and hence convey less information. To deal with these objections, we construct thresholds \(\tau (a,i)\) as a weighted sum of an empirical estimate of the error in using \({\widehat{G}}\) instead of g and the (average absolute) correlation of feature type i with other feature types.

To define the first term we need an empirical (data-dependent) bound on \(|{\widehat{G}} - g|\). To derive such a bound we use the empirical Bernstein inequality (Maurer and Pontil 2009; Audibert et al. 2009). (We emphasize that our bound depends on the empirical variance of the estimates.) To simplify notation, define random variables \(U(a; {\varvec{P}}_0) \equiv \frac{{\mathbb {I}}(A = a) R^{\text {obs}}}{p_0(A | {\varvec{X}})}\) and \(U_j(a; {\varvec{P}}_0) \equiv \frac{{\mathbb {I}}(A_j = a) R_j^{\text {obs}}}{p_{0,j}}\). The corresponding expectations, sample means and sample variances are:

$$\begin{aligned}&{\mathbb {E}}_{({\varvec{X}}, A, R^{\text {obs}}) \sim p_0}[ U(a; {\varvec{P}}_0) ] = {\bar{r}}(a), \\&{\mathbb {E}}_{({\varvec{X}}, A, R^{\text {obs}}) \sim p_0}[ U(a; {\varvec{P}}_0) \big | X_i = x_i ] = {\bar{r}}(a, x_i) \\&{\widehat{U}}(a; {\varvec{P}}_0) = {\widehat{R}}(a ; {\varvec{P}}_0) \\&= \frac{1}{n}\sum _{j=1}^n U_j(a; {\varvec{P}}_0), \\&{\widehat{U}}(a, x_i; {\varvec{P}}_0) = {\widehat{R}}(a, x_i ; {\varvec{P}}_0) \\&= \frac{1}{N(x_i)}\sum _{j \in {\mathcal {J}}(x_i)} U_j(a; {\varvec{P}}_0), \\&V_n(a ; {\varvec{P}}_0) = \frac{1}{n-1}\sum _{j=1}^n \left( U_j(a ; {\varvec{P}}_0) - {\widehat{U}}(a; {\varvec{P}}_0)\right) ^2, \\&V_n(a, x_i ; {\varvec{P}}_0) = \frac{1}{N(x_i)-1}\sum _{j \in {\mathcal {J}}(x_i)} \left( U_j(a; {\varvec{P}}_0) - {\widehat{U}}(a, x_i ; {\varvec{P}}_0)\right) ^2. \end{aligned}$$

The weighted average sample variance is:

$$\begin{aligned} {\bar{V}}_n(a, i ; {\varvec{P}}_0) = \sum _{x_i \in {\mathcal {X}}_i} \frac{N(x_i) V_n(a, x_i ; {\varvec{P}}_0)}{n} \end{aligned}$$
(6)

Our empirical (data-dependent) bound is given in Theorem 1.

Theorem 1

For every \(n >0\), every \(\delta \in \left[ 0,\frac{1}{3}\right] \), and every pair \((a,i) \in {\mathcal {A}} \times {\mathcal {F}}\), with probability at least \(1 - 3 \delta \) we have:

$$\begin{aligned} |{\widehat{G}}(a,i ; {\varvec{P}}_0) - g(a,i)|\le & {} B \Bigg ( \ \sqrt{\frac{2 b_i \ln (3/\delta ) {\bar{V}}_n(a,i; {\varvec{P}}_0)}{n}} \\&+ \sqrt{\frac{2 \ln (3/\delta ) V_n(a; {\varvec{P}}_0) }{n}} \\&+ \frac{M \left( b_i + 1\right) \ln 3/\delta }{n} \ \Bigg ) \\&+ \sqrt{\frac{2 \left( \ln 1/\delta + b_i \ln 2\right) }{n}}, \end{aligned}$$

where \(M = \max _{a \in {\mathcal {A}}} \max _{{\varvec{x}} \in {\mathcal {X}}} 1/p_0(a | {\varvec{x}})\).

The error bound given by Theorem 1 consists of four terms: The first term arises from estimation error of \({\widehat{R}}(a, x_i)\). The second term arises from estimation error of \({\widehat{R}}(a)\). The third term arises from estimation error of feature arrival probabilities. The fourth term arises from randomness of the logging policy.

Now write \(\rho _{i,j}\) for the Pearson correlation coefficient between two feature types i and j. (Recall that \(\rho _{i,j} = +1\) if ij are perfectly positively correlated, \(\rho _{i,j} = -1\) if ij are perfectly negatively correlated, and \(\rho _{i,j} = 0\) if ij are uncorrelated.) Then the average absolute correlation of feature type i with other features is

$$\begin{aligned} \bigg (\frac{1}{d-1}\bigg ) \bigg (\sum _{j \in {\mathcal {F}} \setminus \{i\}} \left| \rho _{i,j}\right| \bigg ) \end{aligned}$$

We now define the thresholds as

$$\begin{aligned} \tau (a,i) = \lambda _1 \sqrt{\frac{ b_i {\bar{V}}_n(a,i; {\varvec{P}}_0) }{n}} \ + \ \lambda _2 \bigg (\frac{1}{d-1}\bigg ) \bigg (\sum _{j \in {\mathcal {F}} \setminus \{i\}} \left| \rho _{i,j}\right| \bigg ) \end{aligned}$$

where \(\lambda _1, \lambda _2\) are weights (hyper-parameters) to be chosen. Notice that the first term is the dominant term in the error bound given in Theorem 1, and is used to set a higher bar for the feature types that are creating the logging policy bias. The statistical distributions of those features within the action population and within the whole population will be different. By setting the threshold as above, we trade off among three objectives: (1) selecting the features that are relevant for the rewards of the actions, (2) eliminating the features which create the logging policy bias, (3) minimizing the redundancy in the feature space.
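A minimal sketch of this threshold computation (again our own illustration; the names are ours and the variance term mirrors Eq. (6)):

```python
import numpy as np

def relevance_threshold(X, A, R_obs, P0, a, i, lam1, lam2):
    """Sketch of tau(a, i): the dominant (variance) term of the Theorem 1 bound plus
    the average absolute Pearson correlation of feature type i with the other types."""
    n, d = X.shape
    U = (A == a) * R_obs / P0
    values = np.unique(X[:, i])
    b_i = len(values)
    # weighted average sample variance \bar{V}_n(a, i; P_0) of Eq. (6)
    V_bar = 0.0
    for x_i in values:
        J = (X[:, i] == x_i)
        if J.sum() > 1:
            V_bar += (J.sum() / n) * U[J].var(ddof=1)
    # average absolute correlation of feature type i with the remaining feature types
    corr = np.nan_to_num(np.corrcoef(X, rowvar=False))
    avg_abs_corr = np.abs(np.delete(corr[i], i)).mean()
    return lam1 * np.sqrt(b_i * V_bar / n) + lam2 * avg_abs_corr
```

Feature type i is then retained for action a exactly when the estimated relevance exceeds this threshold, which is the criterion formalized in the next subsection.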

4.2.4 Relevant feature types

Finally, we identify the set of feature types that are relevant for an action a as

$$\begin{aligned} \widehat{{\mathcal {R}}}(a) = \left\{ i \ : \ {\widehat{G}}(a,i; {{\mathbf {P}}}_0) > \tau (a,i) \right\} \end{aligned}$$
(7)

Set \(\widehat{\varvec{{\mathcal {R}}}} = \left[ \widehat{{\mathcal {R}}}(a) \right] _{a \in {\mathcal {A}}}\). Let \({\varvec{f}}_a\) denote a d-dimensional vector whose \(j\mathrm{th}\) element is 1 if j is contained in the set \(\widehat{{\mathcal {R}}}(a)\) and 0 otherwise.

4.3 Policy optimization

We now build on the identified family of relevant features to construct a policy. By definition, a (stochastic) policy is a map \(h : {\mathcal {X}} \rightarrow \triangle ({\mathcal {A}})\) which assigns to each vector of features a probability distribution \(h(\cdot | {\varvec{x}})\) over actions.

A familiar approach to the construction of stochastic policies is to use the POEM algorithm (Swaminathan and Joachims 2015a). POEM considers only linear stochastic policies; among these, POEM learns one that minimizes risk, adjusted by a variance term. Our approach is substantially more general because we consider arbitrary non-linear stochastic policies: we use a feedforward neural network to find a non-linear policy that minimizes the loss, adjusted by a regularization term. Note that we allow for very general loss and regularization terms so that our approach includes many policy optimizers. If we restricted to a neural network with no hidden layers and a specific regularization term, we would recover POEM.

Fig. 1 Neural network architecture

We propose a feedforward neural network for learning a policy \(h^{*}(\cdot | {\varvec{x}})\); the architecture of our neural network is depicted in Fig. 1. Our feedforward neural network consists of a concatenation layer that builds an action-specific representation of the input features, policy layers (\(L_p\) hidden layers with \(h_p^{(l)}\) units in the \(l{\text {th}}\) layer) that use the output of the concatenation layer to generate a policy vector \(\varPhi ({\varvec{x}}, a)\), and a softmax layer that turns the policy vectors into a stochastic policy.

For each action a, the concatenation layer takes the feature vector \({\varvec{x}}\) as an input and generates an action-specific representation \(\phi ({\varvec{x}}, a)\) according to:

$$\begin{aligned}&{\varvec{x}}_{\widehat{{\mathcal {R}}}(a)} = {\varvec{x}} \odot {\varvec{f}}_a \nonumber \\&\phi ({\varvec{x}}, a) = [{\varvec{x}}_{\widehat{{\mathcal {R}}}({\tilde{a}})} {\mathbb {I}}({\tilde{a}} = a)]_{{\tilde{a}} \in {\mathcal {A}}} \end{aligned}$$

Note that our action-specific representation \(\phi ({\varvec{x}}, a)\) is a \(d \times k\) dimensional vector in which only the part corresponding to action a is non-zero and equals \({\varvec{x}}_{\widehat{{\mathcal {R}}}(a)}\). For each action a, the policy layers use the action-specific representation \(\phi ({\varvec{x}}, a)\) generated by the concatenation layer and generate the output vector \(\varPhi ({\varvec{x}},a)\) according to:

$$\begin{aligned} \varPhi ({\varvec{x}}, a) = \rho \left( {\varvec{W}}_{L_p}^{(p)} \rho \left( \ldots \rho \left( {\varvec{W}}_1^{(p)} \phi ({\varvec{x}}, a) + {\varvec{b}}_1^{(p)} \right) \ldots \right) + {\varvec{b}}_{L_p}^{(p)} \right) \end{aligned}$$

where \({\varvec{W}}_l^{(p)}\) and \({\varvec{b}}_l^{(p)}\) are the weights and bias vector of the \(l{\text {th}}\) policy layer, respectively, and \(\rho (\cdot )\) is the activation function. The outputs of the policy layers are used to generate a policy by a softmax layer:

$$\begin{aligned} h(a | {\varvec{x}}) = \frac{\exp ( {\varvec{w}}^T \varPhi ({\varvec{x}}, a) )}{\sum _{a' \in {\mathcal {A}}} \exp ( {\varvec{w}}^T \varPhi ({\varvec{x}}, a') )}. \end{aligned}$$
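To fix ideas, the following framework-free Python sketch implements this forward pass (the experiments in Sect. 7 train an equivalent network in TensorFlow; the weight shapes, variable names and sigmoid activation here are our illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ponn_forward(x, f_masks, W, b, w):
    """Sketch of the Fig. 1 forward pass. f_masks is a (k, d) 0/1 array of the
    indicators f_a; W, b are the policy-layer weights/biases (shared across actions);
    w is the final scoring vector."""
    k, d = f_masks.shape
    scores = np.empty(k)
    for a in range(k):
        x_rel = x * f_masks[a]                 # x_{R(a)} = x ⊙ f_a
        phi = np.zeros(d * k)
        phi[a * d:(a + 1) * d] = x_rel         # phi(x, a): the a-th block holds x_{R(a)}
        out = phi
        for W_l, b_l in zip(W, b):             # Phi(x, a) via the policy layers
            out = sigmoid(W_l @ out + b_l)
        scores[a] = w @ out                    # w^T Phi(x, a)
    e = np.exp(scores - scores.max())
    return e / e.sum()                         # h(. | x), the stochastic policy

# Hypothetical shapes: two policy layers of 100 units would use
# W = [A1 of shape (100, d * k), A2 of shape (100, 100)], b = [b1, b2], w of length 100.
```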

Then, we choose the parameters of the policy to minimize an objective of the following form: \(\text {Loss}(h; {\mathcal {D}}) + \lambda _3 {\mathcal {R}}(h; {\mathcal {D}})\), where \(\text {Loss}(h; {\mathcal {D}})\) is the loss term, \({\mathcal {R}}(h; {\mathcal {D}})\) is a regularization term and \(\lambda _3 >0\) represents the trade-off between loss and regularization. The loss function can be either the negative IPS estimate or the corrected cross entropy loss introduced in the next section. Depending on the choice of the loss function and regularizer, our policy optimizer can include a wide range of objectives, including the POEM objective (Swaminathan and Joachims 2015a).

In the next subsection, we propose a new objective, which we refer to as the Policy Neural Network (PONN) objective.

4.4 Policy neural network (PONN) objective

Our PONN objective is motivated by the cross-entropy loss used in the standard multi-class classification setting, where the loss function used to train the neural network is the standard cross entropy:

$$\begin{aligned} \widehat{\text {Loss}}_{c}(h) = \frac{1}{n} \sum _{j=1}^n \sum _{a \in {\mathcal {A}}} -R_j(a) \log h(a | {\varvec{X}}_j). \end{aligned}$$

However, this loss function is not applicable in our setting, for two reasons. The first is that only the rewards of the action taken by the logging policy are recorded in the dataset, not the counterfactuals. The second is that we need to correct the bias in the dataset by weighting the instances by their inverse propensities. Hence, we use the following modified cross entropy loss function:

$$\begin{aligned} \widehat{\text {Loss}}_{b}(h; {\varvec{P}}_0)= & {} \frac{1}{n} \sum _{j=1}^n \sum _{a \in {\mathcal {A}}} \frac{- R_j(a) \log h(a | {\varvec{X}}_j) {\mathbb {I}} (A_j =a)}{p_{0,j}} \nonumber \\= & {} \frac{1}{n} \sum _{j=1}^n \frac{- R^{\text {obs}}_j \log h(A_j | {\varvec{X}}_j)}{p_{0,j}}. \end{aligned}$$
(8)

Note that this loss function is an unbiased estimate of the expected cross entropy loss; that is, \({\mathbb {E}}_{({\varvec{X}}, A, R) \sim p_0}\left[ \widehat{\text {Loss}}_{b}(h; {\varvec{P}}_0) \right] = {\mathbb {E}} \left[ \widehat{\text {Loss}}_{c}(h) \right] \) for any policy h. We train our neural network to minimize the regularized loss using the Adam optimizer:

$$\begin{aligned} h^{*}= \mathop {\hbox {arg min}}\limits _{h \in {\mathcal {H}}}\; \widehat{\text {Loss}}_{b}(h; \widehat{{\varvec{P}}}_0)+ \lambda _3 {\mathcal {R}}(h), \end{aligned}$$

where \({\mathcal {R}}(h)\) is a regularization term to avoid overfitting and \(\lambda _3\) is a hyper-parameter that trades off between loss and regularization.
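As an illustration, the loss in Eq. (8), plus an optional regularizer, can be evaluated as follows (a minimal sketch; the array names and the \(L_2\) regularizer are our illustrative choices, not prescribed by the paper):

```python
import numpy as np

def ponn_loss(h_probs, A, R_obs, P0, l2_weights=None, lam3=0.0):
    """Sketch of the PONN objective: the propensity-corrected cross entropy of Eq. (8)
    plus an (illustrative) L2 regularizer. h_probs[j, a] = h(a | X_j)."""
    n = len(A)
    log_h_taken = np.log(h_probs[np.arange(n), A] + 1e-12)   # log h(A_j | X_j)
    loss = np.mean(-R_obs * log_h_taken / P0)                # corrected cross entropy
    reg = 0.0 if l2_weights is None else sum(np.sum(W ** 2) for W in l2_weights)
    return loss + lam3 * reg   # in training, the same expression is minimized with Adam via autodiff
```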

4.5 Unknown propensities

As we have noted, in most settings the logging policy is unknown and hence the actual propensities are also unknown so we must estimate propensities from the dataset and use the estimated propensities to correct the bias. In general, this can be accomplished by any supervised learning technique.

For our purposes we estimate propensities by fitting the multinomial logistic regression model:

$$\begin{aligned} \ln ( \Pr \left( A = a \,|\, {\varvec{X}} \right) ) = \varvec{\beta }_{0,a}^T {\varvec{X}} - \ln Z \end{aligned}$$
(9)

where \(Z = \sum _{a \in {\mathcal {A}}} \exp \left( \varvec{\beta }_{0,a}^T {\varvec{X}} \right) \). The estimated propensities are

$$\begin{aligned} {\widehat{p}}_{0,j} \equiv \frac{\exp (\varvec{\beta }_{0,A_j}^T {\varvec{X}}_j)}{Z_j} \end{aligned}$$

where we have written \(Z_j = \sum _{a \in {\mathcal {A}}} \exp (\varvec{\beta }_{0,a}^T {\varvec{X}}_j)\). Write \(\widehat{{\varvec{P}}}_0 = [{\widehat{p}}_{0,j}]_{j=1}^n\) for the vector of estimated propensities.
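For concreteness, a minimal sketch of this estimation step, using scikit-learn as one possible implementation (the paper does not prescribe a particular fitting routine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_propensities(X, A):
    """Sketch of propensity estimation via the multinomial logistic model of Eq. (9).
    Actions are assumed to be coded 0, ..., k-1."""
    clf = LogisticRegression(max_iter=1000)     # lbfgs solver; multinomial when k > 2
    clf.fit(X, A)
    probs = clf.predict_proba(X)                # \hat{p}_0(a | X_j) for every action a
    return probs[np.arange(len(A)), A]          # \hat{p}_{0,j}: propensity of the logged action

# The truncated importance weights introduced below are then min(1 / \hat{p}_{0,j}, m):
# w_trunc = np.minimum(1.0 / estimate_propensities(X, A), m)
```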

In principle, we could use these estimated propensities in place of known propensities and proceed exactly as we have done above. However, there are two problems with doing this. The first is that if the estimated propensities are very small (which might happen because the data was not completely representative of the true propensities), the variance of the estimate \({\widehat{G}}\) will be too large. The second is that the thresholds we have constructed when propensities are known may no longer be appropriate when propensities must be estimated.

To avoid the first problem, we follow Ionides (2008) and modify the estimated rewards by truncating the importance sampling weights. This leads to “truncated” estimated rewards as follows:

$$\begin{aligned}&{\widehat{R}}_m(a, x_i; \widehat{{\varvec{P}}}_0) = \frac{1}{N(x_i)}\sum _{j \in {\mathcal {J}}(x_i)} \min \left( \frac{{\mathbb {I}}(A_j = a)}{{\widehat{p}}_{0,j}}, m \right) R_j^{\text {obs}}, \\&{\widehat{R}}_m(a; \widehat{{\varvec{P}}}_0) = \frac{1}{n}\sum _{j =1}^n \min \left( \frac{{\mathbb {I}}(A_j = a)}{{\widehat{p}}_{0,j}}, m \right) R_j^{\text {obs}}. \end{aligned}$$

Given these “truncated” estimated rewards, we define a “truncated” estimator of relevance by

$$\begin{aligned} {\widehat{G}}_m(a, i ; \widehat{{\varvec{P}}}_0) = \sum _{x_i \in {\mathcal {X}}_i} \frac{N(x_i)}{n} l \left( {\widehat{R}}_m(a, x_i ; \widehat{{\varvec{P}}}_0) - {\widehat{R}}_m(a; \widehat{{\varvec{P}}}_0) \right) \end{aligned}$$

From this point on, we proceed exactly as before, using the “truncated” estimator \({\widehat{G}}_m\) instead of \({\widehat{G}}\).

Note that \({\widehat{R}}_m(a, x_i; \widehat{{\varvec{P}}}_0)\) and \({\widehat{R}}_m(a; \widehat{{\varvec{P}}}_0)\) are not unbiased estimators of \({\bar{r}}(a, x_i)\) and \({\bar{r}}(a)\). The bias is due to using estimated truncated propensity scores which may deviate from true propensities. Let \({\text {bias}}({\widehat{R}}_m(a ; \widehat{{\varvec{P}}}_0))\) denote the bias of \({\widehat{R}}_m(a ; \widehat{{\varvec{P}}}_0)\), which is given by

$$\begin{aligned} {\text {bias}}({\widehat{R}}_m(a ; \widehat{{\varvec{P}}}_0)) = {\bar{r}}(a) - {\mathbb {E}}\left[ {\widehat{R}}_m(a ; \widehat{{\varvec{P}}}_0)\right] . \end{aligned}$$

In the Appendices, we show the effect of this bias on the learning process.


5 Pseudo-code for the algorithm PONN-B

Below, we provide the pseudo-code for our algorithm, which we call PONN-B (because it uses the PONN objective and Step B), exactly as discussed above. The first three steps constitute the offline training phase; the fourth step is the online execution phase. Within the training phase the steps are:

  Step A: Input propensities (if they are known) or estimate them using a logistic regression (if they are not known).

  Step B: Construct estimates of relevance (truncated if propensities are estimated), construct thresholds (using given hyper-parameters) and identify the relevant features as those for which the estimated relevance is above the constructed thresholds.

  Step C: Use the Adam optimizer to train neural network parameters.

In the execution phase: input the features of the new instance, apply the optimal policy to find a probability distribution over actions, and draw a random sample action from this distribution.
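Since the original pseudo-code figures are not reproduced here, the following Python sketch summarizes the same steps; it reuses the helper sketches given earlier (Sects. 4.2.2, 4.2.3 and 4.5), and fit_policy_network is a hypothetical stand-in for the Step C training routine (e.g., Adam applied to the objective in Eq. (8)):

```python
import numpy as np

def train_ponn_b(X, A, R_obs, lam1, lam2, fit_policy_network, P0=None):
    """High-level sketch of the PONN-B training phase.
    estimate_propensities, estimated_relevance and relevance_threshold refer to the
    sketches above; fit_policy_network is supplied by the caller (hypothetical helper)."""
    n, d = X.shape
    k = int(A.max()) + 1
    if P0 is None:                                            # Step A: input or estimate propensities
        P0 = estimate_propensities(X, A)
    f_masks = np.zeros((k, d))
    for a in range(k):                                        # Step B: action-dependent relevance
        for i in range(d):
            G_hat = estimated_relevance(X, A, R_obs, P0, a, i)
            if G_hat > relevance_threshold(X, A, R_obs, P0, a, i, lam1, lam2):
                f_masks[a, i] = 1.0
    policy = fit_policy_network(X, A, R_obs, P0, f_masks)     # Step C: minimize the PONN objective
    return policy, f_masks

def recommend(policy, x_new):
    """Execution phase: evaluate h(. | x_new) and draw a random action from it."""
    probs = policy(x_new)
    return np.random.default_rng().choice(len(probs), p=probs)
```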

6 Extension: relevant feature selection with fine gradations

Our algorithm might be inefficient when a particular feature type can take many values—in particular, if one or more feature types are continuous. In that setting, we can modify our algorithm to create bins that consist of similar feature values and treat all the values in a single bin identically. In order to formalize this problem conveniently, we assume that the feature space is actually continuous; for simplicity we assume each feature type is \({\mathcal {X}}_i = \left[ 0,1\right] \) (or a bounded subset). In this case, we can partition the feature space into subintervals (bins), view features in each bin as identical, and apply our algorithm to the finite set of bins. To offer a theoretical justification for this procedure, we assume that similar features yield similar expected rewards. We formalize this as a Lipschitz condition.

Assumption 3

There exists \(L > 0\) such that for all \(a \in {\mathcal {A}}\), all \(i \in {\mathcal {F}}\) and all \(x_i \in {\mathcal {X}}_i\), we have \(|{\bar{r}}(a, x_i) - {\bar{r}}(a, {\tilde{x}}_i)| \le L |x_i - {\tilde{x}}_i|\).

(In the Multi-Armed Bandit literature (Slivkins 2014; Tekin and van der Schaar 2014) this assumption is commonly made and sometimes referred to as similarity.)

For convenience, we partition each feature space \({\mathcal {X}}_i\) into s equal subintervals (bins) of length 1/s. If s is small, the number of bins is small, so, given a finite data set, the number of instances that lie in each bin is relatively large; this is useful for estimation. However, when s is small the size 1/s of each bin is relatively large, so the (true) variation of expected rewards within each bin is relatively large. Because we are free to choose the parameter s, we can balance the trade-off between choosing a few large bins and choosing many small bins; a useful trade-off is achieved by taking \(s = \left\lceil {n^{1/3}}\right\rceil \).
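A minimal sketch of this discretization step (our own illustration; after binning, the estimators of Sect. 4 are applied to the bin indices exactly as to discrete features):

```python
import numpy as np

def bin_continuous_features(X_cont):
    """Split each continuous feature type, assumed to lie in [0, 1], into
    s = ceil(n^(1/3)) equal-width bins and replace each value by its bin index."""
    n = X_cont.shape[0]
    s = int(np.ceil(n ** (1.0 / 3.0)))
    interior_edges = np.linspace(0.0, 1.0, s + 1)[1:-1]
    # np.digitize maps each value to a bin index in {0, ..., s-1}
    binned = np.stack([np.digitize(X_cont[:, i], interior_edges)
                       for i in range(X_cont.shape[1])], axis=1)
    return binned, s
```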

So begin by fixing \(s = \left\lceil {n^{1/3}}\right\rceil \) and partition each \({\mathcal {X}}_i =\left[ 0,1\right] \) into s intervals of length 1/s. Write \({\mathcal {C}}_{i}\) for the set of intervals in the partition of \({\mathcal {X}}_i\) and write \(c_i\) for a typical element of \({\mathcal {C}}_{i}\). For each sample j, let \(c_{i,j}\) denote the interval to which the feature \(X_{i,j}\) belongs. Let \({\mathcal {J}}(c_i)\) be the set of indices for which \(X_{i,j} \in c_i\); \({\mathcal {J}}(c_i) = \{ j \in \{1,2,\ldots ,n\}: X_{i,j} \in c_i \}\). We define the truncated IPS estimates as

$$\begin{aligned} {\bar{r}}_m(a, c_i; \widehat{{\varvec{P}}}_0)= & {} {\mathbb {E}}\left[ U(a; \widehat{{\varvec{P}}}_0) | X_i \in c_i \right] \\= & {} {\mathbb {E}}\left[ \min \left( \frac{{\mathbb {I}}(A = a)}{{\widehat{p}}_0(A | {\varvec{X}})}, m\right) R^{\text {obs}} \bigg | X_i \in c_i \right] , \\ {\widehat{R}}_m(a, c_i ; \widehat{{\varvec{P}}}_0)= & {} \frac{1}{N(c_i)} \sum _{j \in {\mathcal {J}}(c_i)} \min \left( \frac{{\mathbb {I}}(A_j = a) }{{\widehat{p}}_{0,j}}, m \right) R_j^{\text {obs}}, \end{aligned}$$

where \(N(c_i) =|{\mathcal {J}}(c_i)|\). In this case, we define the estimated relevance as

$$\begin{aligned} {\widehat{G}}_m(a, i) = \sum _{c_i \in {\mathcal {C}}_i} \frac{N(c_i)}{n} l \left( {\widehat{R}}_m(a, c_i ; \widehat{{\varvec{P}}}_0) - {\widehat{R}}_m(a ; \widehat{{\varvec{P}}}_0) \right) . \end{aligned}$$

We define the following sample mean and variance:

$$\begin{aligned}&{\widehat{U}}(a, c_i ; \widehat{{\varvec{P}}}_0) = {\widehat{R}}_m(a, c_i ; \widehat{{\varvec{P}}}_0) = \frac{1}{N(c_i)}\sum _{j \in {\mathcal {J}}(c_i)} U_j(a; \widehat{{\varvec{P}}}_0), \\&V_n(a, c_i ; \widehat{{\varvec{P}}}_0) = \frac{1}{N(c_i)-1}\sum _{j \in {\mathcal {J}}(c_i)} (U_j(a; \widehat{{\varvec{P}}}_0)- {\widehat{U}}(a, c_i ;\widehat{{\varvec{P}}}_0))^2. \end{aligned}$$

Let \({\bar{V}}_n(a, i ; \widehat{{\varvec{P}}}_0) = \sum _{c_i \in {\mathcal {C}}_i} \frac{N(c_i) V_n(a, c_i ; \widehat{{\varvec{P}}}_0) }{n}\) denote the weighted average sample variance.

Given these definitions, we establish a data-dependent bound analogous to Theorem 1.

Theorem 2

For every \(n \ge 1\) and \(\delta \in \left[ 0,\frac{1}{3}\right] \), if \(s = \left\lceil {n^{1/3}}\right\rceil \), then with probability at least \(1 - 3 \delta \) we have, for all pairs \((a,i) \in {\mathcal {A}} \times {\mathcal {F}}\),

$$\begin{aligned} |{\widehat{G}}_m(a,i ; \widehat{{\varvec{P}}}_0) - g(a,i)|\le & {} B \Bigg ( \frac{\sqrt{4 \ln 3/\delta }}{n^{1/3}} \left( \sqrt{{\bar{V}}_n(a,i ; \widehat{{\varvec{P}}}_0)} + \sqrt{V_n(a ;\widehat{{\varvec{P}}}_0)}\right) + \frac{L}{n^{1/3}} \nonumber \\&+ \left| {\text {bias}}({\widehat{R}}_m(a ; \widehat{{\varvec{P}}_0}))\right| + {\mathbb {E}} \left| {\text {bias}}({\widehat{R}}_m(a, X_i ; \widehat{{\varvec{P}}_0}))\right| \Bigg ) \nonumber \\&+ \ \frac{4m B \ln 3/\delta + \sqrt{2 \ln 1/\delta + \ln 2}}{n^{2/3}}. \end{aligned}$$

There are two main differences between Theorems 1 and 2. The first is that the estimation error is decreasing as \(n^{1/3}\) (Theorem 2) rather than as \(n^{1/2}\) (Theorem 1). The second is that there is an additional error in Theorem 2 arising from the Lipschitz bound.

Theorem 2 suggests a different choice of thresholds, namely:

$$\begin{aligned} \tau (a,i)= & {} \lambda _1 n^{-1/3}\sqrt{{\bar{V}}_n(a,i; \widehat{{\varvec{P}}}_0)} + \lambda _2\left( \frac{1}{d-1}\right) \left( \sum _{l \in {\mathcal {F}} \setminus \{i\}} \left| \rho _{i,l}\right| \right) . \end{aligned}$$

With this change we proceed exactly as before.

7 Numerical results

Here we describe the performance of our algorithm on some real datasets. Note that it is difficult (perhaps impossible) to validate and test the algorithm on the basis of actual logged CMAB data unless the counterfactual action rewards for each instance are available—which would (almost) never be the case. One way to validate and test our algorithm is to use a multi-class classification dataset, generate a biased CMAB dataset for training by “forgetting” (stripping out) the counterfactual information, apply the algorithm, and then test the predictions of the algorithm against the actual data (Beygelzimer et al. 2009). This is the route we follow in the first experiment below. Another way to validate and test our algorithm is to use an alternative accepted procedure to infer counterfactuals and to test the prediction of our algorithm against this alternative accepted procedure. This is the route we follow in the second experiment below.

Table 2 Data summary

7.1 Multi-class classification

For this experiment we use existing multi-class classification datasets from the well-known UCI Machine Learning Repository.

  • In the Pendigits and Optdigits datasets, each instance is described by a collection of pixels extracted from the image of a handwritten digit 0-9; the objective is to identify the digit from the features.

  • In the Satimage dataset, each instance is described by an array of features extracted from a satellite image of a plot of ground; the objective is to identify the true description of the plot (barren soil, grass, cotton crop, etc.) from the features.

These datasets have in common that they have many instances, many feature types and many labels, so they are extremely useful for training and testing.

In supervised learning, we assume that features and labels are generated by an i.i.d. process, i.e., \(\left( {\varvec{X}}, Y\right) \sim Z\), where \({\varvec{X}} \in {\mathcal {X}}\) is the feature vector and \(Y \in \{1,2,\ldots , k \}\) is the label. The supervised learning data with n samples is denoted \({\mathcal {D}}^n = \left( {\varvec{X}}_j, Y_j\right) _{j=1}^n\). In our simulation setup, we treat each class as an action. We also included 16 irrelevant features, drawn randomly from a normal distribution, in addition to the actual features in each dataset. The reward of an action is given by \(R_j(a) = {\mathbb {I}}(Y_j = a)\). The complete dataset is then \({\mathcal {D}}^n_{\text {com}} = \left( {\varvec{X}}_j, R_j(1), \ldots , R_j(k)\right) _{j=1}^n\). A summary of the data is given in Table 2.

7.2 Comparisons

We compare the performance of our algorithm (PONN-B) with the following benchmarks:

  • PONN is PONN-B but without Step B (feature selection).

  • POEM is the standard POEM algorithm (Swaminathan and Joachims 2015a).

  • POEM-B applies Step B of our algorithm, followed by the POEM algorithm.

  • POEM-L1 is the POEM algorithm with the addition of \(L_1\) regularization.

  • Multilayer Perceptron with \(L_1\) regularization (MLP-L1) is the MLP algorithm on the concatenated input \(({\varvec{X}}, A)\) with \(L_1\) regularization.

  • Logistic Regression with \(L_1\) regularization (LR-L1) is a separate LR algorithm on input \({\varvec{X}}\) for each action a, with \(L_1\) regularization.

  • Logging is the logging policy performance.

(In all cases, the objective is optimized with the Adam Optimizer.)

7.2.1 Simulation setup

We generate an artificially biased dataset using the following logistic model. We first draw weights for each label from a multivariate Gaussian distribution, that is, \(\theta _{0,y} \sim {\mathcal {N}}(0, \kappa I)\). We then use the logistic model to generate an artificially biased logged off-policy dataset \({\mathcal {D}}^n = \left( {\varvec{X}}_j, A_j, R_j^{\text {obs}} \right) _{j=1}^n\) by first drawing an action \(A_j \sim p_0( \cdot | {\varvec{X}}_j)\), then setting the observed reward as \(R_j^{\text {obs}} \equiv R_j(A_j)\). (We use \(\kappa = 0.25\) for pendigits and \(\kappa =0.5\) for satimage and optdigits.) This bandit generation process makes the learning problem very challenging because the generated off-policy dataset contains relatively few observations for each label.
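A minimal sketch of this supervised-to-bandit conversion (our own illustration; the variable names are ours):

```python
import numpy as np

def make_bandit_dataset(X, Y, k, kappa, seed=0):
    """Convert a multi-class classification dataset into logged bandit feedback:
    draw logging weights theta_{0,y} ~ N(0, kappa I), sample the logged action from the
    induced softmax logging policy, and record only that action's reward I(Y_j = A_j)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=np.sqrt(kappa), size=(k, X.shape[1]))
    logits = X @ theta.T                                  # (n, k)
    p0 = np.exp(logits - logits.max(axis=1, keepdims=True))
    p0 /= p0.sum(axis=1, keepdims=True)                   # logging policy p_0(a | x)
    A = np.array([rng.choice(k, p=p) for p in p0])        # logged actions
    R_obs = (A == Y).astype(float)                        # observed rewards I(Y_j = A_j)
    P0 = p0[np.arange(len(Y)), A]                         # true propensities of the logged actions
    return A, R_obs, P0

# e.g. A, R_obs, P0 = make_bandit_dataset(X_train, Y_train, k=10, kappa=0.25)
```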

We randomly divide the datasets into \(70\%\) training and \(30\%\) testing sets. We also randomly sequester \(30\%\) of the training set as a validation set. We train all algorithms for various parameter sets on the training set, validate the hyper-parameters on the validation set, and test on the testing set. We evaluate our algorithm with \(L_r = 2\) representation layers and \(L_p = 2\) policy layers, with 50 hidden units in the representation layers and 100 hidden units (sigmoid activation) in the policy layers. We implemented and trained all algorithms in a TensorFlow environment using the Adam optimizer.

For each algorithm g, let \(h_g^{*}\) denote its optimized policy. Let \({\mathcal {J}}_{test}\) denote the set of instances in the testing set and \(N_{test} = |{\mathcal {J}}_{test}|\) the number of instances in the testing set. We define the (absolute) accuracy of an algorithm g as

$$\begin{aligned} {\text {Acc}}(g) = \frac{1}{N_{test}} \sum _{j \in {\mathcal {J}}_{test}} \sum _{a \in {\mathcal {A}}} h_g^{*}(a | {\varvec{X}}_j) R_j(a). \end{aligned}$$
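In code, this amounts to the following small sketch, where h_probs_test holds \(h_g^{*}(a | {\varvec{X}}_j)\) for the test instances and R_test holds the full reward matrix (available here because the underlying classification labels are known):

```python
import numpy as np

def absolute_accuracy(h_probs_test, R_test):
    """Sketch of Acc(g): the expected reward of the stochastic policy on the test set,
    with R_test[j, a] = I(Y_j = a) for the classification datasets."""
    return float(np.mean(np.sum(h_probs_test * R_test, axis=1)))
```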

We select the hyper-parameters \(\lambda _1^{*} \in [0.005, 0.1], \lambda _2^{*} \in [0,0.01]\) and \(\lambda _3^{*} \in [0.0001, 0.1]\) that minimize the loss given in (8), estimated from the samples in the validation set. We then compute the accuracy of each algorithm on the full testing set.

Table 3 Absolute accuracy in the UCI experiment (with \(95\%\) CI)

In the next subsection, we describe the performance of each algorithm on the three publicly available datasets. In each case, we run 25 iterations, following the procedure described above; we report the average over iterations with \(95\%\) confidence intervals.

7.2.2 Results

In order to present a tough challenge to our algorithm, we assume that the true propensities are not known and so must be estimated. Table 3 describes the absolute accuracy of each algorithm on each dataset. As can be seen, our algorithm outperforms all the benchmarks on each dataset at the \(95\%\) confidence level.

We define loss with respect to the “perfect” algorithm that would predict accurately all of the time, so the loss of the algorithm g is \(1 -\text {Acc}(g)\). We evaluate the improvement of our algorithm over each other algorithm as the ratio of the actual loss reduction to the possible loss reduction, expressed as a percentage:

$$\begin{aligned} \text {Improvement Score}(g) = \frac{\text {Acc}(\mathrm{PONN}\mathrm{-}\mathrm{B}) - \text {Acc}(g)}{1 - \text {Acc}(g)} \end{aligned}$$

The Improvement Score of each algorithm g with respect to our algorithm is presented in Table 4. Note that our algorithm achieves significant Improvement Scores in all three datasets.

Table 4 Improvement scores in the UCI experiment

7.3 Chemotherapy regimens for breast cancer patients

In this subsection, we apply our algorithm to recommending chemotherapy regimens for breast cancer patients. We evaluate our algorithm on a dataset of 10,000 records of breast cancer patients participating in the National Surgical Adjuvant Breast and Bowel Project (NSABP), as used by Yoon et al. (2017). Each instance consists of the following information about the patient: age, menopausal status, race, estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2 (HER2NEU), tumor stage, tumor grade, Positive Axillary Lymph Node Count (PLNC), WHO score, surgery type, prior chemotherapy, prior radiotherapy and histology. The treatment is a choice among six chemotherapy regimens: AC, ACT, AT, CAF, CEF, CMF. The rewards for these regimens were derived based on 32 references from PubMed Clinical Queries; this is a medically accepted procedure. The details are given in Yoon et al. (2017).

Using these derived rewards, we construct a dataset. In this dataset, an instance is described by a triple \(({\varvec{X}}, A, R)\), where \({\varvec{X}}\) is the 15-dimensional feature vector encoding the information about the particular patient, A is a chemotherapy regimen, and R is the reward (survival/non-survival) for that chemotherapy regimen for that patient. In the dataset, A is generated in the same way as in the first experiment (with \(\kappa = 0.25\)) and R is the reward derived by Yoon et al. (2017).

As in the previous experiment, in comparing algorithms, we consider absolute accuracy and the improvement score. In this context, we define the absolute accuracy of an algorithm g as the probability that its recommendation matches the chemotherapy regimen with the highest reward (according to best medical practice); i.e.

$$\begin{aligned} Acc(g) = \frac{1}{N_{test}} \sum _{j \in {\mathcal {J}}_{test}} \sum _{a \in {\mathcal {A}}} h_g^{*}(a | {\varvec{X}}_j) {\mathbb {I}}(a = A^*_j) \end{aligned}$$

As before, we define the Improvement Score with respect to relative loss.

Table 5 Performance in the breast cancer experiment
Fig. 2 Effect of the hyperparameter on the accuracy of our algorithm

Table 5 describes the absolute accuracy and the Improvement Scores of our algorithm. Our algorithm achieves significant Improvement Scores with respect to all benchmarks. There are two main reasons for these improvements. The first is that using Step B (feature selection) reduces over-fitting; this can be seen by the improvement of PONN-B over PONN and by the fact that PONN-B improves more over POEM (which does not use Step B) than over POEM-B (which does use feature selection). The second is that PONN-B allows for non-linear policies, which reduces model misspecification.

Note that our action-dependent relevance discovery is also important for interpretability. The relevant features selected by our algorithm with \(\lambda _1 = 0.03\) are as follows: age, tumor stage and tumor grade for the AC treatment action; age, tumor grade and lymph node status for the ACT treatment action; menopausal status and surgery type for the CAF treatment action; age and estrogen receptor for the CEF treatment action; and estrogen receptor and progesterone receptor for the CMF treatment action.

Figure 2 shows the accuracy of our algorithm for different choices of the hyper-parameter \(\lambda _1\). As expected—and seen in Fig. 2—if \(\lambda _1\) is too small then there is overfitting; if it is too large then many relevant features are discarded. We have chosen the value of \(\lambda _1\) that maximizes accuracy.

8 Conclusion

This paper introduces a new approach and algorithm for the construction of effective policies when the dataset is biased and does not contain counterfactual information. The heart of our method is the ability to identify a small number of (most) relevant features—despite the bias and missing counterfactuals. When tested on a wide variety of data, the algorithm we introduce achieves significant improvement over state-of-the-art methods.