1 Introduction

A promising avenue of research suggests making inferences about the data generating process by analyzing machine learning models using Interpretable Machine Learning (IML). The Partial Dependence Plot (PDP) (Friedman et al. 1991) and Permutation Feature Importance (PFI) (Breiman 2001) are model-agnostic tools (i.e., they work for any kind of machine learning model) that have been used for scientific discoveries. Applications range from medicine (Boulesteix et al. 2020; Stiglic et al. 2020; Pintelas et al. 2020) and the social sciences (Stachl et al. 2020; Zhao et al. 2020) to ecology (Bair et al. 2013; Esselman et al. 2015; Obringer and Nateghi 2018). PDP and PFI are used to study the effect and importance of features: the PDP visualizes how a change in a feature, on average, changes the predicted outcome; the PFI ranks the features by how much they contribute to the model performance.

Both PDP and PFI rely on marginal sampling of feature values. A range of work argues that marginal-sampling based interpretation techniques, including PDP and PFI, are not suitable for learning about the data generating process (Hooker and Mentch 2019; Frye et al. 2020; Chen et al. 2020; Freiesleben et al. 2022). The reason is that marginal-sampling based techniques ignore dependencies between the features and as a consequence may explain the model’s behaviour in unlikely or even unrealistic regions of the feature space.

As a solution, conditional-sampling based techniques, such as conditional permutation feature importance (cPFI) and conditional partial dependence plots (cPDP), were proposed, which only evaluate the model within the joint distribution (Strobl et al. 2008; Apley and Zhu 2016; Hooker and Mentch 2019). Given loss-optimal models, they allow insights into the data generating process. More specifically, cPFI quantifies whether knowing a feature is required to achieve the same predictive performance, such that a nonzero cPFI can be linked to conditional dependence in the data (König et al. 2020). cPDPs visualize the relationships in the data (through the model’s perspective), i.e. they describe how the conditional expectation of the outcome varies with the feature of interest (Freiesleben et al. 2022).

Although theoretically appealing, conditional-sampling based methods are more difficult to apply than marginal-sampling based methods. Existing proposals for cPFI require sampling from the conditional distribution of the feature of interest given the remaining features, which is challenging. The estimation of cPDP is especially challenging, since sampling from the multivariate conditional of the remaining features given the feature of interest is required.

Contributions: Instead of modeling the conditional distribution, we suggest learning a tree-based partitioning of the feature space into blocks within which the feature of interest is not (or at least less) correlated with the remaining features. This partitioning can be leveraged in several ways to derive interpretations that allow interesting insights. First, we can compute the well-established global cPFI by computing the PFI within each subgroup and aggregating the results. Leveraging the flexibility of tree-based learners, this approach allows the computation of cPFI for mixed continuous and categorical data. Second, in situations where the partitioning requires only a few splits, the partitioning itself is interpretable. We can then leverage the partitioning to (a) gain insight into the dependence structure in the data and (b) derive subgroup-specific versions of PFI and PDP, which reveal under which circumstances variables are relevant or have a certain effect. For instance, by applying PFI in each subgroup, we find that temperature is not predictive of bike rentals given that we know it is summer, but highly predictive if we know that it is winter. Furthermore, by looking at the PDP within each subgroup we can understand how the conditional expectation varies with temperature given that we know that it is winter.

The paper is structured as follows: We introduce our notation in Sect. 2 and discuss related work in Sect. 3. We motivate and formally introduce the conditional subgroup approach in Sect. 4. We demonstrate the usefulness of the method on benchmarks with synthetic and real data (Sect. 5) and illustrate its interpretation in a real-world application (Sect. 6).

2 Notation and background

We consider ML prediction functions \({\hat{f}}:\mathbb {R}^p \mapsto \mathbb {R}\), where \({\hat{f}}({\varvec{x}})\) is a model prediction and \({\varvec{x}}\in \mathbb {R}^p\) is a p-dimensional feature vector. We use \({\textbf{x}}_j\in \mathbb {R}^n\) to refer to an observed feature (vector) and \(X_j\) to refer to the j-th feature as a random variable. With \({\textbf{x}}_{-j}\) we refer to the complementary feature values \({\textbf{x}}_{\{1, \ldots , p\} {\setminus } \{j\}} \in \mathbb {R}^{n\times (p-1)}\) and with \(X_{-j}\) to the corresponding random variables. We refer to the value of the j-th feature from the i-th instance as \(x_{j}^{(i)}\) and to the tuples \({\mathcal {D}}= \{\left( {\textbf{x}}^{(i)}, y^{(i)}\right) \}_{i=1}^n\) as data.

The Permutation feature importance (PFI) (Breiman 2001; Fisher et al. 2019) is defined as the increase in loss when feature \(X_j\) is permuted:

$$\begin{aligned} PFI_j = \mathbb {E}[L(Y, {\hat{f}}({\tilde{X}}_j, X_{-j}))] - \mathbb {E}[L(Y, {\hat{f}}(X_j, X_{-j}))] \end{aligned}$$
(1)

The theoretical PFI for a feature \(X_j\) is the difference between the expected loss when the feature is permuted and the original loss. If the random variable \({\tilde{X}}_j\) has the same marginal distribution as \(X_j\) (e.g., permutation), the estimate yields the marginal PFI. If \({\tilde{X}}_j\) follows the conditional distribution \({\tilde{X}}_j \sim X_j | X_{-j}\), we speak of the conditional PFI. The PFI is estimated with the following formula:

$$\begin{aligned} {\widehat{PFI}}_j = \frac{1}{n} \sum _{i=1}^n\left( \frac{1}{M}\sum _{m=1}^M \left( {\tilde{L}}^{(i)}_m - L^{(i)}\right) \right) \end{aligned}$$
(2)

where \(L^{(i)}= L(y^{(i)}, {\hat{f}}({\varvec{x}}^{(i)}))\) is the loss for the i-th observation and \({\tilde{L}}^{(i)}_m=L(y^{(i)}, {\hat{f}}({\tilde{x}}_{j}^{(i)},{\varvec{x}}_{-j}^{(i)}))\) is the loss where \(x_{j}^{(i)}\) was replaced by \({\tilde{x}}_{j}^{(i)}\), the i-th feature value obtained from the m-th sample of \({\textbf{x}}_j\). The sampling can be repeated M times for a more stable estimation of \({\tilde{L}}^{(i)}\). Numerous variations of this formulation exist. Breiman (2001) proposed the PFI for random forests, which is computed from the out-of-bag samples of individual trees. Subsequently, Fisher et al. (2019) introduced a model-agnostic PFI version.
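
To make the estimator in Eq. (2) concrete, the following minimal R sketch computes the marginal PFI for a single feature. The function name pfi_marginal, the squared-error loss, and the assumption that the model has a predict method accepting a data.frame are illustrative choices, not the authors' implementation.

```r
# Minimal sketch of the marginal PFI estimator in Eq. (2).
# Assumptions (illustrative): `model` has a predict() method that accepts a
# data.frame, `X` is a data.frame, `y` a numeric vector, squared-error loss.
pfi_marginal <- function(model, X, y, feature, M = 5,
                         loss = function(y, p) (y - p)^2) {
  L_orig <- loss(y, predict(model, X))                # L^(i) on original data
  increases <- replicate(M, {
    X_tilde <- X
    X_tilde[[feature]] <- sample(X_tilde[[feature]])  # permute feature j
    loss(y, predict(model, X_tilde)) - L_orig         # tilde(L)^(i)_m - L^(i)
  })
  mean(increases)                                     # average over i and m
}
```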

The marginal partial dependence plot (PDP) (Friedman et al. 1991) describes the average effect of the j-th feature on the prediction.

$$\begin{aligned} PDP_j(x) = \mathbb {E}[{\hat{f}}(x, X_{-j})] \end{aligned}$$
(3)

The theoretical PDP is a marginalized version of the prediction function. All features with the exception of \(X_j\) are integrated out, and the p-dimensional prediction function becomes a 1-dimensional function, the PDP. There are two options: Integrate with respect to the marginal distribution \(\mathbb {P}_{X_{-j}}\) or the conditional distribution \(\mathbb {P}_{X_{-j}|X_j}\). If the expectation is conditional on \(X_j\), \(\mathbb {E}[{\hat{f}}(x, X_{-j})| X_j = x]\), we speak of the conditional PDP. The marginal PDP evaluated at feature value x is estimated using Monte Carlo integration.

$$\begin{aligned} {\widehat{PDP}}_{j}(x)=\frac{1}{n}\sum _{i=1}^n {\hat{f}}\left( x,{\varvec{x}}^{(i)}_{-j}\right) \end{aligned}$$
(4)

In other words, at any given position x along the range of \(X_j\), the PDP can be estimated by taking the data, setting \(X_j = x\) for all observations, and averaging the predictions.
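
As a hedged illustration of Eq. (4), the following R sketch estimates the marginal PDP on a user-supplied grid; pdp_marginal and its arguments are again illustrative and assume a model with a predict method.

```r
# Minimal sketch of the marginal PDP estimator in Eq. (4).
# Assumptions (illustrative): `model` has predict(), `X` is a data.frame.
pdp_marginal <- function(model, X, feature, grid = NULL) {
  if (is.null(grid)) grid <- sort(unique(X[[feature]]))
  vapply(grid, function(x) {
    X_mod <- X
    X_mod[[feature]] <- x        # set X_j = x for all observations
    mean(predict(model, X_mod))  # Monte Carlo integration over X_-j
  }, numeric(1))
}
```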

3 Related work

In this section, we review conditional variants of PDP and PFI and other approaches that try to avoid extrapolation.

3.1 Related work on conditional PDP

The conditional PDP (M-Plot) (Apley and Zhu 2016) averages the predictions locally along the feature grid and therefore mixes the effects of dependent features. To address this interpretation problem, Apley and Zhu (2016) proposed accumulated local effect (ALE) plots, which reduce extrapolation by accumulating finite differences computed within intervals of the feature of interest. By definition, interpretations of ALE plots are thus only valid locally within these intervals. Furthermore, there is no straightforward way to derive ALE plots for categorical features, since ALE requires ordered feature values. Our proposed approach can handle categorical features.

Hooker (2007) proposed a functional ANOVA decomposition with hierarchically orthogonal components, based on integration using the joint distribution of the data, which in practice is difficult to estimate.

Another PDP variant based on stratification was proposed by Parr and Wilson (2019). However, this stratified PDP describes only the data and is independent of the model.

Individual conditional expectation (ICE) curves by Goldstein et al. (2015) can be used to visualize the interactions underlying a PDP, but they also suffer from the extrapolation problem. The “conditional” in ICE refers to conditioning on individual observations and not on certain features. As a remedy, Hooker and Mentch (2019) suggested visually highlighting the areas of the ICE curves in which the feature combinations are more likely.

3.2 Related work on conditional PFI

We review approaches that modify the PFI (Breiman 2001; Fisher et al. 2019) in presence of dependent features by using a conditional sampling strategy.

Strobl et al. (2008) proposed the conditional variable importance for random forests (CVIRF), which is a conditional PFI variant of Breiman (2001). CVIRF was further analyzed and extended by Debeer and Strobl (2020). Both CVIRF and our approach rely on permutations based on partitions of decision trees. However, there are fundamental differences. CVIRF is specifically developed for random forests and relies on the splits of the underlying individual trees of the random forest for the conditional sampling. In contrast, our cs-PFI approach trains decision trees for each feature using \(X_{-j}\) as features and \(X_j\) as the target. Therefore, the subgroups for each feature are constructed from their conditional distributions (conditional on the other features) in a separate step, which is decoupled from the machine learning model to be interpreted. Our cs-PFI approach is model-agnostic, independent of the target to predict and not specific to random forests.

Hooker and Mentch (2019) made a general suggestion to replace feature values by estimates of \(\mathbb {E}[X_j|X_{-j}]\).

Fisher et al. (2019) suggested using matching and imputation techniques to generate samples from the conditional distribution. If \(X_{-j}\) has few unique combinations, they suggested grouping \(x_{j}^{(i)}\) by unique \({\varvec{x}}_{-j}^{(i)}\) combinations and permuting within these fixed groups. For discrete and low-dimensional feature spaces, they suggest non-parametric matching and weighting methods to replace \(X_j\) values. For continuous or high-dimensional data, they suggest imputing \(X_j\) with \(\mathbb {E}[X_j|X_{-j}]\) and adding residuals (under the assumption of homogeneous residuals). Our approach of permuting within subgroups can be seen as a model-driven, binary weighting approach extended to continuous features.

Knockoffs (Candes et al. 2018) are random variables which are “copies” of the original features that preserve the joint distribution but are independent of the prediction target conditional on the remaining features. Knockoffs can be used to replace feature values for conditional feature importance computation. Watson and Wright (2021) developed a testing framework for PFI based on knockoff samplers such as Model-X knockoffs (Candes et al. 2018). Our approach is complementary since Watson and Wright (2021) is agnostic to the sampling strategy that is used. Others have proposed to use generative adversarial networks for generating knockoffs (Romano et al. 2019). Knockoffs are not transparent with respect to how they condition on the features, while our approach creates interpretable subgroups.

Conditional importance approaches based on model retraining have been proposed (Hooker and Mentch 2019; Lei et al. 2018; Gregorutti et al. 2017). However, retraining the model can be expensive, and answers a fundamentally different question, often related to feature selection and not based on a fixed set of features. Hence, we focus on approaches that compute conditional PFI for a fixed model without retraining.

None of the existing approaches makes the dependence structure between the features explicit. It is unclear which of the features in \(X_{-j}\) influenced the replacement of \(X_j\) the most, and how. Furthermore, little attention has been paid to evaluating how well different sampling strategies address the extrapolation problem. We address this gap with an extensive data fidelity experiment on the OpenML-CC18 benchmarking suite. To the best of our knowledge, our paper is also the first to conduct experiments using ground truth for the conditional PFI. Our approach works with any type of feature, be it categorical, numerical or ordinal, since we rely on decision trees to find the subgroups used for conditioning. Furthermore, we are the first to discuss the trade-off between conditional and marginal PFI and PDP in depth. The differences between the (conditional) PDP and PFI approaches ultimately boil down to how they sample from the conditional distribution. Table 1 lists different sampling strategies of model-agnostic interpretation methods and summarizes the assumptions they require to preserve the joint distribution.

Table 1 Sampling strategies for model-agnostic interpretation techniques

4 Conditional subgroups

In this section, we propose a subgroup-based approach that allows us to (1) estimate the cPFI and (2) introduce novel subgroup-specific versions of PDP and PFI that provide new insights into model and data.

More specifically, we suggest to leverage tree-based learners to partition the feature space into groups \(G_j\) within which \(X_j\) is independent of the remaining features \(X_{-j}\) (Sect. 4.1). Permuting observations within such groups does not lead to extrapolation, because in each group the marginal and the conditional distribution coincide. We illustrate the idea in Fig. 1.

As a consequence, we can compute the cPFI by applying the PFI in each subgroup and aggregating the results (Sect. 4.2). Furthermore, if the data allow for a human-intelligible partitioning, we can also interpret the subgroup-wise PFI and PDP to gain novel insight about the circumstances given which variables are relevant or have a certain effect on the prediction (Sects. 4.2 and 4.3).

Fig. 1 Features \(X_2 \sim U(0,1)\) and \(X_1 \sim N(0, 1)\), if \(X_2<0.5\), else \(X_1 \sim N(4,4)\) (black dots). Top left: The crosses are permutations of \(X_1\). For \(X_2<0.5\), the permutation extrapolates. Bottom left: Marginal density of \(X_1\). Top right: Permuting \(X_1\) within subgroups based on \(X_2\) (\(X_2<0.5\) and \(X_2\ge 0.5\)) reduces extrapolation. Bottom right: Densities of \(X_1\) conditional on the subgroups

4.1 Learning conditional subgroups

In order to learn the grouping \(G_j\), any algorithm can be used that splits the data in \(X_{-j}\) such that the distribution of \(X_j\) becomes more homogeneous within a group and more heterogeneous between groups. We consider decision tree algorithms for this task, which predict \(X_j\) based on splits in \(X_{-j}\). Decision tree algorithms directly or indirectly optimize the splits for heterogeneity of some aspect of the distribution of \(X_j\) across the resulting nodes. Each partition of a decision tree can be described by the decision rules that lead to its terminal leaf. We leverage this partitioning to construct groups \({\mathcal {G}}^1_j, \dots , {\mathcal {G}}^K_j\), encoded by a random variable \(G_j\), for a specific feature \(X_j\). The new variable can be calculated by assigning every observation the indicator of the partition that it lies in (meaning that for an observation i with \(x_{-j}^{(i)} \in {\mathcal {G}}^k_j\), the group variable’s value is defined as \(g_j^{(i)}:=k\)).
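
A minimal R sketch of this step, assuming CART via the rpart package (transformation trees would be used analogously); the helper names learn_subgroups and assign_subgroups, as well as the argument defaults, are illustrative and not the authors' implementation.

```r
library(rpart)
library(partykit)

# Sketch: learn subgroups for feature X_j by predicting it from the remaining
# features with CART. `train` is a data.frame, `xj` the name of feature X_j;
# minbucket = 30 mirrors the minimum node size used in the paper, and
# maxdepth controls the interpretability/fidelity trade-off (Sect. 4.1.1).
learn_subgroups <- function(train, xj, maxdepth = 2, minbucket = 30) {
  form <- reformulate(setdiff(names(train), xj), response = xj)
  tree <- rpart(form, data = train,
                control = rpart.control(maxdepth = maxdepth,
                                        minbucket = minbucket, cp = 0))
  as.party(tree)  # party representation exposes terminal node ids and rules
}

# Assign each (possibly new) observation to its subgroup G_j^k.
assign_subgroups <- function(tree, data) {
  predict(tree, newdata = data, type = "node")
}
```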

Transformation trees (trtr) (Hothorn and Zeileis 2017) are able to model the conditional distribution of a variable. This approach partitions the feature space so that the distribution of the target (here \(X_j\)) within the resulting subgroups \({\mathcal {G}}^k_j\) is homogeneous, which means that the group-wise parameterization of the modeled distribution is independent of \(X_{-j}\). Transformation trees directly model the target’s distribution \(\mathbb {P}(X_j \le x) = F_Z(h(x))\), where \(F_Z\) is the chosen (cumulative) distribution function and h a monotone increasing transformation function (hence the name transformation trees). The transformation function is defined as \({\textbf{a}}(y)^T \varvec{\theta }\) where \({\textbf{a}}:\mathbb {R}\mapsto \mathbb {R}^k\) is a basis function of polynomials or splines. The task of estimating the distribution is reduced to estimating \(\varvec{\theta }\), and the trees are split based on hypothesis tests for differences in \(\varvec{\theta }\) given \(X_{-j}\), and therefore differences in the distribution of \(X_j\). For more detailed explanations of transformation trees please refer to Hothorn and Zeileis (2017).

In contrast, a simpler approach would be to use classification and regression trees (CART) (Breiman et al. 1984), which, for regression, minimize the variance within nodes, effectively finding partitions with different means of the distribution of \(X_j\). However, CART’s split criterion only considers differences in the expectation \(\mathbb {E}[X_j|X_{-j}]\). This means CART can only render \(X_j\) and \(X_{-j}\) independent if the distribution of \(X_j\) depends on \(X_{-j}\) only through its expectation (and if the dependence can be modeled by partitioning the data). Any differences in higher moments of the distribution of \(X_j\), such as the variance of \(X_j | X_{-j}\), cannot be detected.

We evaluated both trtr, which are theoretically well equipped for splitting distributions, and CART, which is established and well studied. For the remainder of this paper, we set the minimum number of observations per node to 30 for both approaches. For the transformation trees, we used the Normal distribution as the target distribution and Bernstein polynomials of degree five for the transformation function. Higher-order polynomials do not seem to increase the model fit further (Hothorn 2018).

We denote the subgroups by \({\mathcal {G}}^k_j \subset \mathbb {R}^{p-1}\), where \(k \in \{1,\ldots ,K_j\}\) indexes the subgroups for feature j, with \(K_j\) groups in total for the j-th feature. The subgroups per feature are disjoint, \({\mathcal {G}}^l_j \cap {\mathcal {G}}^k_j = \emptyset , \forall l \ne k\), and cover the feature space, \(\bigcup _{k=1}^{K_j} {\mathcal {G}}^k_j = \mathbb {R}^{p-1}\). Let \(({\textbf{y}}^k_j, {\textbf{x}}^k_j)\) denote the subset of \(({\textbf{y}}, {\textbf{x}})\) belonging to the subgroup \({\mathcal {G}}^k_j\). Each subgroup can be described by the decision path that leads to the respective terminal node.

4.1.1 Remarks

Continuous dependencies For conditional independence \(X_j \perp X_{-j} | G_j^k\) to hold, the chosen decision tree approach has to capture the (potentially complex) dependencies between \(X_j\) and \(X_{-j}\). CART can only capture differences in the expected value of \(X_j|X_{-j}\) and is insensitive to changes in, for example, the variance. Transformation trees are in principle agnostic to the specified distribution, and the default transformation family of distributions is very general, as empirical results suggest (Hothorn and Zeileis 2017). However, the approach is based on the assumption that the dependence can be modeled with a discrete grouping. In the case of linear Gaussian dependencies, for example, the corresponding optimal conditioning variable would be linear Gaussian itself, which conflicts with our proposed interpretable grouping approach. Even in these settings, the approach allows an approximation of the conditional distribution: partitioning the feature space still reduces extrapolation, but we never get rid of it completely unless only individual data points are left in each partition, see Fig. 2.

Fig. 2 Left: Simulation of features \(X_1 \sim N(0,1)\) and \(X_2 \sim N(0,1)\) with a covariance of 0.9. Middle: Unconditional permutation extrapolates strongly. Right: Permuting on partitions found by CART (predicting \(X_2\) from \(X_1\)) has greatly reduced extrapolation, but cannot get rid of it completely. \(x_1\) and \(x_2\) remain correlated in the partitions

Sparse subgroups Fewer subgroups are generally desirable for two reasons: (1) for a good approximation of the marginal distribution within a subgroup, a sufficient number of observations per group is required, which might lead to fewer subgroups, and (2) a large number of subgroups leads to more complex groups, which reduces their human-intelligibility and therefore forfeits the added value of the local, subgroup-wise interpretations. As we rely on decision trees, we can adjust the granularity of the grouping using hyperparameters such as the maximum tree depth. By controlling the maximum tree depth, we can control the trade-off between the depth of the tree (and hence its interpretability) and the homogeneity of the distribution within the subgroups.

4.2 Conditional subgroup permutation feature importance (cs-PFI)

We estimate the cs-PFI of feature \(X_j\) within a subgroup \({\mathcal {G}}^k_j\) as:

$$\begin{aligned} PFI_j^k = \frac{1}{n_k}\sum _{i: {\textbf{x}}^{(i)}\in {\mathcal {G}}^k_j} \left( \frac{1}{M}\sum _{m=1}^M L\left( y^{(i)}, {\hat{f}}\left( {\tilde{x}}^{(i)}_{j,m}, {\textbf{x}}^{(i)}_{-j}\right) \right) - L\left( y^{(i)}, {\hat{f}}\left( {\textbf{x}}^{(i)}\right) \right) \right) , \nonumber \\ \end{aligned}$$
(5)

where \({\tilde{x}}^{(i)}_{j,m}\) refers to a feature value obtained from the m-th permutation of \(x_j\) within the subgroup \({\mathcal {G}}^k_j\). This estimation is exactly the same as for the marginal PFI [Eq. (2)], except that it only includes observations from the given subgroup. Algorithm 1 describes the estimation of the cs-PFIs for a given feature on unseen data.

Algorithm 1 Estimation of the cs-PFIs for a given feature on unseen data

The algorithm has two outcomes: we get local importance values for feature \(X_j\) for each subgroup (\(\text {cs-PFI}^k_j\); Algorithm 1, line 8) and a global conditional feature importance (\(\text {cs-PFI}_j\); Algorithm 1, line 9). The latter is equivalent to the average of the subgroup importances weighted by the number of observations within each subgroup (see proof in “Appendix A”):

$$\begin{aligned} \text {cs-PFI}_j = \frac{1}{n}\sum _{k=1}^{K_j} n_k PFI^k_j \end{aligned}$$

The cs-PFIs need the same number of model evaluations as the PFI (O(nM)). On top of that comes the cost of training the respective decision trees and making predictions to assign a subgroup to each observation.
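
The following R sketch puts Eq. (5) and the weighted aggregation together, reusing the illustrative helpers pfi_marginal() and assign_subgroups() from the earlier sketches; it is a simplified reading of Algorithm 1, not the authors' code.

```r
# Sketch of cs-PFI estimation on test data: permute X_j only within subgroups.
# Reuses the illustrative helpers pfi_marginal() and assign_subgroups().
cs_pfi <- function(model, subgroup_tree, X_test, y_test, feature, M = 5,
                   loss = function(y, p) (y - p)^2) {
  groups <- factor(assign_subgroups(subgroup_tree, X_test))
  per_group <- tapply(seq_len(nrow(X_test)), groups, function(idx) {
    pfi_marginal(model, X_test[idx, , drop = FALSE], y_test[idx],
                 feature, M = M, loss = loss)       # PFI_j^k as in Eq. (5)
  })
  n_k <- table(groups)
  global <- sum(n_k * per_group) / sum(n_k)         # weighted average
  list(per_subgroup = per_group, global = global)
}
```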

Theorem 1

When feature \(X_j\) is independent of features \(X_{-j}\) for a given dataset \({\mathcal {D}}\), each \(\text {cs-PFI}_j^k\) has the same expectation as the marginal PFI, and an \(n/n_k\)-times larger variance, where n and \(n_k\) are the number of observations in the data and the subgroup \({\mathcal {G}}^k_j\).

The proof of Theorem 1 is shown in “Appendix B”. Theorem 1 has the practical implication that even when cs-PFI is applied to an independent feature, we retrieve the marginal PFI and do not introduce any problematic interpretations. Equivalence in expectation and higher variance under independence of \(X_j\) and \(X_{-j}\) hold true even if the partitions \({\mathcal {G}}_j^k\) were chosen randomly. Theorem 1 has further consequences regarding overfitting: assuming a node has already reached independence between \(X_j\) and \(X_{-j}\), further splitting the tree based on noise will not change the expected cs-PFIs.

4.3 Conditional subgroup partial dependence plots (cs-PDPs)

A range of work argues that PDPs are not suitable for inference when features are dependent (Hooker and Mentch 2019; Freiesleben et al. 2022). Conditional PDPs have been suggested as an alternative, but they are difficult to estimate, since they require sampling from the multivariate conditional distribution of the remaining features, \(P(X_{-j}|X_j)\). For settings where a human-intelligible partitioning can be learned, we suggest an alternative that does not require sampling from \(P(X_{-j}|X_j)\): instead of computing the global cPDP, we compute the \(\text {cs-PDP}_j^k\) for each subgroup \({\mathcal {G}}^k_j\) using the marginal PDP formula in Eq. (4).

$$\begin{aligned} \text {cs-PDP}^k_j (x) = \frac{1}{n_k}\sum _{i:{\textbf{x}}^{(i)}\in {\mathcal {G}}^k_j} {\hat{f}}\left( x, {\textbf{x}}^{(i)}_{-j}\right) \end{aligned}$$

This results in multiple cs-PDPs per feature, which can be displayed together in the same plot, as in Fig. 9. The cs-PDPs allow interesting insights into data and model. First, since they do not extrapolate, they describe how the prediction and the feature of interest covary within specific groups of the data. Second, in contrast to the global cPDP, they provide insight into the model: for the global cPDP, even features that are not used by the model can have nonzero effects (as illustrated in Fig. 3). Our proposed cs-PDPs only show a nonzero effect if the respective feature actually influences the prediction.
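
A short R sketch of the cs-PDPs, again reusing the illustrative helpers from the earlier sketches; the output is one PDP estimate per subgroup, evaluated on a common grid.

```r
# Sketch: one marginal PDP per subgroup, reusing the illustrative helpers
# pdp_marginal() and assign_subgroups() from the earlier sketches.
cs_pdp <- function(model, subgroup_tree, X, feature,
                   grid = sort(unique(X[[feature]]))) {
  groups <- factor(assign_subgroups(subgroup_tree, X))
  lapply(split(X, groups), function(X_k) {
    data.frame(x = grid, pdp = pdp_marginal(model, X_k, feature, grid = grid))
  })
}
```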

4.3.1 Plotting the cs-PDP

The cs-PDP can be plotted in the same way as the PDP, with the exception that we get multiple effect curves instead of just one. For a more compact view, we propose to plot all cs-PDPs into the same plot. In addition, we suggest plotting the PDPs similar to boxplots, where the dense center quartiles are indicated with a bold line (see Fig. 4). By emphasizing the data density within the subgroups, the user can immediately see where to trust the plot more and where less. We restrict each \(\text {cs-PDP}^k_j\) to the interval \([min({\varvec{x}}_j),max({\varvec{x}}_j)]\), with \({\varvec{x}}_j = (x_j^{(1)}, \ldots , x_j^{(n_j^k)})\) denoting the feature values observed in the subgroup.
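
The boxplot-like emphasis can be reproduced with base R graphics roughly as follows. The sketch assumes that curves is the output of the cs_pdp() helper above and that xj_by_group holds the observed \(X_j\) values per subgroup; it only implements the restriction to the observed range and the bold central quartiles, not the whisker capping and outlier points of Fig. 4.

```r
# Sketch: draw each cs-PDP only over the X_j range observed in its subgroup
# and thicken the curve between the subgroup's 25% and 75% quantiles.
# `curves` and `xj_by_group` are illustrative inputs (see lead-in).
plot_cs_pdp <- function(curves, xj_by_group, xlab = "x_j") {
  plot(NULL, xlim = range(unlist(xj_by_group)),
       ylim = range(unlist(lapply(curves, `[[`, "pdp"))),
       xlab = xlab, ylab = "Average prediction")
  for (k in seq_along(curves)) {
    q <- quantile(xj_by_group[[k]], c(0, 0.25, 0.75, 1))
    cur <- curves[[k]][curves[[k]]$x >= q[1] & curves[[k]]$x <= q[4], ]
    lines(cur$x, cur$pdp, col = k, lwd = 1)        # full subgroup range
    core <- cur[cur$x >= q[2] & cur$x <= q[3], ]
    lines(core$x, core$pdp, col = k, lwd = 3)      # dense center quartiles
  }
}
```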

Fig. 3 We simulated a linear model \(y = x_1 + \epsilon \) with \(\epsilon \sim N(0,1)\) and an additional feature \(X_2\) which is correlated with \(X_1\) (\(\approx 0.72\)). The conditional PDP (left) gives the false impression that \(X_2\) has an influence on the target. The cs-PDPs help in this regard, as the effects due to \(X_1\) (changes in intercept) are clearly separated from the effect that \(X_2\) has on the target (slope of the cs-PDPs), which is zero. Unlike the marginal PDP, the cs-PDPs reveal that for increasing \(X_2\) we expect the prediction to increase due to the correlation between \(X_1\) and \(X_2\)

Analogously to the cs-PFI, the subgroup PDPs approximate the marginal PDP when the features are independent.

Theorem 2

When feature \(X_j\) is independent of features \(X_{-j}\) for a given dataset \({\mathcal {D}}\), each \(\text {cs-PDP}_j^k\) has the same expectation as the marginal PDP, and an \(n/n_k\)-times larger variance, where n and \(n_k\) are the number of observations in the data and the subgroup \({\mathcal {G}}^k_j\).

The proof of Theorem 2 is shown in “Appendix C”. Theorem 2 has the same practical implications as Theorem 1: even if the features are independent, we will, in expectation, obtain the marginal PDPs. And when trees are grown deeper than needed, the cs-PDPs will, in expectation, yield the same curve.

Both the PDP and the set of cs-PDPs need O(nM) evaluations, since \(\sum _{k=1}^{K_j} n_k = n\) (and worst case \(O(n^2)\) if evaluated at each \(x_{j}^{(i)}\) value). Again, there is an additional cost for training the respective decision trees and making predictions.

Fig. 4 Left: Marginal PDP. Bottom right: Boxplot showing the distribution of feature X. Top right: PDP with boxplot-like emphasis. In the x-range, the PDP is drawn from \(\pm 1.58 \cdot IQR/\sqrt{n}\), where IQR is the range between the \(25\%\) and \(75\%\) quantiles. If this range exceeds \([min(x_j),max(x_j)]\), the PDP is capped. Outliers are drawn as points. The PDP is bold between the \(25\%\) and \(75\%\) quantiles

5 Experiments

Since no ground truth values for cPFI and cPDP are available for real data sets, we targeted a diverse set of metrics in our experiments:

  • Conditional PFI Ground Truth Simulation: With this simulated experiment, we compared various cPFI methods. Since the data were simulated, we could compute the ground truth cPFI and benchmark all methods accordingly.

  • Data fidelity evaluation: This experiment used real data sets to analyze how well the different perturbation methods that underpin the various cPDP/cPFI approaches avoid extrapolation.

  • Model fidelity: This experiment evaluated how close the cPDP curves are to the actual model predictions.

5.1 Training conditional sampling approaches

To ensure that the sampling approaches do not overfit, we suggest separating training and sampling, where training covers all estimation steps that involve data. For this purpose, we refer to the training data as \({\mathcal {D}}_{train}\) and to the data for importance computation as \({\mathcal {D}}_{test}\). This section both describes how we compared the sampling approaches in the following sections and serves as a general recommendation for how to use them.

For our cs-permutation, we trained the CART / transformation trees on \({\mathcal {D}}_{train}\) and permuted \(X_j\) of \({\mathcal {D}}_{test}\) within the terminal nodes of the tree. For CVIRF (Strobl et al. 2008; Debeer and Strobl 2020), which is specific to random forests, we trained the random forest on \({\mathcal {D}}_{train}\) to predict the target y and permuted \(X_j\) of \({\mathcal {D}}_{test}\) within the terminal nodes. For Model-X knockoffs (Candes et al. 2018), we fitted the second-order knockoffs on \({\mathcal {D}}_{train}\) and replaced \(X_j\) in \({\mathcal {D}}_{test}\) with its knockoffs. For the imputation approach (Fisher et al. 2019), we trained a random forest on \({\mathcal {D}}_{train}\) to predict \(X_j\) from \(X_{-j}\), and replaced values of \(X_j\) in \({\mathcal {D}}_{test}\) with their random forest predictions plus a random residual. For the interval-based sampling (Apley and Zhu 2016), we computed quantiles of \(X_j\) using \({\mathcal {D}}_{train}\) and perturbed \(X_j\) in \({\mathcal {D}}_{test}\) by moving each observation once to the left and once to the right border of the respective intervals. The marginal permutation (PFI, PDP) required no training, we permuted (i.e., shuffled) the feature \(X_j\) in \({\mathcal {D}}_{test}\).

5.2 Conditional PFI ground truth simulation

We compared our cs-PFI approach using CART (tree cart) and transformation trees (tree trtr), CVIRF (Strobl et al. 2008; Debeer and Strobl 2020), Model-X knockoffs (ko) (Candes et al. 2018) and the imputation approach (impute rf) (Fisher et al. 2019) in ground truth simulations. We simulated the following data generating process: \(y^{(i)} = f({\textbf{x}}^{(i)}) = {\textbf{x}}^{(i)}_1 \cdot {\textbf{x}}^{(i)}_2 + \sum _{j=1}^{10} x_j^{(i)} + \epsilon ^{(i)}\), where \(\epsilon ^{(i)} \sim N(0, \sigma _{\epsilon })\). All features, except feature \(X_1\) followed a Gaussian distribution: \(X_j \sim N(0, 1)\). Feature \(X_1\) was simulated as a function of the other features plus noise: \(x_1^{(i)} = h(x_{-1}^{(i)}) + \epsilon _x\). We simulated the following scenarios by changing h and \(\epsilon _x\):

  • In the independent scenario, \(X_1\) did not depend on any feature: \(h({\textbf{x}}^{(i)}_{-1}) = 0\), \(\epsilon _x \sim N(0,1)\). This scenario served as a test how the different conditional PFI approaches handle the edge case of independence.

  • The linear scenario introduces a strong correlation of \(X_1\) with feature \(X_2\): \(h({\textbf{x}}^{(i)}_{-1}) = {\textbf{x}}^{(i)}_2\), \(\epsilon _x \sim N(0,1)\).

  • In the non-linear scenario, we simulated \(X_1\) as a non-linear function of multiple features: \(h({\textbf{x}}^{(i)}_{-1}) = 3 \cdot \mathbb {1}({\textbf{x}}^{(i)}_2> 0) - 3 \cdot \mathbb {1}({\textbf{x}}^{(i)}_2 \le 0) \cdot \mathbb {1}({\textbf{x}}^{(i)}_3 > 0)\). Here, the variance of \(\epsilon _x \sim N(0, \sigma _x)\) is also a function of x: \(\sigma _x({\textbf{x}}^{(i)}) = \mathbb {1}({\textbf{x}}^{(i)}_2> 0) + 2 \cdot \mathbb {1}({\textbf{x}}^{(i)}_2 \le 0) \cdot \mathbb {1}({\textbf{x}}^{(i)}_3 > 0) + 5 \cdot \mathbb {1}({\textbf{x}}^{(i)}_2 \le 0) \cdot \mathbb {1}({\textbf{x}}^{(i)}_3 \le 0)\).

  • For the multiple linear dependencies scenario, we chose \(X_1\) to depend on many features: \(h({\textbf{x}}^{(i)}_{-1}) = \sum _{j=2}^{10} x_{j}^{(i)}\), \(\epsilon _x \sim N(0,5)\).

For each scenario, we varied the number of sampled data points \(n \in \{300, 3000\}\) and the number of features \(p \in \{9, 90\}\). To “train” each of the cPFI methods, we used \(2/3 \cdot n\) (200 or 2000) data points and the rest (100/1000) to compute the cPFI. The experiment was repeated 1000 times. We examined two settings.

  • In setting (I), we assumed that the model recovered the true model \({\hat{f}}= f\).

  • In setting (II), we trained a random forest with 100 trees (Breiman 2001).

In both settings, the true conditional distribution of \(X_1\) given the remaining features is known (the function h and the error distribution are known). Therefore we can compute the ground truth conditional PFI, as defined in Eq. (2), by replacing \({\hat{f}}\) with f. We generated the samples of \(X_1\) according to h to obtain the \({\tilde{X}}_1\) values and computed the increase in loss. The conditional PFIs differed between settings (I) and (II), since in (I) we used the true f and in (II) the trained random forest \({\hat{f}}\).
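
For illustration, the linear scenario and its ground-truth conditional PFI in setting (I) can be sketched in R as follows; the sample size, the seed, and the use of ten features are illustrative simplifications of the setup described above.

```r
# Sketch of the linear scenario (setting I, known model f).
# Illustrative choices: n = 3000, ten features, seed 1.
set.seed(1)
n <- 3000
X <- as.data.frame(matrix(rnorm(n * 9), ncol = 9))
names(X) <- paste0("x", 2:10)
X$x1 <- X$x2 + rnorm(n)                       # h(x_-1) = x_2, eps_x ~ N(0, 1)
f <- function(X) X$x1 * X$x2 + rowSums(X[, paste0("x", 1:10)])
y <- f(X) + rnorm(n)

# Ground truth cPFI: replace x1 by a fresh draw from its true conditional
# distribution X_1 | X_-1 ~ N(x_2, 1) and measure the increase in loss.
X_tilde <- X
X_tilde$x1 <- X$x2 + rnorm(n)
cpfi_truth <- mean((y - f(X_tilde))^2) - mean((y - f(X))^2)
```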

5.2.1 Conditional PFI ground truth results

For setting (I), the mean squared errors between the estimated conditional PFIs and the ground truth are displayed in Table 2, and the distributions of conditional PFI estimates in Fig. 5. In the independent scenario, where conditional and marginal PFI are equal, all methods performed equally well, except in the low n, high p scenario, where the knockoffs sometimes failed. As expected, the variance was higher for all methods when \(n=300\). In the linear scenario, the marginal PFI was clearly different from the conditional PFI. There was no clear best-performing conditional PFI approach, as the results differed depending on training size n and number of features p. For low n and low p, knockoffs performed best. For high p, regardless of n, the cs-permutation approaches worked best, which might be due to the feature selection mechanism inherent to trees. The multiple linear dependencies scenario was the only scenario in which the cs-PFI approach was consistently outperformed by the other methods. Decision trees already need multiple splits to recover a single linear relationship, and in this scenario, multiple linear relationships had to be recovered. Imputation with random forests worked well when multiple linear dependencies were present. For knockoffs, the results were mixed. As expected, the cs-PFI approach worked well in the non-linear scenario and outperformed all other approaches. Knockoffs and imputation with random forests both overestimated the conditional PFI (except for knockoffs with \(n=300\) and \(p=90\)). In addition to this bias, they had a larger variance compared to the cs-PFI approaches.

Generally, the transformation trees performed on par with or outperformed CART across all scenarios, except for the multiple linear dependencies scenario. Our cs-PFI approaches worked well in all scenarios, except when multiple (linear) dependencies were present. Even for a single linear dependence, the cs-PFI approaches were on par with knockoffs and imputation, and they clearly outperformed both when the relationship was more complex.

Table 2 MSE comparing estimated and true conditional PFI (scenario I)
Fig. 5 Setting (I) comparing various conditional PFI approaches on the true model against the true conditional PFI (horizontal line) based on the data generating process

In setting (II), a random forest was analyzed, which allowed us to include the conditional variable importance for random forests (CVIRF) by Strobl et al. (2008) and Debeer and Strobl (2020) in the benchmark. The MSEs are displayed in “Appendix D”, Table 6, and the distributions of conditional PFI estimates in “Appendix D”, Fig. 11. The results for all other approaches are comparable to setting (I). For the low n settings, CVIRF worked as well as the other approaches in the independent scenario. It outperformed the other approaches in the linear scenario and the multiple linear dependencies scenario (when n was small). However, CVIRF consistently underestimated the conditional PFI in all scenarios with high n, even in the independent scenario. Therefore, we would recommend analyzing the conditional PFI of random forests using cs-PFI for lower-dimensional dependence structures, and using imputation for multiple (linear) dependencies.

5.3 Trading interpretability for accuracy

In an additional experiment, we examined the trade-off between the depth of the trees and the accuracy with which the true conditional PFI is recovered. For scenario (I), we trained decision trees with different maximal depths (from 1 to 10) and analyzed how the resulting number of subgroups influenced the conditional PFI estimate. The experiment was repeated 1000 times. The deeper the trees, the better the true conditional PFI was approximated. Also, no overfitting occurred, which is in line with the theoretical considerations in Theorem 1. See “Appendix E” for detailed results.

5.4 Data fidelity evaluation

PDP and PFI work by data intervention, prediction, and subsequent aggregation (Scholbeck et al. 2019). Based on data \({\mathcal {D}}\), the intervention creates a new data set. In order to compare different conditional sampling approaches, we define a measure of data fidelity to quantify the ability to preserve the joint distribution under intervention. Failing to preserve the joint distribution leads to extrapolation when features are dependent. Model-X knockoffs, for example, are directly motivated by preserving the joint distribution, while others, such as accumulated local effect plots do so more implicitly.

Data fidelity is the degree to which a sample \({\tilde{X}}_j\) of feature \(X_j\) preserves the joint distribution, that is, the degree to which \(({\tilde{X}}_j,X_{-j}) \sim (X_j, X_{-j})\). In theory, any measure that compares two multivariate distributions can be used to compute the data fidelity. In practice, however, the joint distribution is unknown, which makes measures such as the Kullback-Leibler divergence impractical. Instead, we are dealing with two samples: one data set without and one with the intervention.

In this classic two-sample test scenario, the maximum mean discrepancy (MMD) can be used to test whether two samples come from the same distribution (Fortet and Mourier 1953; Gretton et al. 2007, 2012; Smola et al. 2007). The empirical MMD is defined as:

$$\begin{aligned} \text {MMD}({\mathcal {D}}, {\tilde{{\mathcal {D}}}}) = \frac{1}{n^2}\sum _{x,z\in {\mathcal {D}}} k(x, z) - \frac{2}{nl}\sum _{x\in {\mathcal {D}}, z\in {\tilde{{\mathcal {D}}}}}k(x,z) + \frac{1}{l^2}\sum _{x,z\in {\tilde{{\mathcal {D}}}}} k(x,z) \end{aligned}$$
(6)

where \({\mathcal {D}}= \{(x^{(i)}_j,x_{-j}^{(i)})\}_{i=1}^n\) is the original data set and \({\tilde{{\mathcal {D}}}} = \{({\tilde{x}}^{(i)}_j,x_{-j}^{(i)})\}_{i=1}^l\) a data set with perturbed \(x_{j}^{(i)}\). For both data sets, we scaled numerical features to a mean of zero and a standard deviation of one. For the kernel k, we used the radial basis function kernel in all experiments. For the parameter \(\sigma \) of the radial basis function kernel, we chose the median L2-distance between data points, which is a common heuristic (Gretton et al. 2012). We measure data fidelity as the negative logarithm of the MMD (\(-\log (\text {MMD})\)) to obtain a more condensed scale where larger values are better.

Definition 1

(MMD-based Data Fidelity) Let \({\mathcal {D}}\) be a dataset, and \({\tilde{{\mathcal {D}}}}\) be another dataset from the same distribution, but with an additional intervention. We define the data fidelity as: \(\text {Data Fidelity} = -\log (\text {MMD}({\mathcal {D}}, {\tilde{{\mathcal {D}}}}))\).
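
A compact R sketch of this measure; the function name data_fidelity is illustrative, the inputs are assumed to be already-scaled numeric matrices, and the kernel parameterization \(k(x,z)=\exp (-\Vert x-z\Vert ^2/(2\sigma ^2))\) with the median heuristic for \(\sigma \) is one common convention.

```r
# Sketch of the MMD-based data fidelity in Eq. (6) and Definition 1.
# Assumptions (illustrative): D and D_tilde are already-scaled numeric matrices
# with identical columns; RBF kernel with the median-distance heuristic.
data_fidelity <- function(D, D_tilde) {
  Z <- rbind(D, D_tilde)
  dists <- dist(Z)                                 # pairwise L2 distances
  sigma <- median(as.numeric(dists))               # median heuristic for sigma
  K <- exp(-as.matrix(dists)^2 / (2 * sigma^2))    # RBF kernel matrix
  n <- nrow(D); l <- nrow(D_tilde)
  i1 <- seq_len(n); i2 <- n + seq_len(l)
  mmd <- mean(K[i1, i1]) - 2 * mean(K[i1, i2]) + mean(K[i2, i2])
  -log(mmd)                                        # data fidelity = -log(MMD)
}
```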

We evaluated how different sampling strategies (see Table 1) affect the data fidelity measure for numerous data sets of the OpenML-CC18 benchmarking suite (Bischl et al. 2019). We removed all data sets with 7 or fewer features and data sets with more than 500 features. See “Appendix F” for an overview of the remaining data sets. For each data set, we removed all categorical features from the analysis, as the underlying sampling strategies of ALE plots and Model-X knockoffs are not well equipped to handle them. We were foremost interested in two questions:

  1. (A) How does cs-permutation compare with other sampling strategies w.r.t. data fidelity?

  2. (B) How do choices of tree algorithm (CART vs. transformation trees) and tree depth parameter affect data fidelity?

In each experiment, we selected a data set, randomly sampled a feature and computed the data fidelity of various sampling strategies as described in the pseudo-code in Algorithm 2.

Algorithm 2 Data fidelity evaluation of the sampling strategies

For an unbiased evaluation, we split the data into three pieces: \({\mathcal {D}}_{train}\) (40% of rows), \({\mathcal {D}}_{test}\) (30% of rows) and \({\mathcal {D}}_{ref}\) (30% of rows). We used \({\mathcal {D}}_{train}\) to “train” each sampling method (e.g., train decision trees for cs-permutation, see Sect. 5.1). We used \({\mathcal {D}}_{ref}\), which we left unchanged, and \({\mathcal {D}}_{test}\), in which the chosen feature was perturbed, to estimate the data fidelity. For each data set, we chose 10 features at random to which sampling was applied. Marginal permutation (which ignores the joint distribution) and “no perturbation” served as lower and upper bounds for data fidelity. For CVIRF, we only used one tree per random forest, as we only compared the general perturbation strategy, which is the same for each tree.

We repeated all experiments 30 times with different random seeds and therefore different data splits. All in all, this produced 12,210 results (42 data sets \(\times \) (up to) 10 features \(\times \) 30 repetitions) per sampling method. All results are shown in detail in “Appendix F” (Figs. 13, 14, 15, and 16).

Since the experiments are repeated across the same data sets and the same features, the data fidelity results are not independent. Therefore, we used a random intercept model (Bryk and Raudenbush 1992) to analyze the differences in data fidelity between the sampling approaches. The target variable of the random intercept model was the MMD, the explanatory variable was the perturbation method, and we used a (nested) random intercept per data set and feature. So, informally: \(MMD \sim \text {perturbation method } + (1 |\text { dataset}/\text {feature})\).

We chose “Marginal Permutation” as the reference category. We fitted two random intercept models: One to compare cs-permutation with fully-grown trees (CART, trtr) with other sampling methods and another one to compare different tree depths.
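In R, such a model could be fitted with the lme4 package roughly as follows; the data.frame results with columns MMD, method, dataset and feature is an illustrative assumption, as is the handling of the reference level.

```r
library(lme4)

# Sketch of the random intercept model; `results` is an assumed data.frame
# with one row per (data set, feature, repetition, perturbation method) and
# columns MMD, method, dataset, feature.
results$method <- relevel(factor(results$method), ref = "Marginal Permutation")
fit <- lmer(MMD ~ method + (1 | dataset/feature), data = results)
summary(fit)  # fixed-effect coefficients are estimates relative to the reference
```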

5.4.1 Results (A) state-of-the-art comparison

Figure 6 shows the effect estimates of the different sampling approaches modeled with a random intercept model. The results show that cs-permutation performed better than all other methods. Model-X knockoffs and the imputation approach (with random forests) came in second place and outperformed ALE and CVIRF. Knockoffs were proposed to preserve the joint distribution, but are based on a multivariate Gaussian distribution, which seems to be too restrictive for the data sets in our experiments. CVIRF does not have a much higher data fidelity than marginal permutation. However, results for CVIRF must be viewed with caution, since data fidelity regards all features equally, regardless of their impact on the model prediction. For example, a feature can be highly correlated with the feature of interest, but might not be used in the random forest. A more informative experiment for comparing CVIRF can be found in Sect. 5.2. Figures 13 and 14 in “Appendix F” show the individual data fidelity results for the OpenML-CC18 data sets. Not perturbing the feature at all has the highest data fidelity and serves as the upper bound. The marginal permutation serves as the lower baseline. For most data sets, cs-permutation has a higher data fidelity than all other sampling approaches. For each of the other methods there is at least one data set on which it reaches a low data fidelity (e.g., “semeion”, “qsar-biodeg” for ALE; “nodel-simulation”, “churn” for imputation; “jm1”, “pc1” for knockoffs). In contrast, cs-permutation achieves a consistently high data fidelity on all these data sets.

Additionally, we review the data fidelity rankings of the sampling methods in Table 3. The table shows the average ranking of each method according to the MMD. First, we computed the rank of each perturbation method per data set, feature and repetition, with rank 1 being the best (lowest MMD). This allows another view on the performance of the perturbation methods. The rankings show a similar picture as the random intercept model estimates, except that Model-X knockoffs have a better average ranking than imputation. This could be because on a few data sets (bank-marketing, electricity; see Fig. 13 in “Appendix F”) Model-X knockoffs have a very low data fidelity, but on most others a higher data fidelity than the imputation method.

Fig. 6 Linear regression model coefficients and 95% confidence intervals for the effect of different sampling approaches on data fidelity, with (nested) random effects per data set and feature. A Comparing different sampling approaches. No perturbation (“none”) and permutation (“perm”) serve as upper and lower bounds. B Comparing cs-permutation using either CART or transformation trees and different tree depths (1, 2, 3, 4, 5 and 30). Marginal permutation is the reference category and is therefore at \(x=0\); all other perturbation method estimates are relative to this reference

Table 3 Mean ranks and their standard deviation based on data fidelity of various perturbation methods over data sets, features and repetitions

5.4.2 Results (B) tree configuration

We included shallow trees with a maximum depth parameter from 1 to 5 to analyze the trade-off between tree depth and data fidelity. We included trees with a maximum depth parameter of 30 (“fully-grown” trees, as this was the software’s limit) as an upper bound for each decision tree algorithm. Figure 6B shows that the deeper the trees (and the more subgroups), the higher the data fidelity. This is to be expected, since deeper trees allow a more fine-grained separation of distributions. More importantly, we are interested in the trade-off between depth and data fidelity. Even splitting with a maximum depth of only 1 (two subgroups) strongly improves data fidelity over the simple marginal permutation for most data sets. A maximum depth of two yields another large average improvement in data fidelity and already puts cs-permutation on par with knockoffs. A depth of three to four is almost as good as a maximum depth parameter of 30 and already outperforms all other methods, while the trees remain interpretable due to their shortness. CART slightly outperforms transformation trees when trees are shallow, which is surprising, since transformation trees are, in theory, better equipped to handle changes in the distribution. Deeply grown transformation trees (maximum depth of 30) slightly outperform CART. Figures 15 and 16 in “Appendix F” show data fidelity aggregated by data set.

5.5 Model fidelity

Model fidelity has been defined as how well the predictions of an explanation method approximate the ML model (Ribeiro et al. 2016). Similar to Szepannek (2019), we define model fidelity for feature effects as the mean squared error between the model prediction and the prediction of the partial function \(f_j\) (which depends only on feature \(X_j\)) defined by the feature effect method, for example \(f_j(x) = PDP_j(x)\). For a given data instance with observed feature value \(x_{j}^{(i)}\), the predicted outcome of, for example, a PDP is the value on the y-axis of the PDP at the observed \(x_j\) value.

$$\begin{aligned} \text {Model}\_\text {Fidelity}({\hat{f}}, f_j)= \frac{1}{n}\sum _{i=1}^n \left( {\hat{f}}\left( x^{(i)}\right) - f_j\left( x_{j}^{(i)}\right) \right) ^2, \end{aligned}$$
(7)

where \(f_j\) is a feature effect function such as ALE or PDP. For this definition of model fidelity, lower values are more desirable. The better the model fidelity, the closer the effect curve is to the actual model predictions. In order to evaluate ALE plots, they have to be adjusted such that they are on a comparable scale to a PDP (Apley and Zhu 2016): \(f_j^{ALE,adj} = f_j^{ALE} + \frac{1}{n}\sum _{i=1}^n {\hat{f}}(x^{(i)})\).
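
As a sketch, Eq. (7) can be computed as follows for any effect curve stored as grid points; the linear interpolation at observed feature values and the input names are illustrative choices.

```r
# Sketch of the model fidelity measure in Eq. (7).
# Assumptions (illustrative): `effect` is a data.frame with grid points `x`
# and effect values `fj` (e.g., a PDP, cs-PDP or adjusted ALE curve);
# the curve is evaluated at each observed x_j by linear interpolation.
model_fidelity <- function(model, X, feature, effect) {
  f_hat <- predict(model, X)                                     # model predictions
  f_j   <- approx(effect$x, effect$fj, xout = X[[feature]], rule = 2)$y
  mean((f_hat - f_j)^2)
}
```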

We trained random forests (500 trees), linear models, and k-nearest neighbours models (k = 7) on various regression data sets (Table 4). 70% of the data were used to train the ML models and the transformation trees/CARTs. This ensures that the results are not over-confident due to overfitting, see also Sect. 5.1. The remaining 30% of the data were used to evaluate model fidelity. For each model and each data set, we measured the model fidelity between effect prediction and model prediction [Eq. (7)], averaged across observations and features.

Table 4 We selected data sets from OpenML Vanschoren et al. (2014) and Casalicchio et al. (2017) having 1000–8000 instances and a maximum of 50 numerical features

Table 5 shows that the model fidelity of ALE and PDP is similar, while the cs-PDPs have the best model fidelity (lower is better). This is an interesting result, since the decision trees for the cs-PDPs are based neither on the model nor on the real target, but solely on the conditional dependence structure of the features. However, the cs-PDPs have the advantage that we obtain multiple plots. We did not aggregate the plots into a single conditional PDP, but computed the model fidelity for the PDPs within the subgroups (visualized in Fig. 9). Our cs-PDPs using trees with a maximum depth of 2 have a better model fidelity than those using a maximum depth of 1. We limited the analysis to interpretable conditioning and therefore only allowed trees with a maximum depth of 2, since a tree depth of 3 already means up to 8 subgroups, which is an impractical number of PDPs to have in one plot. CART sometimes beats trtr (e.g., on the “satellite” data set), but sometimes trtr has a lower loss (e.g., on the “wind” data set). Using different models (knn or linear model) produced similar results, see “Appendix G”.

Table 5 Mean model fidelity averaged over features in a random forest for various data sets, and the variance across features

6 Application

In the following application, we demonstrate that cs-PDPs and cs-PFIs are valuable tools to understand model and data beyond the insights given by PFI, PDPs, or ALE plots. We trained a random forest to predict daily bike rentals (Dua and Graff 2017) from weather and seasonal information. The data (\(n=731\), \(p=9\)) were divided into 70% training and 30% test data. The features are not independent (see “Appendix H”).

6.1 cs-PDPs and cs-PFI

To construct the subgroups, we used transformation trees with a maximum tree depth of 2, which limited the number of possible subgroups to 4. We chose transformation trees because they are theoretically more sound and do not require the assumption that the conditional distribution of a feature differs only in its mean given the other features.

Fig. 7 Conditional feature importance by increasing maximum depth of the trees

Figure 7 shows that for most features the biggest change in the estimated conditional PFI happens when moving from a maximum depth of 0 (\(=\) marginal PFI) to a depth of 2. This makes a maximum depth of 2 a reasonable trade-off between limiting the number of subgroups and accurately approximating the conditional PFI. We compared the marginal and conditional PFI for the bike rental predictions, see Fig. 8.

Fig. 8 Left: Comparison of PFI and cs-PFI for a selection of features. For cs-PFI we also show the features that constitute the subgroups. Right: Local cs-PFI of temperature within subgroups. The temperature feature is important in spring, fall and winter, but negligible on summer days, especially humid ones

The most important features, according to (marginal) PFI, were temperature and year. For the year feature, the marginal and conditional PFI are the same. Temperature is less important when we condition on season and humidity. The season already holds a lot of information about the temperature, so this is not a surprise. When we know that a day is in summer, it is not as important to know the temperature to make a good prediction. On humid summer days, the PFI of temperature is zero. However, in all other cases, it is important to know the temperature to predict how many bikes will be rented on a given day. The disaggregated cs-PFI in a subgroup can be interpreted as “How important is the temperature, given we know the season and the humidity”.

We compare PDP, ALE and cs-PDP in Fig. 9. Both ALE and PDP show a monotone increase of predicted bike rentals up to a temperature of 25 \(^{\circ }\)C and a decrease beyond that. The PDP shows a weaker negative effect of very high temperatures, which might be caused by extrapolation: high-temperature days are combined with, for example, winter days. A limitation of the ALE plot is that we should only interpret it locally within each interval that was used to construct it. In contrast, our cs-PDP is explicit about the subgroup conditions under which its interpretation is valid and shows the distributions in which the feature effect may be interpreted. The local cs-PDPs in subgroups reveal a more nuanced picture: for humid summer days, the temperature has no effect on the bike rentals, and the average number of rentals is below that of days with similar temperatures in spring, fall and drier summer days. The temperature has a slightly negative effect on the predicted number of bike rentals for dry summer days (humidity below 67.3). The change in intercepts of the local cs-PDPs can be interpreted as the effect of the grouping feature (season). The slope can be interpreted as the temperature effect within a subgroup.

Fig. 9 Effect of temperature on predicted bike rentals. Left: PDP and ALE plot. Right: cs-PDPs for 4 subgroups

We also demonstrate the local cs-PDPs for the season, a categorical feature. Figure 10 shows both the PDP and our local cs-PDPs. The normal PDP shows that, on average, there is no difference between spring, summer and fall, and only slightly fewer bike rentals in winter. The cs-PDPs with four subgroups conditional on temperature show that the marginal PDP is misleading. The PDP indicates that in spring, summer and fall around 4500 bikes are rented, and in winter around 1000 fewer. The cs-PDPs, in contrast, show that, conditional on temperature, the differences between the seasons are much greater, especially for low temperatures. Only at high temperatures is the number of rented bikes similar between seasons.

Fig. 10 Effect of season on predicted rentals. Left: PDP. Right: Local cs-PDPs. The cs-PDPs are conditioned on temperature, for which the tree split at 21.5 and at 9.5

7 Discussion

We proposed cs-PFIs and cs-PDPs, which are variants of PFI and PDP that work when features are dependent. Both cs-PFIs and cs-PDPs rely on permutations within subgroups based on decision trees. The approach is simple: train a decision tree to predict the feature of interest and compute the (marginal) PFI/PDP within each terminal node defined by the decision tree.

Compared to other approaches, cs-PFIs and cs-PDPs enable a human-comprehensible grouping, which carries information about how dependencies affect feature effects and importances. As we showed in various experiments, our methods are on par with or outperform other methods in many dependence settings. We therefore recommend using cs-PDPs and cs-PFIs to analyze feature effects and importances when features are dependent. However, due to their construction with decision trees, cs-PFIs and cs-PDPs do not perform well when the feature of interest depends on many other features; they work best when it depends on only a few. In particular, the interpretability suffers if the tree has to rely on many features. We recommend analyzing the dependence structure beforehand, using the imputation approach with random forests in the case of multiple dependencies, and cs-PFIs in all other cases.

Our framework is flexible regarding the choice of partitioning, and we leave the evaluation of the rich selection of possible decision tree and decision rule approaches to future research.

Reproducibility All experiments were conducted using mlr (Lang et al. 2019) and R (R Core Team 2017). We used the iml package (Molnar et al. 2018) for ALE and PDP, party/partykit (Hothorn and Zeileis 2015) for CVIRF and knockoff (Patterson and Sesia 2020) for Model-X knockoffs. The code for all experiments is available at https://github.com/christophM/paper_conditional_subgroups.