1 Introduction

Complex machine learning (ML) models are extensively applied to solve supervised learning problems in many different fields and settings: cancer prognosis (Kourou et al. 2015), credit scoring (Kvamme et al. 2018), impact sensitivity of energetic crystals (Lansford et al. 2022), and money laundering detection (Jullum et al. 2020). These ML methods are often very complex, containing thousands, millions, or even billions of tunable model parameters, so understanding their complete underlying decision-making process is infeasible for humans. The appeal of ML methods lies in their potential to generate more accurate predictions than established statistical models, but this may come at the expense of model interpretability, as discussed by Johansson et al. (2011), Guo et al. (2019), Luo et al. (2019). Rudin (2019) conjectures that equally accurate but interpretable models exist across domains even though they might be hard to find.

The lack of understanding of how the input features of the ML model influence the model’s output is a major drawback. Hence, to remedy the absence of interpretation, the fields of explainable artificial intelligence (XAI) and interpretable machine learning (IML) have become active research fields in recent years (Adadi and Berrada 2018; Molnar 2022; Covert et al. 2021). Various explanation frameworks have been developed to extract hidden knowledge about the underlying data structure captured by the black-box model, making the model’s decision-making process more transparent. Model transparency is essential for, e.g., medical researchers who apply an intricate ML model to obtain well-performing predictions but who simultaneously also aim to discover important risk factors. The Right to Explanation legislation in the European Union’s General Data Protection Regulation (GDPR) has also been a driving factor (European Commission 2016).

One of the most commonly used explanation frameworks in XAI is Shapley values, which is an explanation methodology with a strong mathematical foundation and unique theoretical properties from cooperative game theory (Shapley 1953). Shapley values are most commonly used as a model-agnostic explanation framework for individual predictions, that is, for local explanations. Model-agnostic means that Shapley values do not rely on model internals and can be used to compare and explain any ML model trained on the same supervised learning problem. Local explanation means that Shapley values explain the local model behavior for a specific observation and not the global model behavior across all observations. The methodology has also been used to provide global explanations, see, e.g., Owen (2014), Covert et al. (2020), Frye et al. (2021), Giudici and Raffinetti (2021). See Molnar (2022) for an overview and detailed introduction to other explanation frameworks.

Shapley values originated in cooperative game theory but have been reintroduced as a model explanation framework by Strumbelj and Kononenko (2010), Strumbelj and Kononenko (2014), Lundberg and Lee (2017). Originally, Shapley values described a possible solution concept of how to fairly allocate a game’s payout among the players based on their contribution to the overall cooperation/payout. The solution concept is based on several desirable axioms, for which the Shapley values are the unique solution. When applying Shapley values as an explanation framework, we treat the features as the players, the predictive model as the game, and the corresponding prediction as the payout.

There are several ways to define the game, and these yield different types of Shapley values. For local explanations, the two main types are marginal and conditional Shapley values, and there is an ongoing debate about when to use them (Chen et al. 2020; Kumar et al. 2020; Chen et al. 2022). Briefly stated, the marginal version ignores dependencies between the features, while the conditional version incorporates them. Thus, a disadvantage of conditional Shapley values, compared to the marginal counterpart, is that they require the estimation/modeling of non-trivial conditional expectations. However, they are robust against adversarial attacks, which the marginal Shapley values are not (Blesch et al. 2023). Throughout this article, we refer to conditional Shapley values when discussing Shapley values if not otherwise specified.

There is a vast amount of literature on different approaches for estimating Shapley values (Strumbelj et al. 2009; Lundberg and Lee 2017; Lundberg et al. 2018; Redelmeier et al. 2020; Williamson and Feng 2020; Aas et al. 2021a, b; Frye et al. 2021; Covert et al. 2021; Olsen et al. 2022). These methods can be grouped into different method classes based on their characteristics, that is, if they (implicitly) assume feature independence or use empirical estimates, parametric assumptions, generative methods, and/or regression models; see Fig. 1. To the best of our knowledge, there exists no thorough and methodological comparison between all the method classes and approaches. Chen et al. (2022, Sect. 6) states that “[conditional Shapley values] constitutes an important future research direction that would benefit from new methods or systematic evaluations of existing approaches”.

Fig. 1 Schematic overview of the paradigms, method classes, and methods/approaches used in this article to compute conditional Shapley value explanations

In this article, we investigate existing methods, introduce several new approaches, and conduct extensive simulation studies, starting from a very simple setup with an interpretable model as a sanity check and gradually increasing the complexity of the predictive model. We also investigate how the data distribution, with varying levels of dependence, and the training sample size affect the estimation of the conditional expectations by the different methods. Finally, we conduct experiments on real-world data sets from the UCI Machine Learning Repository. In the numerical simulation studies, the parametric methods, which correctly (or nearly correctly) assume the data distribution, generate the most accurate Shapley values. However, if the data distribution is unknown, as for most real-world data sets, our experiments show that using either a generative method or a regression model with the same form as the predictive model is the best option. In addition to accuracy, we also investigate the computation times of the methods. Based on our findings, we present recommendations for when to use the different method classes and approaches.

In Sect. 2, we give an overview of Shapley values’ origin and their use as a model-agnostic explanation framework. The existing and novel methods for estimating the Shapley value explanations are described in Sect. 3. In Sect. 4, we present the simulation studies and discuss the corresponding results. We conduct experiments on real-world data sets in Sect. 5. Recommendations for when to use the different methods and a conclusion are given in Sects. 6 and 7, respectively. In the Appendix, we provide implementation details and more information about some of the methods. We provide additional methods and simulation studies in the Supplement.

2 Shapley values

In this section, we first briefly describe Shapley values in cooperative game theory before we elaborate on their use in model explanation.

2.1 Shapley values in cooperative game theory

Shapley values are a solution concept for how to divide the payout of a cooperative game \(v:\mathcal {P}({\mathcal {M}}) \mapsto \mathbb {R}\) based on four axioms (Shapley 1953). The game is played by M players, where \({\mathcal {M}}= \{1, 2,\dots , M\}\) denotes the set of all players and \(\mathcal {P}({\mathcal {M}})\) is the power set, that is, the set of all subsets of \({\mathcal {M}}\). We call \(v({\mathcal {S}})\) the contribution function, and it maps a subset of players \({\mathcal {S}}\in {\mathcal {P}}({\mathcal {M}})\), also called a coalition, to a real number representing their contribution in the game v. The Shapley values \(\phi _j = \phi _j(v)\) assigned to each player j, for \(j = 1, \dots , M\), uniquely satisfy the following properties:

  • Efficiency: They sum to the value of the grand coalition \({\mathcal {M}}\) over the empty set \(\emptyset \), that is, \(\sum _{j=1}^M \phi _j = v({\mathcal {M}}) - v(\emptyset )\).

  • Symmetry: Two equally contributing players j and k, that is, \(v({\mathcal {S}}\cup \{j\}) = v({\mathcal {S}}\cup \{k\})\) for all \({\mathcal {S}}\), receive equal payouts \(\phi _j = \phi _k\).

  • Dummy: A non-contributing player j, that is, \(v({\mathcal {S}}) = v({\mathcal {S}}\cup \{j\})\) for all \({\mathcal {S}}\), receives \(\phi _j = 0\).

  • Linearity: A linear combination of n games \(\{v_1, \dots , v_n\}\), that is, \(v({\mathcal {S}}) = \sum _{k=1}^nc_kv_k({\mathcal {S}})\), has Shapley values given by \(\phi _j(v) = \sum _{k=1}^nc_k\phi _j(v_k)\).

Shapley (1953) showed that the values \(\phi _j\) which satisfy these axioms are given by

$$\begin{aligned} \phi _j = \sum _{\mathcal {S} \in {\mathcal {P}}(\mathcal {M} \backslash \{j\})} \frac{|\mathcal {S}|!(M-|\mathcal {S}|-1)!}{M!}\left( v(\mathcal {S} \cup \{j\}) - v(\mathcal {S}) \right) , \end{aligned}$$
(1)

where \(|\mathcal {S}|\) is the number of players in coalition \({\mathcal {S}}\). The number of terms in (1) is \(2^{M}\). Hence, the complexity grows exponentially with the number of players M. Each Shapley value is a weighted average of the player’s marginal contribution to each coalition \(\mathcal {S}\).
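To make (1) concrete, the following minimal R sketch computes exact Shapley values for a toy game by brute force. The contribution function v_toy and all helper names are ours, and the double loop over subsets is only feasible for small M.

```r
# A minimal sketch of Eq. (1): exact Shapley values for a small game.
# `v` is a hypothetical contribution function mapping a vector of player
# indices (a coalition) to a real number.
combn_list <- function(x, m) {
  if (m == 0) return(list(integer(0)))
  # combn(5, m) would be read as combn(1:5, m), so guard length-one inputs.
  if (length(x) == 1) return(if (m == 1) list(x) else list())
  cols <- combn(x, m)
  lapply(seq_len(ncol(cols)), function(i) cols[, i])
}

exact_shapley <- function(v, M) {
  phi <- numeric(M)
  for (j in seq_len(M)) {
    others <- setdiff(seq_len(M), j)
    for (size in 0:length(others)) {
      for (S in combn_list(others, size)) {
        w <- factorial(length(S)) * factorial(M - length(S) - 1) / factorial(M)
        phi[j] <- phi[j] + w * (v(c(S, j)) - v(S))
      }
    }
  }
  phi
}

# Toy 3-player game; the efficiency axiom gives sum(phi) == v(1:3) - v(integer(0)).
v_toy <- function(S) sum(S)^2
exact_shapley(v_toy, M = 3)
```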

2.2 Shapley values in model explanation

We consider the setting of supervised learning where we aim to explain a predictive model \(f(\varvec{x})\) trained on \(\mathcal {X} = \{\varvec{x}^{[i]}, y^{[i]}\}_{i = 1}^{N_\text {train}}\), where \(\varvec{x}^{[i]}\) is an M-dimensional feature vector, \(y^{[i]}\) is a univariate response, and \(N_\text {train}\) is the number of training observations. The prediction \(\hat{y} = f(\varvec{x})\), for a specific feature vector \(\varvec{x} = \varvec{x}^*\), is explained using Shapley values as a model-agnostic explanation framework (Strumbelj and Kononenko 2010, 2014; Lundberg and Lee 2017). The fairness aspect of Shapley values in the model explanation setting is discussed in, for example, Chen et al. (2020), Fryer et al. (2021), Aas et al. (2021a).

In the Shapley value framework, the predictive model f (indirectly) replaces the cooperative game, and the M-dimensional feature vector replaces the M players. The Shapley value \(\phi _j\) describes the importance of the jth feature in the prediction \(f(\varvec{x}^*) = \phi _0 + \sum _{j=1}^M\phi _j^*\), where \(\phi _0 = \mathbb {E} \left[ f(\varvec{x})\right] \). That is, the sum of the Shapley values explains the difference between the prediction \(f(\varvec{x}^*)\) and the global average prediction.

To calculate (1), we need to define an appropriate contribution function \(v(\mathcal {S}) = v(\mathcal {S}, \varvec{x}^*)\), which should resemble the value of \(f(\varvec{x}^*)\) when only the features in coalition \(\mathcal {S}\) are known. We use the contribution function proposed by Lundberg and Lee (2017), namely the expected response of \(f(\varvec{x})\) conditioned on the features in \(\mathcal {S}\) taking on the values \(\varvec{x}_\mathcal {S}^*\). That is,

$$\begin{aligned} v({\mathcal {S}}) = \mathbb {E}\left[ f(\varvec{x}) \,\big |\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] = \mathbb {E}\left[ f(\varvec{x}_{\bar{\mathcal {S}}}, {\varvec{x}_{\mathcal {S}}}) \,\big |\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] = \int f(\varvec{x}_{\bar{\mathcal {S}}}, {\varvec{x}_{\mathcal {S}}^*})\, p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*})\, \textrm{d}\varvec{x}_{\bar{\mathcal {S}}}, \end{aligned}$$
(2)

where \({\varvec{x}_{\mathcal {S}}}= \{x_j:j \in \mathcal {S}\}\) denotes the features in subset \(\mathcal {S}\), \(\varvec{x}_{\bar{\mathcal {S}}}\) denotes the features outside \(\mathcal {S}\), that is, \(\bar{\mathcal {S}} = {\mathcal {M}}\backslash {\mathcal {S}}\), and \(p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*})\) is the conditional density of \(\varvec{x}_{\bar{\mathcal {S}}}\) given \({\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\). The conditional expectation summarizes the whole probability distribution, it is the most common estimator in prediction applications, and it is also the minimizer of the commonly used squared error loss function (Aas et al. 2021a). Note that the last equality of (2) only holds for continuous features. If there are any discrete or categorical features, the integral should be replaced by sums for these features. Hence, \(p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*})\) is then no longer continuous.

The contribution function in (2) is also used by, for example, Covert et al. (2020), Aas et al. (2021a), Aas et al. (2021b), Frye et al. (2021), Olsen et al. (2022). Covert et al. (2021) argue that the conditional approach in (2) is the only approach that is consistent with standard probability axioms. Computing (2) is not straightforward for a general data distribution and model. Assuming independent features, or having f be linear, simplifies the computations (Lundberg and Lee 2017; Aas et al. 2021a), but these assumptions do not hold in general.

To compute the Shapley values in (1), we need to compute the contribution function \(v({\mathcal {S}})\) in (2) for all \({\mathcal {S}}\in {\mathcal {P}}({\mathcal {M}})\), except for the edge cases \(\mathcal {S} \in \left\{ \emptyset , {\mathcal {M}} \right\} \). For \({\mathcal {S}}= {\mathcal {M}}\), we have that \({\varvec{x}_{\mathcal {S}}}= \varvec{x}^*\) and \(v(\mathcal {M}) = f(\varvec{x}^*)\) by definition. For \({\mathcal {S}}= \emptyset \), we have by definition that \(\phi _0 = v(\emptyset ) = \mathbb {E}[f(\varvec{x})]\), where the average training response is a commonly used estimate (Aas et al. 2021a). We denote the non-trivial coalitions by \({\mathcal {P}}^*({\mathcal {M}}) = {\mathcal {P}}({\mathcal {M}}) \backslash \{\emptyset , {\mathcal {M}}\}\). The Shapley values \({\varvec{\phi }}^* = \{\phi _j^*\}_{j=0}^M\) for the prediction \(f(\varvec{x}^*)\) are computed as the solution of a weighted least squares problem (Charnes et al. 1988; Lundberg and Lee 2017; Aas et al. 2021a). We refer to Molnar (2023) for an extensive introduction to the Shapley value explanation framework.

In Sects. 2.2.1 and 2.2.2, we describe two prominent paradigms for estimating the contribution function \(v(\mathcal {S})\) for all \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), namely, Monte Carlo integration and regression.

2.2.1 Monte Carlo integration

One way to estimate the contribution function \(v(\mathcal {S})\) is by using Monte Carlo integration (Lundberg and Lee 2017; Aas et al. 2021a). That is,

$$\begin{aligned} v({\mathcal {S}}) = \mathbb {E}\left[ f(\varvec{x}_{\bar{\mathcal {S}}}, {\varvec{x}_{\mathcal {S}}^*}) \,\big |\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] \approx \frac{1}{K}\sum _{k=1}^{K} f(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}, {\varvec{x}_{\mathcal {S}}^*}) = \hat{v}({\mathcal {S}}), \end{aligned}$$
(3)

where f is the predictive model, \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)} \sim p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*})\), for \(k=1,2,\dots ,K\), and K is the number of Monte Carlo samples. We insert \(\hat{v}(\mathcal {S})\) into (1) to estimate the Shapley values. To obtain accurate conditional Shapley values, we need to generate Monte Carlo samples that follow the true conditional distribution of the data. This distribution is generally not known and needs to be estimated based on the training data. In Sects. 3.1, 3.2, 3.3, and 3.4, we describe different method classes for generating the conditional samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\).
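The estimator in (3) can be written as a short generic routine. The sketch below assumes a hypothetical conditional sampler sample_cond (any of the method classes in Sects. 3.1, 3.2, 3.3, and 3.4 can play this role) that returns a \(K \times |\bar{\mathcal {S}}|\) matrix of samples, and a prediction function f that is vectorized over the rows of a matrix.

```r
# Sketch of the Monte Carlo estimator in (3). `f` and `sample_cond` are
# assumed names; `sample_cond(x_star, S, K)` returns K draws of the
# unconditioned features given x_S = x_star[S].
estimate_v <- function(f, sample_cond, x_star, S, K = 100) {
  Sbar <- setdiff(seq_along(x_star), S)
  x_full <- matrix(rep(x_star, each = K), nrow = K)  # K copies of x_star
  x_full[, Sbar] <- sample_cond(x_star, S, K)        # overwrite unconditioned part
  mean(f(x_full))                                    # average of K model calls
}
```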

2.2.2 Regression

As stated above, the conditional expectation (2) is the minimizer of the mean squared error loss function. That is,

$$\begin{aligned} v({\mathcal {S}}) = \mathbb {E}\left[ f(\varvec{x}_{\bar{\mathcal {S}}}, {\varvec{x}_{\mathcal {S}}^*}) \,\big |\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] = \mathop {\textrm{argmin}}\limits _{c}\, \mathbb {E}\left[ \left( f(\varvec{x}_{\bar{\mathcal {S}}}, {\varvec{x}_{\mathcal {S}}^*}) - c\right) ^2 \,\big |\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] . \end{aligned}$$
(4)

Thus, any regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\), which is fitted with the mean squared error loss function as the objective function, will approximate (4), yielding an alternative estimator \(\hat{v}({\mathcal {S}})\) (Aas et al. 2021a; Frye et al. 2021; Williamson and Feng 2020). The accuracy of the approximation will depend on the form of the predictive model \(f(\varvec{x})\), the flexibility of the regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\), and the optimization routine. We can either train a separate regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) for each \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), or we can train a single regression model \(g(\tilde{\varvec{x}}_{\mathcal {S}})\) which approximates the contribution function \(v({\mathcal {S}})\) for all \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\) simultaneously. Here \(\tilde{\varvec{x}}_{\mathcal {S}}\) is an augmented, fixed-length version of \({\varvec{x}_{\mathcal {S}}}\), where the augmented values are mask values to be explained later. We elaborate on the notation and details of these two regression methodologies in Sects. 3.5 and 3.6, respectively.

2.2.3 Approximation strategies

In general, computing Shapley values is an NP-hard problem (Deng and Papadimitriou 1994; Faigle and Kern 1992), and the complexity of the direct computation of (1) is exponential in the number of features M. In this section, we highlight strategies using approximations or model assumptions to reduce the Shapley value explanation framework's computational complexity and make it tractable in higher dimensions. However, this paper's primary aim is to accurately compare methods for estimating \(v({\mathcal {S}})\), and using an approximation strategy would introduce additional uncertainty into the explanations. Thus, we consider relatively low-dimensional settings in the experiments in Sects. 4 and 5, where the exact computation of the Shapley value formula in (1) is feasible. The approximate speed-up strategies can be divided into model-agnostic and model-specific strategies (Chen et al. 2022).

The model-agnostic strategies put no assumptions on the predictive model f and often use stochastic sampling-based estimators (Aas et al. 2021a; Lundberg and Lee 2017; Okhrati and Lipani 2021; Mitchell et al. 2022). That is, to speed up the computations, they approximate the Shapley value explanations using a sampled subset of the coalitions instead of considering the exponential number of them. The strategies are therefore stochastic, but the resulting estimators are unbiased. One of the most common model-agnostic strategies is the KernelSHAP strategy introduced in Lundberg and Lee (2017) and improved by Covert and Lee (2021). In the KernelSHAP strategy, we sample, e.g., \(N_{\mathcal {S}}= 2000 < 2^M\) coalitions and use only these coalitions to approximate the Shapley value explanations. This strategy enables us to approximate the explanations in tractable time even for large values of M; however, choosing \(N_{\mathcal {S}}\ll 2^M\) will likely produce poor approximations. The code sketch below illustrates the coalition-sampling idea.
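As an illustration, the following R sketch samples coalitions proportionally to the Shapley kernel weights. It omits the refinements of Covert and Lee (2021), such as paired sampling and deduplication, and all names are ours.

```r
# Sketch of KernelSHAP-style coalition subsampling: draw N_S coalitions with
# probability proportional to the Shapley kernel weights.
sample_coalitions <- function(M, N_S) {
  sizes <- 1:(M - 1)
  # The per-coalition kernel weight is (M - 1) / (choose(M, s) * s * (M - s)),
  # and there are choose(M, s) coalitions of size s, so the total weight of
  # each size is proportional to 1 / (s * (M - s)).
  p_size <- (M - 1) / (sizes * (M - sizes))
  p_size <- p_size / sum(p_size)
  replicate(N_S, {
    s <- sample(sizes, 1, prob = p_size)
    sort(sample.int(M, s))  # a uniformly drawn coalition of the sampled size
  }, simplify = FALSE)
}
str(sample_coalitions(M = 10, N_S = 5))
```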

The model-specific strategies put assumptions on the predictive model f to reduce the computational cost, but some of the strategies are restricted to marginal Shapley values. For conditional Shapley values, Aas et al. (2021a), Chen et al. (2020) derive explicit expressions for linear models to speed up the computations, and Lundberg et al. (2020) propose the path-dependent TreeSHAP algorithm for tree-based models. There are speed-up strategies for deep neural network models, too, but they are limited to marginal Shapley values (Ancona et al. 2019; Wang et al. 2020).

For more details about the model-agnostic and model-specific strategies, we refer to Chen et al. (2022, Sect. 5.2), which provides an excellent introduction to both. Jethani et al. (2021) propose another strategy called FastSHAP, which sidesteps the Shapley value formula by training a black-box neural network to directly output the Shapley value explanations. Another strategy for reducing the computations is to explain groups of similar/correlated features instead of individual features (Jullum et al. 2021).

3 Conditional expectation estimation

Computing conditional Shapley values is difficult due to the complexity of estimating the conditional distributions, which are not directly available from the training data. In this section, we give a methodological introduction to different methods for estimating the conditional expectation in (2) via either Monte Carlo integration or regression, while we provide implementation details in Appendix A. We organize the methods into six method classes in accordance with those described in Chen et al. (2022, Sect. 5.1.3) and Covert et al. (2021, Sect. 8.2). The method classes we consider are independence, empirical, parametric, generative, separate regression, and surrogate regression, described in Sects. 3.1, 3.2, 3.3, 3.4, 3.5, and 3.6, respectively. The first four classes estimate the conditional expectation in (2) using Monte Carlo integration, while the last two classes use regression.

3.1 The independence method

Lundberg and Lee (2017) avoided estimating the complex conditional distributions by implicitly assuming feature independence. In the independence approach, the conditional distribution simplifies to \(p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}) = p(\varvec{x}_{\bar{\mathcal {S}}})\), and the corresponding Shapley values are the marginal Shapley values discussed in Sect. 1. The Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) are generated by randomly sampling observations from the training data; thus, no modeling is needed, and \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) follows the assumed true data distribution. However, for dependent features, which are common in observational studies, the independence approach produces biased estimates of the contribution function (2) and the conditional Shapley values. Thus, the independence approach can lead to incorrect conditional Shapley value explanations for real-world data (Aas et al. 2021a; Merrick and Taly 2020; Frye et al. 2021; Olsen et al. 2022).
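As a sketch, the independence sampler reduces to drawing random training rows; X_train is an assumed \(N_\text {train} \times M\) matrix.

```r
# The independence method as a conditional sampler (sketch): ignore x_S and
# draw the unconditioned features from randomly chosen training rows.
sample_independence <- function(X_train, S, K) {
  Sbar <- setdiff(seq_len(ncol(X_train)), S)
  X_train[sample.int(nrow(X_train), K, replace = TRUE), Sbar, drop = FALSE]
}
```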

3.2 The empirical method

Instead of sampling randomly from the training data, the empirical method samples only from similar observations in the training data. The optimal procedure is to use only samples that perfectly match the feature values \({\varvec{x}_{\mathcal {S}}^*}\), as this approach exactly estimates the conditional expectation when the number of matching observations tends to infinity (Chen et al. 2022). However, this is not applicable in practice, as data sets can have few observations, contain a high number of features to match, or have continuous features where an exact match is very unlikely. A natural extension is to relax the perfect match criterion and allow for similar observations (Mase et al. 2019; Sundararajan and Najmi 2020; Aas et al. 2021a). However, this procedure will also be influenced by the curse of dimensionality as conditioning on many features can yield few similar observations and thereby inaccurate estimates of the conditional expectation (2). We can relax the similarity criterion and include less similar observations, but then we break the feature dependencies. The empirical approach coincides with the independence approach when the similarity measure defines all observations in the training data as similar.

We use the empirical approach described in Aas et al. (2021a). The approach uses a scaled version of the Mahalanobis distance to calculate a distance \(D_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]})\) between the observation being explained \(\varvec{x}^*\) and every training instance \(\varvec{x}^{[i]}\). A Gaussian kernel with bandwidth parameter \(\sigma \) then converts the distance into a weight \(w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]})\). The weights are sorted in decreasing order, with \(\varvec{x}^{\left\{ k \right\} }\) denoting the training observation with the kth largest weight. Finally, (2) is approximated by a weighted version of (3), namely, \(\hat{v}({\mathcal {S}}) = \sum _{k=1}^{K^*} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} }) f(\varvec{x}_{\bar{\mathcal {S}}}^{\left\{ k \right\} }, {\varvec{x}_{\mathcal {S}}^*}) \big / \sum _{k=1}^{K^*} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} })\). The number of samples used is \(K^* = \min _{L \in \mathbb {N}} \big \{\sum _{k=1}^L w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{\left\{ k \right\} }) \big / \sum _{i=1}^{N_\text {train}} w_{\mathcal {S}}(\varvec{x}^*, \varvec{x}^{[i]}) > \eta \big \}\), that is, the ratio between the sum of the \(K^*\) largest weights and the sum of all weights must be at least \(\eta \), for instance, 0.95.
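A minimal sketch of the weighting scheme, following our reading of Aas et al. (2021a); the exact scaling in their implementation may differ, and all names are ours.

```r
# Sketch of the empirical method's weights: a scaled Mahalanobis distance on
# the conditioned features, mapped to Gaussian kernel weights.
empirical_weights <- function(X_train, x_star, S, sigma = 0.1) {
  X_S <- X_train[, S, drop = FALSE]
  # stats::mahalanobis() returns the squared distance; scale it by |S|.
  d2 <- mahalanobis(X_S, center = x_star[S], cov = cov(X_S)) / length(S)
  exp(-d2 / (2 * sigma^2))
}

# Usage: keep the K* largest weights covering at least eta of the total mass.
# w <- sort(empirical_weights(X_train, x_star, S), decreasing = TRUE)
# K_star <- which(cumsum(w) / sum(w) > 0.95)[1]
```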

Note that since Aas et al. (2021a) use the Mahalanobis distance, their approach is limited to continuous features. One could potentially extend the method by using a distance measure that supports mixed data, for example, Gower's distance (Gower 1971; Podani 1999). Another solution is to use encodings such as one-hot encoding or entity embeddings to represent the categorical variables as numerical (Guo and Berkhahn 2016), although that would increase the computational demand due to the increased dimension.

3.3 The parametric method class

In the parametric method class, we make a parametric assumption about the distribution of the data. This simplifies the process of generating the conditional Monte Carlo samples . The idea is to assume a distribution whose conditional distributions have closed-form solutions or are otherwise easily obtainable after estimating the parameters of the full joint distribution. The parametric approaches can yield very accurate representations if the data truly follows the assumed distribution, but they may impose a large bias for incorrect parametric assumptions. In this section, we discuss two previously proposed parametric approaches and introduce two new methods. The current parametric approaches do not support categorical features, which is a major drawback, but one can potentially use the same type of encodings or entity embeddings of the categorical variables as for the empirical method.

3.3.1 Gaussian

Both Chen et al. (2020), Aas et al. (2021a) assume that the observations are multivariate Gaussian distributed with mean \({\varvec{\mu }}\) and covariance matrix \({\varvec{\Sigma }}\). That is, \(\varvec{x} = \{{\varvec{x}_{\mathcal {S}}}, \varvec{x}_{\bar{\mathcal {S}}}\} \sim \mathcal {N}_M({\varvec{\mu }}, {\varvec{\Sigma }})\), where \({\varvec{\mu }} = \{{\varvec{\mu }}_{\mathcal {S}}, {\varvec{\mu }}_{\bar{\mathcal {S}}}\}\) and \({\varvec{\Sigma }} = \big [ {\begin{matrix} {\varvec{\Sigma }}_{{\mathcal {S}}{\mathcal {S}}} &{} {\varvec{\Sigma }}_{{\mathcal {S}}\bar{\mathcal {S}}} \\ {\varvec{\Sigma }}_{\bar{\mathcal {S}}{\mathcal {S}}} &{} {\varvec{\Sigma }}_{\bar{\mathcal {S}}\bar{\mathcal {S}}} \end{matrix}}\big ]\). The conditional distribution is also multivariate Gaussian, that is, \(p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}) = \mathcal {N}_{|\bar{\mathcal {S}}|}({\varvec{\mu }}_{\bar{\mathcal {S}} | {\mathcal {S}}}, {\varvec{\Sigma }}_{\bar{\mathcal {S}} | {\mathcal {S}}})\), where \({\varvec{\mu }}_{\bar{\mathcal {S}} | {\mathcal {S}}} = {\varvec{\mu }}_{\bar{\mathcal {S}}} + {\varvec{\Sigma }}_{\bar{\mathcal {S}}{\mathcal {S}}}{\varvec{\Sigma }}_{{\mathcal {S}}{\mathcal {S}}}^{-1}({\varvec{x}_{\mathcal {S}}^*}- {\varvec{\mu }}_{\mathcal {S}})\) and \({\varvec{\Sigma }}_{\bar{\mathcal {S}} | {\mathcal {S}}} = {\varvec{\Sigma }}_{\bar{\mathcal {S}}\bar{\mathcal {S}}} - {\varvec{\Sigma }}_{\bar{\mathcal {S}}{\mathcal {S}}}{\varvec{\Sigma }}_{{\mathcal {S}}{\mathcal {S}}}^{-1}{\varvec{\Sigma }}_{{\mathcal {S}}\bar{\mathcal {S}}}\). The parameters \({\varvec{\mu }}\) and \({\varvec{\Sigma }}\) are easily estimated using the sample mean and covariance matrix of the training data, respectively. In the Gaussian approach, we sample the conditional samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\) from \(\mathcal {N}_{|\bar{\mathcal {S}}|}({\varvec{\mu }}_{\bar{\mathcal {S}} | {\mathcal {S}}}, {\varvec{\Sigma }}_{\bar{\mathcal {S}} | {\mathcal {S}}})\), for \(k=1,2,\dots ,K\) and \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), and use them in (3) to estimate the Shapley values in (1).
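The closed-form conditional distribution makes the Gaussian method straightforward to sketch in base R; all function and variable names are ours.

```r
# Sketch of the Gaussian method: estimate (mu, Sigma) from the training data
# and draw K samples from the analytical conditional Gaussian for coalition S.
sample_gaussian_cond <- function(X_train, x_star, S, K = 100) {
  mu <- colMeans(X_train)
  Sigma <- cov(X_train)
  Sbar <- setdiff(seq_len(ncol(X_train)), S)
  A <- Sigma[Sbar, S, drop = FALSE] %*% solve(Sigma[S, S, drop = FALSE])
  mu_cond <- drop(mu[Sbar] + A %*% (x_star[S] - mu[S]))
  Sigma_cond <- Sigma[Sbar, Sbar, drop = FALSE] - A %*% Sigma[S, Sbar, drop = FALSE]
  R <- chol(Sigma_cond)            # upper-triangular factor, t(R) %*% R == Sigma_cond
  Z <- matrix(rnorm(K * length(Sbar)), nrow = K)
  sweep(Z %*% R, 2, mu_cond, "+")  # K draws from N(mu_cond, Sigma_cond)
}
```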

3.3.2 Gaussian copula

Aas et al. (2021a) also proposed an alternative approach if the features are far from multivariate Gaussian distributed, namely the (Gaussian) copula approach. The idea is to represent the marginals of the features by their empirical distributions and then model the dependence structure by a Gaussian copula. See Appendix B.1 for additional information about copulas.

Assuming a Gaussian copula, Aas et al. (2021a) use the following procedure, sketched in code after the list, to generate the K conditional Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\):

  1. Convert each marginal \(x_j\) of the feature distribution \(\varvec{x}\) to a Gaussian feature \(v_j\) by \(v_j = \Phi ^{-1}(\hat{F}_j(x_j))\), where \(\hat{F}_j(x_j)\) is the empirical distribution function of marginal j.

  2. Assume that \(\varvec{v}\) is distributed according to a multivariate Gaussian (the quality of this assumption will depend on how close the Gaussian copula is to the true copula), and sample from the conditional distribution using the method described in Sect. 3.3.1.

  3. Convert the margins \(v_j\) in the conditional distribution to the original distribution using \(\hat{x}_j = \hat{F}_j^{-1}(\Phi (v_j))\).
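A minimal sketch of the margin transforms in steps 1 and 3, assuming continuous features. We rescale the empirical CDF by \(n/(n+1)\) to avoid infinite values at the sample maximum; this is our implementation choice and not necessarily that of Aas et al. (2021a).

```r
# Margin transforms for the Gaussian copula approach (steps 1 and 3).
# The n/(n + 1) rescaling (our choice) avoids qnorm(1) = Inf at the maximum.
to_gaussian <- function(x, x_train) {
  n <- length(x_train)
  qnorm(ecdf(x_train)(x) * n / (n + 1))
}
from_gaussian <- function(v, x_train) {
  quantile(x_train, probs = pnorm(v), type = 1, names = FALSE)
}
# Step 2 reuses the conditional Gaussian sampler from Sect. 3.3.1 on the
# transformed features.
```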

3.3.3 Burr and generalized hyperbolic

The multivariate Gaussian distribution is probably the most well-known multivariate distribution with closed-form expressions for the conditional distributions. However, any other distribution with easily obtainable conditional distributions is also applicable, for example, the multivariate Burr distribution (Takahasi 1965; Yari and Jafari 2006) and the multivariate generalized hyperbolic (GH) distribution (Barndorff-Nielsen 1977; McNeil et al. 2015; Browne and McNicholas 2015; Wei et al. 2019). We call these two approaches Burr and GH, respectively. In contrast to the Gaussian distribution, whose parameters can easily be estimated by the sample means and covariance matrix, the parameters of the Burr and GH distributions are more cumbersome to estimate. We describe the distributions in more detail in Appendix B. The GH distribution is unbounded and can model any continuous data set, while the Burr distribution is strictly positive and is therefore limited to positive data sets. The GH distribution is related to the Gaussian distribution through the t-distribution: the latter is a special case of the GH distribution and coincides with the Gaussian distribution when the degrees of freedom tend to infinity.

3.4 The generative method class

The generative and parametric methods are similar in that they both generate Monte Carlo samples from the estimated conditional distributions. However, the generative methods do not make a parametric assumption about the data. We consider two generative approaches: the ctree approach of Redelmeier et al. (2020) and the \(\texttt {VAEAC}\) approach of Olsen et al. (2022). The latter is an extension of the approach suggested by Frye et al. (2021). Both methods support mixed data, i.e., both continuous and categorical data.

3.4.1 Ctree

Redelmeier et al. (2020) compute conditional Shapley values by modeling the dependence structure between the features with conditional inference trees (ctree). A ctree is a type of recursive partitioning algorithm that builds trees recursively by making binary splits on features until a stopping criterion is satisfied (Hothorn et al. 2006). The process is sequential, where the splitting feature is chosen first using statistical significance tests, and then the splitting point is chosen using any type of splitting criterion. The ctree algorithm is independent of the dimension of the response, which in our case is \(\varvec{x}_{\bar{\mathcal {S}}}\), while the input features are \({\varvec{x}_{\mathcal {S}}}\), whose dimension varies with the coalition \({\mathcal {S}}\). That is, for each coalition \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), a ctree with \({\varvec{x}_{\mathcal {S}}}\) as the features and \(\varvec{x}_{\bar{\mathcal {S}}}\) as the response is fitted to the training data. For a given \({\varvec{x}_{\mathcal {S}}^*}\), the ctree approach finds the corresponding leaf node and samples K observations with replacement from the training observations in that node to generate the conditional Monte Carlo samples \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}\). We get duplicated Monte Carlo samples when K is larger than the number of training observations in the leaf node. Thus, the ctree method weighs the Monte Carlo samples by their sampling frequencies to bypass redundant calls to f. The contribution function \(v({\mathcal {S}})\) is therefore not estimated by (3) but rather by the weighted average \(\hat{v}({\mathcal {S}}) = \sum _{k=1}^{K^*} w_k f(\varvec{x}_{\bar{\mathcal {S}}}^{(k)}, {\varvec{x}_{\mathcal {S}}^*}) \big / \sum _{k=1}^{K^*} w_k\), where \(K^*\) is the number of unique Monte Carlo samples and \(w_k\) is the sampling frequency of the kth unique sample. For more details, see Redelmeier et al. (2020, Sect. 3).
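The following R sketch illustrates the leaf-node resampling idea with partykit::ctree(). For readability, we let a single unconditioned feature act as the tree's response, whereas the actual method fits one tree per coalition with the full multivariate unconditioned feature block as the response.

```r
# Sketch of the ctree sampler for one coalition S = {1, 2}, with the single
# unconditioned feature x3 as the response (a simplification of ours).
library(partykit)

set.seed(1)
train <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
train$x3 <- train$x1 + rnorm(500)

ct <- ctree(x3 ~ x1 + x2, data = train)
x_star <- data.frame(x1 = 0.5, x2 = -1)

leaf <- predict(ct, newdata = x_star, type = "node")  # leaf node of x_star
in_leaf <- predict(ct, type = "node") == leaf         # training rows in that leaf
mc_samples <- sample(train$x3[in_leaf], size = 100, replace = TRUE)
```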

3.4.2 VAEAC

Olsen et al. (2022) use a type of variational autoencoder called \(\texttt {VAEAC}\) (Ivanov et al. 2019) to generate the conditional Monte Carlo samples. Briefly stated, the original variational autoencoder (Kingma and Welling 2014, 2019; Rezende et al. 2014) gives a probabilistic representation of the true unknown distribution \(p(\varvec{x})\). The \(\texttt {VAEAC}\) model extends this methodology to all conditional distributions simultaneously. That is, a single \(\texttt {VAEAC}\) model can generate Monte Carlo samples for all coalitions \(\mathcal {S} \in {\mathcal {P}}^*(\mathcal {M})\). It is advantageous to only have to fit a single model for all coalitions, as the number of coalitions, \(2^M-2\), increases exponentially with the number of features. In contrast, ctree trains \(2^M-2\) different models, which eventually becomes computationally intractable for large M. The \(\texttt {VAEAC}\) model is trained by maximizing a variational lower bound, which conceptually corresponds to artificially masking features and then trying to reproduce them using a probabilistic representation. In deployment, the \(\texttt {VAEAC}\) method treats the unconditioned features as masked features to be imputed.

3.5 The separate regression method class

The next two method classes use regression instead of Monte Carlo integration to estimate the conditional expectation in (2). In the separate regression methods, we train a new regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) to estimate the conditional expectation for each coalition of features. Related ideas have been explored by Lipovetsky and Conklin (2001), Strumbelj et al. (2009), Williamson and Feng (2020). However, to the best of our knowledge, we are the first to compare different regression models for estimating the conditional expectation as the contribution function \(v(\mathcal {S})\) in the local Shapley value explanation framework.

The idea is to estimate \(v({\mathcal {S}}) = \mathbb {E}\left[ f(\varvec{x}) \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*}\right] \) separately for each coalition \({\mathcal {S}}\) using regression. As in Sect. 2.2, let \(\mathcal {X} = \{ \varvec{x}^{[i]}, y^{[i]} \}_{i=1}^{N_{\text {train}}}\) denote the training data, where \(\varvec{x}^{[i]}\) is the ith M-dimensional input and \(y^{[i]}\) is the associated response. For each \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), the corresponding training data set is

$$\begin{aligned} \mathcal {X}_{\mathcal {S}}= \left\{ \varvec{x}_{\mathcal {S}}^{[i]},\, z^{[i]} \right\} _{i=1}^{N_\text {train}}, \quad \text {where } z^{[i]} = f(\varvec{x}^{[i]}). \end{aligned}$$

For each data set \(\mathcal {X}_{\mathcal {S}}\), we train a regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) with respect to the mean squared error (MSE) loss function. The optimal regression model, with respect to this loss function, is \(g^{\dagger }_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}}) = \mathbb {E}\left[ f(\varvec{x}) \,|\, {\varvec{x}_{\mathcal {S}}}\right] \), which corresponds to the contribution function \(v({\mathcal {S}})\). The regression model \(g_{\mathcal {S}}\) aims for the optimal \(g^{\dagger }_{\mathcal {S}}\). Hence, \(g_{\mathcal {S}}\) resembles/estimates the contribution function, that is, \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}^*}) \approx v({\mathcal {S}}, \varvec{x}^*)\).
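A minimal sketch of the LM separate idea for a single coalition; X_train, X_test, and the prediction function f_hat are assumed names.

```r
# Sketch of the LM separate method for one coalition S = {1, 3}. X_train is a
# matrix with named columns, X_test holds the observations to explain, and
# f_hat() is the fitted predictive model's prediction function.
S <- c(1, 3)
dat_S <- data.frame(X_train[, S, drop = FALSE])
dat_S$z <- f_hat(X_train)            # regression response z^[i] = f(x^[i])
g_S <- lm(z ~ ., data = dat_S)       # g_S(x_S) estimates E[f(x) | x_S]
v_hat_S <- predict(g_S, newdata = data.frame(X_test[, S, drop = FALSE]))
```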

A wide variety of regression models minimize the MSE, and we describe a selection of them in Sects. 3.5.1, 3.5.2, 3.5.3, 3.5.4, and 3.5.5. The selection consists of classical regression models and models that generally performed well across our experiments.

3.5.1 Linear regression model

The simplest regression model we consider is the linear regression model. It takes the form \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}}) = \beta _{{\mathcal {S}},0} + \sum _{j \in {\mathcal {S}}} \beta _{{\mathcal {S}},j}x_j = \varvec{x}_{\mathcal {S}}^T\varvec{\beta }_{\mathcal {S}}\), where the coefficients \(\varvec{\beta }_{\mathcal {S}}\) are estimated by the least squares solution, that is, \(\hat{\varvec{\beta }}_{\mathcal {S}}= {\text {arg\,min}}_{\varvec{\beta }} \Vert X_{\mathcal {S}}\varvec{\beta } - \varvec{z}\Vert ^2 = (X_{\mathcal {S}}^TX_{\mathcal {S}})^{-1}X_{\mathcal {S}}^T\varvec{z}\), for all \({\mathcal {S}}\in \mathcal {P}^*(\mathcal {M})\). Here \(X_{\mathcal {S}}\) is the design matrix, with the first column consisting of 1s to also estimate the intercept \(\beta _{{\mathcal {S}},0}\). We call this approach LM separate.

3.5.2 Generalized additive model

The generalized additive model (GAM) extends the linear regression model and allows for nonlinear effects between the features and the response (Wood 2006b; Hastie et al. 2009). The fitted GAM takes the form \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}}) = \beta _{{\mathcal {S}},0} + \sum _{j \in {\mathcal {S}}} g_{{\mathcal {S}},j}(x_{{\mathcal {S}},j})\), where the effect functions \(g_{{\mathcal {S}},j}\) are penalized regression splines. We call this approach GAM separate.

3.5.3 Projection pursuit regression

The projection pursuit regression (PPR) model extends the GAM model (Friedman and Stuetzle 1981; Hastie et al. 2009). The PPR model takes the form \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}}) = \beta _{{\mathcal {S}},0} + \sum _{l=1}^L g_{{\mathcal {S}},l}(\varvec{\beta }_{{\mathcal {S}},l}^T{\varvec{x}_{\mathcal {S}}})\), where the parameter vector \(\varvec{\beta }_{{\mathcal {S}},l}\) is an \(|{\mathcal {S}}|\)-dimensional unit vector. The PPR is an additive model, but in the transformed features \(\varvec{\beta }_{{\mathcal {S}},l}^T{\varvec{x}_{\mathcal {S}}}\) rather than in the original features \({\varvec{x}_{\mathcal {S}}}\). The ridge functions \(g_{{\mathcal {S}},l}\) are unspecified and are estimated along with the parameters \(\varvec{\beta }_{{\mathcal {S}},l}\) using some flexible smoothing method. The PPR model combines nonlinear functions of linear combinations, producing a large class of potential models. Moreover, it is a universal approximator for continuous functions for arbitrarily large L and an appropriate choice of \(g_{{\mathcal {S}},l}\) (Hastie et al. 2009, Sect. 11.2). We call this approach PPR separate.

3.5.4 Random forest

A random forest (RF) is an ensemble model consisting of a multitude of decision trees, where the average prediction of the individual trees is returned. The first algorithm was developed by Ho (1995), but Breiman (2001) later extended the algorithm to include bootstrap aggregating to improve the stability and accuracy. We call this approach RF separate.

3.5.5 Boosting

A (tree-based) boosted model is an ensemble learner consisting of weighted weak base-learners, each iteratively fitted to the errors of the previous base-learners, which together form a strong learner (Hastie et al. 2009). The seminal boosting algorithm was developed by Freund and Schapire (1997), but a multitude of boosting algorithms have since been developed (Mayr et al. 2014), for example, CatBoost (Prokhorenkova et al. 2018). We call this approach CatBoost separate.

3.6 The surrogate regression method class

Since the separate regression methods train a new regression model \(g_{\mathcal {S}}({\varvec{x}_{\mathcal {S}}})\) for each coalition \(\mathcal {S} \in {\mathcal {P}}^*({\mathcal {M}})\), a total of \(2^M-2\) models have to be trained, which can be time-consuming for models that are slow to fit. The surrogate regression method class builds on the ideas from the separate regression class, but instead of fitting a new regression model for each coalition, we train a single regression model \(g(\tilde{\varvec{x}}_{\mathcal {S}})\) for all coalitions \({\mathcal {S}}\in {\mathcal {P}}^*({\mathcal {M}})\), where \(\tilde{\varvec{x}}_{\mathcal {S}}\) is defined in Sect. 3.6.1. The surrogate regression idea is used by Frye et al. (2021), Covert et al. (2021), but their setup is limited to neural networks. In Sect. 3.6.1, we propose a general and novel framework that allows us to use any regression model. Then, we relate our framework to the previously proposed neural network setup in Sect. 3.6.2.

3.6.1 General surrogate regression framework

To construct a surrogate regression method, we must consider that most regression models g rely on a fixed-length input, while the size of \({\varvec{x}_{\mathcal {S}}}\) varies with coalition \({\mathcal {S}}\). Thus, we are either limited to regression models that support variable-length input, or we can create a fixed-length representation \(\tilde{\varvec{x}}_{\mathcal {S}}\) of \({\varvec{x}_{\mathcal {S}}}\) for all coalitions \({\mathcal {S}}\). The \(\tilde{\varvec{x}}_{\mathcal {S}}\) representation must also include fixed-length information about the coalition \({\mathcal {S}}\) to enable the regression model g to distinguish between coalitions. Finally, we need to augment the training data to reflect that g is to predict the conditional expectation for all coalitions \({\mathcal {S}}\).

In our framework, we augment the training data by systematically applying all possible coalitions to all training observations. We can then train a single regression model g on the augmented training data set, and the corresponding regression model can then (in theory) estimate the contribution function \(v({\mathcal {S}})\) for all coalitions \({\mathcal {S}}\in \mathcal {P}^*(\mathcal {M})\) simultaneously.

To illustrate the augmentation idea, we consider a small example with \(M = 3\) features and \(N_\text {train} = 2\) training observations. Let \(\mathcal {X} = \big [ {\begin{matrix} x_{11} &{} x_{12} &{} x_{13} \\ x_{21} &{} x_{22} &{} x_{23} \end{matrix}}\big ]\) and \(\varvec{z} = \big [ {\begin{matrix} f(\varvec{x}_{1}) \\ f(\varvec{x}_{2}) \end{matrix}}\big ]\) denote the training data and responses, respectively. In this setting, \({\mathcal {M}}= \left\{ 1,2,3 \right\} \) and \({\mathcal {P}}^*({\mathcal {M}}) = \left\{ \left\{ 1 \right\} , \left\{ 2 \right\} , \left\{ 3 \right\} , \left\{ 1,2 \right\} , \left\{ 1,3 \right\} , \left\{ 2,3 \right\} \right\} \) consists of six different coalitions \({\mathcal {S}}\), or equivalently masks \(\bar{\mathcal {S}}\). Assuming that g relies on fixed-length input, we must represent both the observed values \({\varvec{x}_{\mathcal {S}}}\) and the coalition \({\mathcal {S}}\) in a fixed-length notation for the surrogate regression methods to work.

To solve this, we first introduce \(I({\mathcal {S}}) = \left\{ \textbf{1}(j \in {\mathcal {S}}): j = 1, \dots , M\right\} \in \{0, 1\}^M\), where \(\textbf{1}(j \in {\mathcal {S}})\) is the indicator function, which is one if \(j \in {\mathcal {S}}\) and zero otherwise. Then, \(I({\mathcal {S}})\) is an M-dimensional binary vector where the jth element \(I({\mathcal {S}})_j\) is one if the jth feature is in \({\mathcal {S}}\) (i.e., observed/conditioned on) and zero if it is in \(\bar{\mathcal {S}}\) (i.e., unobserved/unconditioned). The I function ensures fixed-length representations of the coalitions/masks, and note that \(I({\mathcal {S}}) + I(\bar{\mathcal {S}}) = \textbf{1}^M\), where \(\textbf{1}^M\) is the size M vector of 1s. Second, to obtain a fixed-length representation \(\hat{\varvec{x}}_{{\mathcal {S}}}\) of the observed/conditioned feature vector \(\varvec{x}_{{\mathcal {S}}}\), we apply the fixed-length mask \(I({\mathcal {S}})\) to \(\varvec{x}\) as an element-wise product, that is, \(\hat{\varvec{x}}_{{\mathcal {S}}} = \varvec{x} \circ I({\mathcal {S}})\), where \(\circ \) is the element-wise product. Finally, we concatenate the fixed-length representations together to form the augmented version of \({\varvec{x}_{\mathcal {S}}}\), namely, \(\tilde{\varvec{x}}_{{\mathcal {S}}} = \{\hat{\varvec{x}}_{{\mathcal {S}}}, I({\mathcal {S}})\}\), which has \(2\,M\) entries. We include \(I({\mathcal {S}})\) in \(\tilde{\varvec{x}}_{{\mathcal {S}}}\) such that the model g can distinguish between actual zeros in \({\varvec{x}_{\mathcal {S}}}\) and those induced by the masking procedure when creating \(\hat{\varvec{x}}_{{\mathcal {S}}}\). We treat the mask entries \(I({\mathcal {S}})\) as binary categorical features.

After carrying out this procedure for all coalitions and training observations, we obtain the following augmented training data and responses:

$$\begin{aligned} \mathcal {X}_\text {aug} = \left[ \begin{matrix} x_{11} &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} x_{12} &{} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} x_{13} &{} 0 &{} 0 &{} 1 \\ x_{11} &{} x_{12} &{} 0 &{} 1 &{} 1 &{} 0 \\ x_{11} &{} 0 &{} x_{13} &{} 1 &{} 0 &{} 1 \\ 0 &{} x_{12} &{} x_{13} &{} 0 &{} 1 &{} 1 \\ x_{21} &{} 0 &{} 0 &{} 1 &{} 0 &{} 0 \\ 0 &{} x_{22} &{} 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} x_{23} &{} 0 &{} 0 &{} 1 \\ x_{21} &{} x_{22} &{} 0 &{} 1 &{} 1 &{} 0 \\ x_{21} &{} 0 &{} x_{23} &{} 1 &{} 0 &{} 1 \\ 0 &{} x_{22} &{} x_{23} &{} 0 &{} 1 &{} 1 \end{matrix} \right] , \qquad \varvec{z}_\text {aug} = \left[ \begin{matrix} f(\varvec{x}_{1}) \\ f(\varvec{x}_{1}) \\ f(\varvec{x}_{1}) \\ f(\varvec{x}_{1}) \\ f(\varvec{x}_{1}) \\ f(\varvec{x}_{1}) \\ f(\varvec{x}_{2}) \\ f(\varvec{x}_{2}) \\ f(\varvec{x}_{2}) \\ f(\varvec{x}_{2}) \\ f(\varvec{x}_{2}) \\ f(\varvec{x}_{2}) \end{matrix} \right] . \end{aligned}$$
(5)

The number of rows in \(\mathcal {X}_\text {aug}\) and \(\varvec{z}_\text {aug}\) is \(N_\text {train}(2^M-2)\). For example, with \(N_\text {train} = 1000\) and \(M=8\), the augmented data \(\mathcal {X}_\text {aug}\) consists of 254,000 rows, while the number of rows is 65,534,000 when \(M=16\). This exponential growth can make it computationally intractable to fit some types of regression models to the augmented training data \(\left\{ \mathcal {X}_\text {aug}, \varvec{z}_\text {aug} \right\} \) in high dimensions. A potential solution is to sample a subset of rows from (5) and use only those to train the regression model. The observation \(\varvec{x}^*\), which we want to explain, is augmented by the same procedure, and \(g(\tilde{\varvec{x}}^*_{{\mathcal {S}}})\) then approximates the corresponding contribution function \(v({\mathcal {S}}, \varvec{x}^*)\).
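A minimal base R sketch of this augmentation (all names are ours); it materializes all \(N_\text {train}(2^M-2)\) rows and is therefore only practical for moderate M.

```r
# Sketch of the augmentation in (5): one masked copy of every training row per
# non-trivial coalition, with the binary mask I(S) appended as extra features.
augment_training_data <- function(X_train, z) {
  M <- ncol(X_train)
  masks <- as.matrix(expand.grid(rep(list(0:1), M)))             # all 2^M masks
  masks <- masks[rowSums(masks) %in% 1:(M - 1), , drop = FALSE]  # drop {} and M
  X_aug <- do.call(rbind, lapply(seq_len(nrow(masks)), function(k) {
    cbind(sweep(X_train, 2, masks[k, ], "*"),                    # x o I(S)
          matrix(masks[k, ], nrow(X_train), M, byrow = TRUE))    # append I(S)
  }))
  list(X_aug = X_aug, z_aug = rep(z, times = nrow(masks)))
}
```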

Methods

For the surrogate regression method class, we consider the same regression models as in Sect. 3.5. We call the methods LM surrogate, GAM surrogate, PPR surrogate, RF surrogate, and CatBoost surrogate, and they take the following forms:

  • LM surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{j = 1}^{2\,M} \beta _{j}\tilde{x}_{{\mathcal {S}}, j} = \tilde{\varvec{x}}_{\mathcal {S}}^{T}\varvec{\beta }\).

  • GAM surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{j = 1}^{M} g_{j}(\tilde{x}_{{\mathcal {S}}, j}) + \sum _{j = M+1}^{2\,M} \beta _{j}\tilde{x}_{{\mathcal {S}}, j}\). That is, we add nonlinear effect functions to the augmented features \(\hat{\varvec{x}}_{{\mathcal {S}}} = \varvec{x} \circ I({\mathcal {S}})\) in \(\tilde{\varvec{x}}_{{\mathcal {S}}}\) while letting the binary mask indicators in \(\tilde{\varvec{x}}_{{\mathcal {S}}}\) be linear.

  • PPR surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}}) = \beta _{0} + \sum _{l=1}^L g_{l}(\varvec{\beta }_{l}^T\tilde{\varvec{x}}_{{\mathcal {S}}})\), where \(g_{l}\) and \(\varvec{\beta }_{l}\) are the lth ridge function and parameter vector, respectively.

  • RF surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}})\) is an RF model fitted to the augmented data of the same form as in (5).

  • CatBoost surrogate: \(g(\tilde{\varvec{x}}_{{\mathcal {S}}})\) is a CatBoost model fitted to the augmented data of the same form as in (5).

3.6.2 Surrogate regression: neural networks

The surrogate regression neural network (NN-Frye surrogate) approach in Frye et al. (2021) differs from our general setup above in that they do not train the model on the complete augmented data. Instead, for each observation in every batch in the training process, they randomly sample a coalition \({\mathcal {S}}\) with probability \(\frac{|\mathcal {S}|!(M-|\mathcal {S}|-1)!}{M!}\). Then they set the masked entries of the observation, i.e., the features not in \({\mathcal {S}}\), to an off-distribution value not present in the data. Furthermore, they do not concatenate the masks to the data, as we do in (5).

We propose an additional neural network (NN-Olsen surrogate) approach to illustrate that one can improve on the NN-Frye surrogate method. The main conceptual differences between the methods are the following. First, for each batch, we generate a missing completely at random (MCAR) mask with paired sampling. MCAR means that the binary entries in the mask are Bernoulli distributed with probability 0.5, which ensures that all coalitions are equally likely to be considered. Further, paired sampling means that we duplicate the observations in the batch and apply the complement mask, \(\bar{\mathcal {S}}\), to these duplicates. This ensures more stable training, as the network can associate both \(\varvec{x}_{{\mathcal {S}}}\) and \(\varvec{x}_{\bar{\mathcal {S}}}\) with the response \(f(\varvec{x})\). Second, we set the masked entries to zero and include the binary mask entries as additional features, as done in (5) and Olsen et al. (2022). This enables the network to distinguish between actual zeros in the data set and zeros induced by the masking, removing the need for an off-distribution masking value. Additional differences due to implementation, for example, network architecture and optimization routine, are elaborated in Appendix A.
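The paired MCAR masking can be sketched in a few lines of R; the surrounding batch handling (duplicating the observations and feeding the masked inputs to the network) is omitted.

```r
# Sketch of paired MCAR masking: Bernoulli(0.5) masks for a batch, stacked
# with their complements so each observation is seen under both S and S-bar.
paired_mcar_masks <- function(n_batch, M) {
  masks <- matrix(rbinom(n_batch * M, size = 1, prob = 0.5), n_batch, M)
  rbind(masks, 1 - masks)  # the batch itself is duplicated correspondingly
}
```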

3.7 Time complexity

In this section, we elaborate on the time complexity of computing the Shapley value explanations for the different method classes. We provide a simplified overview of the complexities in Table 1 and the corresponding in-depth explanations here. In general, we can decompose the computation time into three components for each method: training, generating, and predicting. The method-independent computation time of setting up and using the Shapley value explanation framework is described in Appendix A.

  • Training: The computation time of the training step depends on several method (class) specific attributes. The independence method is trivial, as no training is needed; hence, its training time is zero. For all other methods, the computation time is affected by the number of features M, the number of training observations \(N_\text {train}\), and the number of models within the method. The number of models is one for the \(\texttt {VAEAC}\) and surrogate regression methods, while it is \(2^M-2\) for the other methods. In the former case, only a single model is trained, but in return, the surrogate regression methods using (5) double the number of features to \(2\,M\) and increase the effective number of training observations by a factor of \(2^M-2\). In the latter case, the number of features in each of the \(2^M-2\) models varies from 1 to M, depending on the coalition \({\mathcal {S}}\), which also alters the computation time. The computation time is also increased if cross-validation is used to tune some or all of the hyperparameters. Thus, the overall training time complexity is \(\mathcal {O}(C_\text {train}2^M)\), where \(C_\text {train}\) is method-specific and depends on the factors discussed above. For example, \(C_\text {train} = 0\) for the trivial independence method. In contrast, estimating a single conditional multivariate normal distribution in the Gaussian method yields \(C_\text {train} = \mathcal {O}(M^2(M+N_\text {train}))\), and the same holds for training a single non-cross-validated linear model in the LM separate method. For the RF separate method, \(C_\text {train} = \mathcal {O}(MN_\text {trees}N_\text {train}\log _2N_\text {train})\), where \(N_\text {trees}\) is the number of balanced trees in the forest. Thus, the complexities of these four methods are constant, linear, linear, and log-linear in the number of training observations \(N_\text {train}\), respectively.

  • Generating: The computation time of the generating step only applies to the Monte Carlo-based methods, as the regression-based methods do not generate any Monte Carlo samples. The time needed to generate an \(|\bar{\mathcal {S}}|\)-dimensional Monte Carlo sample can vary between coalitions due to the coalition size. The time complexity of generating the K Monte Carlo samples for the \(2^M-2\) coalitions and \(N_\text {test}\) test observations is \(\mathcal {O}(C_\text {MC} K N_\text {test} 2^M)\), where \(C_\text {MC}\) is method-specific and represents the generation of one Monte Carlo sample. For example, \(C_\text {MC}\) corresponds to sampling one observation from the training data in the independence method, which is done in constant time, that is, \(C_\text {MC} = \mathcal {O}(1)\). In the Gaussian method, \(C_\text {MC}\) represents the cost of generating standard Gaussian data and converting it to the associated multivariate conditional Gaussian distribution using the Cholesky decomposition of the conditional covariance matrix \({\varvec{\Sigma }}_{\bar{\mathcal {S}} | {\mathcal {S}}}\), which is \(\mathcal {O}(M^3)\). Note, however, that the cost of the Cholesky decomposition can be shared among all K Monte Carlo samples and \(N_\text {test}\) test observations. The computation of the Cholesky decomposition could also have been considered part of the training step.

  • Predicting: The computation time of the predicting step varies between the Monte Carlo and regression paradigms due to their conceptually different techniques for computing \(v({\mathcal {S}})\). The Monte Carlo paradigm computes the contribution function \(v({\mathcal {S}})\) based on (3), while the regression paradigm uses (4). The Monte Carlo paradigm relies on averaging K calls to the predictive model f for each of the \(2^M-2\) coalitions and \(N_\text {test}\) test observations. Thus, the overall predicting time complexity for the Monte Carlo paradigm is \(\mathcal {O}(C_f K N_\text {test} 2^M)\), where \(C_f\) represents the computation time of calling f once. In the regression paradigm, the value of the contribution function is directly estimated as the output of a regression model g. Thus, the overall predicting time complexity for the regression paradigm is \(\mathcal {O}(C_g N_\text {test} 2^M)\), where \(C_g\) represents the computation time of calling g once. Both \(C_f\) and \(C_g\) are influenced by, e.g., the number of features M and training observations \(N_\text {train}\), but also by the intricacy of the predictive and regression model, respectively. For example, the time complexity of calling a linear regression model with M features once is \(\mathcal {O}(M)\), while it is \(\mathcal {O}(N_\text {trees}\log _2N_\text {train})\) for a random forest model with \(N_\text {trees}\) balanced trees.

Table 1 A simplified overview of the time complexities of the paradigm/method classes

3.8 Additional methods in the supplement

In addition to the methods described in Sects. 3.1, 3.2, 3.3, 3.4, 3.5, and 3.6, we include dozens more generative, separate regression, and surrogate regression methods in the Supplement. These methods are not included in the main text as they generally perform worse than the introduced methods. For the generative method class, we consider three additional \(\texttt {VAEAC}\) approaches with methodological differences and point to eleven other potential generative methods. For the separate regression method class, we consider twenty other regression models, and most of these are also applicable to the surrogate regression method class. Among the regression methods are linear regression with interactions, polynomial regression with and without interactions, elastic nets, generalized additive models, principal component regression, partial least squares, K-nearest neighbors, support vector machines, decision trees, boosting, and neural networks. In the Supplement, we apply the additional methods to the numerical simulation studies and real-world data experiments conducted in Sects. 4 and 5, respectively.

4 Numerical simulation studies

A major problem when evaluating explanation frameworks is that there is no ground truth for authentic real-world data. In this section, we simulate data for which we can compute the true Shapley values \({\varvec{\phi }}_\texttt {true}\) and measure how close the estimated Shapley values \(\hat{{\varvec{\phi }}}_\texttt {q}\) of approach \(\texttt {q}\) are to them. We gradually increase the complexity of the setups in the simulation studies to uncover in which settings the different methods described in Sect. 3 perform the best and should be used. Additionally, as we focus on conditional Shapley values, we vary the dependencies between the features within each simulation setup to investigate how the methods cope with different dependence levels.

In all experiments, we generate univariate prediction problems with \(M = 8\)-dimensional features simulated from a multivariate Gaussian distribution \(p(\varvec{x}) = \mathcal {N}_{8}(\varvec{0}, \Sigma )\), where \(\Sigma _{ij} = \rho ^{\left| i-j \right| }\) for \(\rho \in \left\{ 0, 0.3, 0.5, 0.9 \right\} \) and 1 on the diagonal. Larger values of \(\rho \) correspond to higher dependencies between the features. Higher feature dimensions are possible, but we chose \(M=8\) to keep the computation time of the simulation studies feasible. The real-world data sets in Sect. 5 contain more features. In Sect. 2.2.3, we discuss approximation strategies used in the literature to compute Shapley value explanations in higher dimensions.

We let the number of training observations be \(N_\text {train} = 1000\), while we explain \(N_\text {test} = 250\) test observations. Thus, the training data set is \(\left\{ \varvec{x}^{[i]}, y^{[i]} \right\} _{i=1}^{N_{\text {train}}}\), where \(\varvec{x}^{[i]} \sim \mathcal {N}_{8}(\varvec{0}, \Sigma )\) and the response \(y^{[i]} = f_\texttt {true}(\varvec{x}^{[i]}) + \varepsilon ^{[i]}\). The function \(f_\texttt {true}\) is different in different experiments and \(\varepsilon ^{[i]} \sim \mathcal {N}(0, 1)\). The test data sets are created by the same procedure. We provide additional experiments with other settings and some illustrative plots of the data in the Supplement.
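For reference, a minimal R sketch of this data-generating process, with a placeholder linear \(f_\texttt {true}\) standing in for the experiment-specific functions defined below:

```r
# Sketch of the simulation setup: M = 8 Gaussian features with covariance
# Sigma_ij = rho^|i - j|, and a response y = f_true(x) + noise. The linear
# f_true here is a placeholder; the experiments use the functions below.
M <- 8; N_train <- 1000; rho <- 0.5          # one of the dependence levels
Sigma <- rho^abs(outer(1:M, 1:M, "-"))
X_train <- matrix(rnorm(N_train * M), N_train, M) %*% chol(Sigma)
f_true <- function(X) drop(X %*% rep(1, M))  # placeholder linear model
y_train <- f_true(X_train) + rnorm(N_train)
```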

We evaluate the performance of the different approaches by computing the mean absolute error (MAE) between the true and estimated Shapley values, averaged over all test observations and features. This criterion has been used in Redelmeier et al. (2020), Aas et al. (2021a), Aas et al. (2021b), Olsen et al. (2022). The MAE is given by

$$\begin{aligned} {\text {MAE}} = {\text {MAE}}_{\phi }(\text {method } \texttt {q}) = \frac{1}{N_\text {test}} \sum _{i=1}^{N_\text {test}} \frac{1}{M} \sum _{j=1}^M |\phi _{j, \texttt {true}}(\varvec{x}^{[i]}) - \hat{\phi }_{j, \texttt {q}}(\varvec{x}^{[i]})|. \end{aligned}$$
(6)
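Equation (6) translates directly into code; phi_true and phi_hat are assumed \(N_\text {test} \times M\) matrices of true and estimated Shapley values:

```r
# Direct translation of (6): the mean absolute error over all test
# observations and features (excluding phi_0).
mae <- function(phi_true, phi_hat) mean(abs(phi_true - phi_hat))
```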

The true Shapley values are in general unknown, but we can compute them with arbitrary precision in our setup with multivariate Gaussian distributed data, as the conditional distributions are analytically known and samplable for all \({\mathcal {S}}\). Thus, by sampling \(\varvec{x}_{\bar{\mathcal {S}}}^{(k)} \sim p(\varvec{x}_{\bar{\mathcal {S}}} \,|\, {\varvec{x}_{\mathcal {S}}}= {\varvec{x}_{\mathcal {S}}^*})\), we can compute the true contribution function \(v_\texttt {true}({\mathcal {S}})\) in (2) by using (3). The true Shapley values are then obtained by inserting the \(v_\texttt {true}({\mathcal {S}})\) quantities into the Shapley value formula in (1). The \(v_\texttt {true}({\mathcal {S}})\) quantities can be made arbitrarily precise by choosing a sufficiently large number of Monte Carlo samples, e.g., \(K =\) 10,000.

Fig. 2 Results of the lm_no_interactions experiment: boxplots of the mean absolute error between the true and estimated Shapley values for the test observations using different methods and for different dependence levels \(\rho \)

4.1 Linear regression models

The first simulation setup should be considered a sanity check, as we generate the response \(y^{[i]}\) according to the following linear regression models:

  • lm_no_interactions: \(f_{\text {lm}, \text {no}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{M} \beta _jx_j\),

  • lm_more_interactions: \(f_{\text {lm}, \text {more}}(\varvec{x}) = f_{\text {lm}, \text {no}}(\varvec{x}) + \gamma _1x_1x_2 + \gamma _2x_3x_4\),

  • lm_numerous_interactions:

    \(f_{\text {lm}, \text {numerous}}(\varvec{x}) = f_{\text {lm}, \text {more}}(\varvec{x}) + \gamma _3x_5x_6 + \gamma _4x_7x_8\),

where \(\varvec{\beta } = \{1.0, 0.2, -\,0.8, 1.0, 0.5, -\,0.8, 0.6, -\,0.7, -\,0.6\}\) and \(\varvec{\gamma } = \{0.8, -\,1.0, -\,2.0, 1.5\}\). For each setup, we fit a predictive linear model f with the same form as the true model. For example, in the lm_more_interactions setup, the predictive linear model f has eight linear terms and two interaction terms reflecting the form of \(f_{\text {lm}, \text {more}}\). We fit the predictive models using the lm function in base R.

Fig. 3: Results of the lm_more_interactions experiment

In Figs. 2, 3, and 4, we show the MAE for each test observation (i.e., the absolute error averaged only over the features), and we keep the methods in the same order throughout the figures. In what follows, we briefly summarize the results of the different simulation setups.

  • lm_no_interactions (Fig. 2): For \(\rho = 0\), we see that ctree and LM surrogate perform the best. The independence approach, which makes the correct feature independence assumption, is close behind. For \(\rho > 0\), the parametric and separate regression (LM, GAM, and PPR) methods generally perform the best. In particular, the LM separate method, which makes the correct model assumption, is the best-performing approach. The generative and empirical approaches form the mid-field, while the surrogate regression and independence methods seem to be the least precise.

  • lm_more_interactions (Fig. 3): In this case, the LM separate method performs poorly, which is reasonable due to the incorrect model assumption. For \(\rho = 0\), the ctree approach is the most accurate, but the independence and parametric methods are close behind. For \(\rho > 0\), the parametric methods are clearly the best approaches as they make the correct parametric assumption. The PPR separate method performs very well, and the generative approaches are almost on par for moderate correlation. The NN-Olsen surrogate method is the most accurate surrogate regression approach. In general, the separate regression methods perform better as \(\rho \) increases due to the simpler regression problems/prediction tasks. The performance of the GAM separate method particularly improves for larger values of \(\rho \).

  • lm_numerous_interactions (Fig. 4): The overall tendencies are very similar to those in the lm_more_interactions experiment. The parametric methods are by far the most accurate. Further, ctree is the best generative approach, the NN-Olsen surrogate is the best surrogate regression method, and the PPR separate method is the best separate regression approach.

In the boxplots, the box and midline represent the interquartile range (IQR) and median, respectively, while the dots denote outliers beyond the whisker boundaries, i.e., observations with MAE scores below the lower quartile minus 1.5 times the IQR or above the upper quartile plus 1.5 times the IQR. Olsen (2023) investigates the underlying structure of the outliers and finds that the same test observations often constitute the outliers for the different methods. Crucially, this means that accurate Shapley values are intrinsically more challenging to compute for some predictions. Furthermore, Olsen (2023) illustrates that these observations often lie in the outer regions of the training data, where less data was available to learn the dependence structures between the features.

Fig. 4: Results of the lm_numerous_interactions experiment

4.2 Generalized additive models

In this section, we first investigate situations with a nonlinear relationship between the features and the response, while we later also include pairwise nonlinear interaction terms. More specifically, we first gradually progress from the lm_no_interactions model to a full generalized additive model by applying the nonlinear function \({\text {cos}}(x_j)\) to a subset of the features in \(\varvec{x}\). Then, we extend the full generalized additive model by also including pairwise nonlinear interaction terms of the form \(g(x_j,x_k) = x_jx_k + x_jx_k^2 + x_kx_j^2\). We generate the features \(\varvec{x}^{[i]}\) as before, but the response value \(y^{[i]}\) is now generated according to:

  • gam_three: \(f_{\text {gam}, \text {three}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{3}\beta _j\cos (x_j) + \sum _{j=4}^{M} \beta _jx_j\),

  • gam_all: \(f_{\text {gam}, \text {all}}(\varvec{x}) = \beta _0 + \sum _{j=1}^{M}\beta _j\cos (x_j)\),

  • gam_more_interactions:

    \(f_{\text {gam}, \text {more}}(\varvec{x}) = f_{\text {gam}, \text {all}}(\varvec{x}) + \gamma _1g(x_1, x_2) + \gamma _2g(x_3, x_4)\),

  • gam_numerous_interactions:

    \(f_{\text {gam}, \text {numerous}}(\varvec{x}) = f_{\text {gam}, \text {more}}(\varvec{x}) + \gamma _3g(x_5, x_6) + \gamma _4g(x_7, x_8)\),

where we let \(\varvec{\beta } = \left\{ 1.0, 0.2, -\,0.8, 1.0, 0.5, -\,0.8, 0.6, -\,0.7, -\,0.6 \right\} \) and \(\varvec{\gamma } = \left\{ 0.8, -\,1.0, -\,2.0, 1.5 \right\} \), i.e., the same coefficients as in Sect. 4.1.

Fig. 5: Results of the gam_three experiment

Fig. 6: Results of the gam_all experiment

As the true models contain smooth nonlinear effects and smooth pairwise nonlinear interaction terms, we let the corresponding predictive models be GAMs with splines for the nonlinear terms and tensor product smooths for the nonlinear interaction terms. For example, in the gam_three experiment, the fitted predictive model f uses splines on the first three features, while the remaining features enter linearly. In the gam_more_interactions experiment, f uses splines on all eight features and tensor product smooths on the two nonlinear interaction terms. We fit the predictive models using the mgcv package with default parameters (Wood 2006a, 2022); a minimal sketch of such fits is given below.
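The following sketch assumes the mgcv package; the exact smooth specifications (e.g., basis dimensions) are illustrative and may differ from the fits used in the experiments:

```r
library(mgcv)

# gam_three: splines on the first three features, the rest linear
fit_gam_three <- gam(y ~ s(x1) + s(x2) + s(x3) + x4 + x5 + x6 + x7 + x8,
                     data = df)

# gam_more_interactions: splines on all features plus tensor product
# interaction smooths; ti() excludes the main effects already captured by s()
fit_gam_more <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4) + s(x5) + s(x6) +
                      s(x7) + s(x8) + ti(x1, x2) + ti(x3, x4),
                    data = df)
```

In what follows, we briefly summarize the results of the different simulation setups.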

  • gam_three (Fig. 5): In contrast to the lm_no_interactions experiment, we see that the LM separate approach performs much worse than the GAM separate approach, which makes sense as we have moved from a linear to a nonlinear setting. For \(\rho = 0\), we see that ctree and independence are the best approaches. For \(\rho > 0\), the parametric approaches are superior, but the GAM separate approach is not far behind, while the NN-Olsen surrogate method is the best surrogate regression approach.

  • gam_all (Fig. 6): The performance of the LM approaches continues to degrade. The separate regression methods gradually improve for higher values of \(\rho \), but the parametric methods are still superior. The generative methods constitute the second-best class for \(\rho \in \{0.3, 0.5\}\), but the GAM separate and PPR separate approaches are relatively close. The latter approaches outperform the generative methods when \(\rho = 0.9\).

  • gam_more_interactions (Fig. 7): We see similar results to those in the gam_all experiment. The parametric approaches are superior in all settings. The generative methods perform quite well for \(\rho < 0.5\), but they are beaten by the PPR separate method for \(\rho = 0.9\). Note that the GAM separate approach now falls behind the PPR separate approach, as it is not complex enough to model the nonlinear interaction terms. This indicates that complex separate regression approaches are needed to model complex predictive models. Furthermore, the RF surrogate method is on par with or outperforms the NN-based surrogate regression approaches.

  • gam_numerous_interactions (Fig. 8): We get nearly identical results as in the previous experiment. Hence, we do not provide further comments on the results.

Fig. 7: Results of the gam_more_interactions experiment

Fig. 8: Results of the gam_numerous_interactions experiment

4.3 Computation time

Table 2 The CPU times used by the methods to compute Shapley values for the \(N_\text {test} = 250\) test observations in the gam_more_interactions experiment with \(\rho = 0.5\) and \(N_\text {train} = 1000\)

In this section, we discuss the computation time used by the different methods to estimate the Shapley values, as a proper evaluation of the methods should not only be limited to their accuracy. We report the CPU times to get a fair comparison between the approaches, as some methods are parallelized and would, therefore, benefit from multiple cores when it comes to elapsed time. The CPU times for the different methods will vary significantly depending on the operating system, hardware, and implementation. The times we report here are based on an Intel(R) Core(TM) i5-1038NG7 CPU@2.00GHz with 16GB 3733MHz LPDDR4X RAM running R version 4.2.0 on the macOS Ventura (13.0.1) operating system. Throughout this article, we mean CPU time when we discuss time.

In Table 2, we report the time it took to estimate the Shapley values using the different methods in the gam_more_interactions experiment with \(\rho = 0.5\), \(N_\text {train} = 1000\), and \(N_\text {test} = 250\) in Sect. 4.2. We split the total time into the same three time components as in Sect. 3.7. That is, time used training the approaches, time used generating the Monte Carlo samples, and time used predicting the \(v({\mathcal {S}})\) using Monte Carlo integration (including the calls to f) or regression. We denote these three components by training, generating, and predicting, respectively. The matrix multiplication needed to estimate the Shapley values from the estimated contribution functions is almost instantaneous and is part of the predicting time. Furthermore, creating the augmented training data for the surrogate regression methods in Sect. 3.6.1 takes around one second and is part of the training time. We see a wide spread in the times, but, in general, the Monte Carlo approaches take, on average, around half an hour, while the regression methods are either much faster or slower, depending on the approach.

The Monte Carlo methods make a total of \(N_\text {test}K(2^M-2)\) calls to the predictive model f to explain the \(N_\text {test}\) test observations with M features and K Monte Carlo samples. In our setting with \(N_\text {test} = 250\), \(M=8\), and \(K = 250\), the predictive model is called \(N_f =\) 15,875,000 times; thus, the speed of evaluating f greatly affects the explanation time. For example, the GAM model in the gam_more_interactions experiment is slow, as we can see in Table 2, since the predicting time constitutes the majority of the total time. To compare, \(N_f\) calls to the linear model in the lm_more_interactions experiment take approximately 3 CPU seconds, while the GAMs in the gam_three and gam_more_interactions experiments take roughly 13 and 35 CPU minutes, respectively. In the latter experiment, the PPR and RF models in Sect. 4.5 take around 0.5 and 40 CPU minutes, respectively.
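As a quick check of the call count in this setting:

```r
# Number of calls to f made by the Monte Carlo methods
N_test <- 250; K <- 250; M <- 8
N_f <- N_test * K * (2^M - 2)  # 250 * 250 * 254 = 15,875,000
```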

The empirical and ctree approaches have lower predicting time than the other Monte Carlo-based methods due to fewer calls to f since they use weighted Monte Carlo samples; see Sects. 3.2 and 3.4.1. The three influential time factors for the Monte Carlo methods are: the training time of the approach (estimating the parameters), the sampling time of the Monte Carlo samples, and the computational cost of calling f; see Sect. 3.7.

In contrast, both the separate regression and surrogate regression methods use roughly the same time to estimate the Shapley values for different predictive models f, as f is only called \(N_\text {train}\) times when creating the training data sets. After that, we train the separate regression and surrogate regression approaches and use them to directly estimate the contribution functions. The influential factors for the regression methods are the training time of the \(2^M-2\) separate models (or the one surrogate model) and the prediction time of calling them a total of \(N_\text {test}(2^M-2)\) times. The former is the primary factor, and it is influenced by, e.g., hyperparameter tuning and the training data size. The latter can be problematic for the surrogate regression methods due to the size of the augmented training data, as we will see in Sect. 5.4.
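The following hypothetical R sketch illustrates the separate regression idea for a single coalition, using the simplest LM separate variant for brevity; fit (the fitted predictive model), df, and df_test are illustrative names:

```r
# Separate regression for the coalition S = {1, 2}
df$fx <- predict(fit, newdata = df)          # f is only called on the training data
fit_S <- lm(fx ~ x1 + x2, data = df)         # one of the 2^M - 2 separate models
v_S   <- predict(fit_S, newdata = df_test)   # direct estimate of v(S) per test obs.
```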

When excluding the time of the training step, which is only done once and can be considered as an upfront time cost, it is evident that the regression-based methods produce the Shapley value explanations considerably faster than the Monte Carlo-based methods. For example, consider the most accurate Monte Carlo and regression-based methods in the gam_more_interactions experiment with \(\rho = 0.5\), i.e., the Gaussian and PPR separate methods, respectively. The Gaussian approach uses approximately 37 CPU minutes to explain 250 predictions, an average of 8.88 seconds per explanation. In contrast, the PPR separate method explains all the \(N_\text {test} = 250\) predictions in half a second. Thus, the PPR separate method is approximately 4440 times faster than the Gaussian approach per explanation, which is essential for large values of \(N_\text {test}\). However, note that this factor is substantially lower for predictive models that are less computationally expensive to call.

The computation times reported in Table 2 align with the time complexities discussed in Sect. 3.7. However, it is crucial to consider the cost of the cross-validation procedure when evaluating the computation times. For example, the RF separate method is by far the slowest separate regression method, but the large training time is (mainly) due to the method's extensive hyperparameter tuning described in Appendix A. As the number of folds and hyperparameter combinations in the cross-validation procedure is \(N_\text {folds} = 4\) and \(N_\text {hyper} = 12\), respectively, we fit a total of \(N_\text {folds}N_\text {hyper}(2^M-2) = 12\,192\) random forest models. In contrast, the LM separate method directly fits the \(2^M-2\) separate models without any tuning. In the Supplement, we omit the cross-validation procedure and use default hyperparameter values, which significantly reduces the computation time but also the performance. The two slowest methods are the NN-Frye and NN-Olsen surrogate methods, which consider six and nine hyperparameter combinations, respectively. Thus, using default values would reduce their training times by factors of roughly 6 and 9, respectively, but at the cost of precision.

4.4 Number of training observations

To investigate whether the training data size affects the ordering of the methods according to the MAE criterion, we repeat the experiments in Sects. 4.1 and 4.2 with \(N_\text {train} \in \left\{ 100, 5000 \right\} \), and some of them with \(N_\text {train} =\) 20,000. We obtain nearly identical results, except for three distinctions. First, the independence approach becomes relatively more accurate compared to the other methods when \(N_\text {train} = 100\), and worse when \(N_\text {train} \in \left\{ 5000, 20{,}000 \right\} \). This is intuitive, as modeling the data distribution/response is easier when the methods have access to more data. Second, in the simple experiments in Sects. 4.1 and 4.2 with \(N_\text {train} \in \left\{ 5000, 20{,}000 \right\} \), the GAM separate and PPR separate approaches become even better, but they are still beaten by the Gaussian and copula approaches in most experiments. Third, we observe that the MAE tends to decrease when \(N_\text {train}\) increases. However, we cannot directly compare the MAE scores, as they depend on the fitted predictive model f, which changes when \(N_\text {train}\) is adjusted.

4.5 Other choices for the predictive model

Fig. 9: Experiment gam_more_interactions with a PPR as the predictive model

Fig. 10: Experiment gam_more_interactions with an RF as the predictive model

In practice, it might be difficult to identify the pairwise interactions in Sect. 4.2. Hence, one would potentially fit a model without them. We included them above as we knew the data-generating processes and wanted a precise model, but we now pretend otherwise and fit other predictive models. We consider two types of complex black-box predictive models: projection pursuit regression (PPR) and random forest (RF), and we conduct the same type of hyperparameter tuning as in the other experiments. However, we conduct no feature transformations and directly use the original features when fitting the models. These models are less precise than the GAMs in Sect. 4.2, which have an unfair advantage as they use the true formulas. For example, in the gam_more_interactions experiment, the test MSE values were 1.32, 3.67, and 7.36 for the GAM, PPR, and RF models, respectively, where 1 is the theoretical optimum as \({\text {Var}}(\varepsilon ) = 1\).
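A minimal sketch of such fits in R (assuming the randomForest package; the tuning values are placeholders, not the cross-validated ones):

```r
library(randomForest)

fit_ppr <- ppr(y ~ ., data = df, nterms = 3)            # projection pursuit regression
fit_rf  <- randomForest(y ~ ., data = df, ntree = 500)  # random forest
```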

We only include the figures for the gam_more_interactions experiment, as the corresponding figures for the other experiments are almost identical. The results are displayed in Figs. 9 and 10 for the PPR and RF models, respectively, and they are quite similar to those obtained for the GAM model in Fig. 7. In general, the parametric methods are superior, followed by the generative methods, while the empirical, separate regression, and surrogate regression approaches perform worse. Some separate regression approaches perform, however, much better for high dependence. The independence method performs well when \(\rho = 0\), but it gradually degrades as the dependence level increases, as expected. We see that the PPR separate approach performs well for the PPR predictive model, but it is outperformed by the CatBoost separate method for the RF models. These results indicate that, for our experiments, it is beneficial to choose a regression method similar to the predictive model; that is, for a non-smooth model, one should consider using a non-smooth regression method. However, note that the differences in the MAE are minuscule.

4.6 Different data distribution

In the Supplement, we repeat all the experiments described in Sects. 4.1 and 4.2 but with multivariate Burr distributed features instead of Gaussian ones. The Burr distribution allows for heavy-tailed and skewed marginals as well as nonlinear dependence. In this case, the parametric Burr approach, which assumes Burr distributed data, is, not surprisingly, the most accurate. The Gaussian method, which now incorrectly assumes Gaussian distributed data, performs worse. The \(\texttt {VAEAC}\) approach performs very well on the Burr distributed data, which was also observed by Olsen et al. (2022). In general, \(\texttt {VAEAC}\) is the second-best approach after Burr. The PPR separate method also performs well, but compared to the Burr and \(\texttt {VAEAC}\) approaches, it is less precise in the experiments with nonlinear interaction terms.

4.7 Summary of the experiments

Making the correct (or nearly correct) parametric assumption about the data is advantageous, as the corresponding parametric methods significantly outperform the other approaches in most settings. In general, if the distribution is unknown, the second-best option for low to moderate levels of dependence is the generative method class. The separate regression approaches improve relative to the other methods when the feature dependence increases, and for highly dependent features, the PPR separate approach is a prime choice. Furthermore, the separate regression methods that match the form of f often give more accurate Shapley value estimates. The PPR model in the PPR separate approach is simple to fit but is still very flexible and can, therefore, accurately model complex predictive models. The independence approach is accurate for no (or very low) feature dependence, but it is often the worst approach for high feature dependence. The NN-Olsen surrogate method outperforms the NN-Frye surrogate approach in most settings and is generally the best surrogate regression approach.

We found it (often) necessary to conduct some form of cross-validation to tune (most of) the separate regression and surrogate regression methods to make them more competitive. Using default hyperparameter values usually resulted in less accurate Shapley value explanations; see the additional experiments in the Supplement. The hyperparameter tuning can be time-consuming, but it was feasible in our setting with \(M=8\) features and \(N_\text {train} = 1000\) training observations. The regression-based methods use most of their computation time on training, while the predicting step is almost instantaneous for several methods. The opposite holds for the Monte Carlo-based approaches, which are overall slower than most regression-based methods. Hence, we have a trade-off between computation time and Shapley value accuracy in the numerical simulation studies. We did not conduct hyperparameter tuning for the empirical, parametric, and generative methods. Thus, the methods for which we conducted hyperparameter tuning have an unfair advantage regarding the precision of the estimated Shapley values.

5 Real-world data experiments

In this section, we fit several predictive models to different real-world data sets from the UCI Machine Learning Repository and then use the Shapley value explanation framework to explain the models’ predictions. The models range from complex statistical models to black-box machine learning methods. We consider four data sets: Abalone, Diabetes, Wine, and Adult. Some illustrative data plots are provided in the Supplement.

For real-world data sets, the true Shapley values are unknown. Hence, we cannot use the MAE evaluation criterion from Sect. 4 to evaluate and rank the approaches. Instead, we use the \({\text {MSE}}_{v}\) criterion proposed by Frye et al. (2021) and later used by Olsen et al. (2022). The \({\text {MSE}}_{v}\) is given by

$$\begin{aligned} {\text {MSE}}_{v} = {\text {MSE}}_{v}(\text {method } \texttt {q}) = \frac{1}{N_\mathcal {S}} \sum _{{\mathcal {S}}\in {\mathcal {P}}^*(\mathcal {M})} \frac{1}{N_\text {test}} \sum _{i=1}^{N_\text {test}} \left( f(\varvec{x}^{[i]}) - \hat{v}_{\texttt {q}}({\mathcal {S}}, \varvec{x}^{[i]})\right) ^2\!, \end{aligned}$$
(7)

where \(N_\mathcal {S} = |{\mathcal {P}}^*(\mathcal {M})| = 2^M-2\) and \(\hat{v}_{\texttt {q}}\) is the estimated contribution function using method \(\texttt {q}\). The motivation behind the \({\text {MSE}}_{v}\) criterion is that \(\mathbb {E}_{\mathcal {S}}\mathbb {E}_{\varvec{x}} (v_{\texttt {true}}({\mathcal {S}}, \varvec{x}) - \hat{v}_{\texttt {q}}({\mathcal {S}}, \varvec{x}))^2\) can be decomposed as

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\mathcal {S}}\mathbb {E}_{\varvec{x}} (v_{\texttt {true}}({\mathcal {S}}, \varvec{x})- \hat{v}_{\texttt {q}}({\mathcal {S}}, \varvec{x}))^2&= \mathbb {E}_{\mathcal {S}}\mathbb {E}_{\varvec{x}} (f(\varvec{x}) - \hat{v}_{\texttt {q}}({\mathcal {S}}, \varvec{x}))^2 \\&\quad - \mathbb {E}_{\mathcal {S}}\mathbb {E}_{\varvec{x}} (f(\varvec{x})-v_{\texttt {true}}({\mathcal {S}}, \varvec{x}))^2, \end{aligned} \end{aligned}$$
(8)

see Covert et al. (2020, Appendix A). The first term on the right-hand side of (8) can be estimated by (7), while the second term is a fixed (unknown) constant not influenced by the approach q. Thus, a low value of (7) indicates that the estimated contribution function \(\hat{v}_{\texttt {q}}\) is closer to the true counterpart \(v_{\texttt {true}}\) than a high value.
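In practice, (7) is a simple average; a minimal R sketch, where f_test is the vector of predictions \(f(\varvec{x}^{[i]})\) and v_hat an \(N_\text {test} \times N_\mathcal {S}\) matrix of estimated contribution functions (names are illustrative):

```r
# MSE_v: mean over coalitions of the mean squared deviation over test observations
mse_v <- mean(colMeans((f_test - v_hat)^2))
```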

An advantage of the \({\text {MSE}}_{v}\) criterion is that \(v_\texttt {true}\) is not involved. Thus, we can apply it to real-world data sets. However, the criterion has two drawbacks. First, we can only use it to rank the methods and not assess their closeness to the optimum, since the minimum value of the \({\text {MSE}}_{v}\) criterion is unknown. Second, the criterion evaluates the contribution functions and not the Shapley values. The estimates of \(v({\mathcal {S}})\) might overshoot for some coalitions and undershoot for others, and such errors may cancel each other out in the Shapley value formula in (1). Nevertheless, in the numerical simulation studies in Sect. 4, we computed both criteria to compare their orderings empirically. We generally observe a relatively linear relationship between the \({\text {MAE}}\) and \({\text {MSE}}_{v}\) criteria. That is, a method that achieves a low \({\text {MSE}}_{v}\) score also tends to obtain a low \({\text {MAE}}\) score, and vice versa. To illustrate this tendency, we include Fig. 11, where we plot the \({\text {MSE}}_{v}\) criterion against the \({\text {MAE}}\) criterion for the gam_more_interactions experiment with Gaussian distributed data and \(\rho = 0.5\). Note that the orderings under the two criteria are not one-to-one, but they give fairly similar rankings of the methods.

Fig. 11: Illustration of a fairly strong linear relationship between the \({\text {MSE}}_{v}\) and \({\text {MAE}}\) criteria for the gam_more_interactions experiment with Gaussian data and \(\rho = 0.5\)

In Table 3, we report the \({\text {MSE}}_{v}\) scores and CPU times of the different methods for the four data sets. The Abalone, Diabetes, and Wine data sets were run on the same system as specified in Sect. 4.3, while the Adult data set was run on a shared computer server running Red Hat Enterprise Linux 8.5 with two Intel(R) Xeon(R) Gold 6226R CPU@2.90GHz (16 cores, 32 threads each) and 768GB DDR4 RAM, due to memory constraints on the former system. Thus, one should not compare the CPU times across these systems but only the CPU times of the different methods within the same experiment. A more detailed decomposition of the CPU times and results for additional methods are provided in the Supplement.

5.1 Abalone

We first consider the classical Abalone data set with mixed features. The data set originates from a study by the Tasmanian Aquaculture and Fisheries Institute (Nash et al. 1994) and has been used in several XAI papers (Vilone et al. 2020; Aas et al. 2021b; Frye et al. 2021; Olsen et al. 2022). The data set exhibits clear nonlinearity and heteroscedasticity in the pairwise feature relationships, and there is significant pairwise correlation between the features: all continuous features have a pairwise correlation above 0.775, the mean correlation is 0.89, and the maximum is 0.99. Furthermore, all marginals are skewed.

We split the 4177 observations into training (\(75\%\)) and testing (\(25\%\)) data sets. The goal is to predict the age of the abalone based on \(M=8\) easily obtainable features: Length, Diameter, Height, WholeWeight, ShuckedWeight, VisceraWeight, ShellWeight, and Sex. All features are continuous except for Sex, which is a three-level categorical feature (infant, female, male). Thus, the empirical and parametric methods are not applicable. To remedy this, we train two PPR models to act as our predictive models: one based on all features (\(\text {PPR}_\text {all}\)) and another based solely on the continuous features (\(\text {PPR}_\text {cont}\)). We chose the PPR model as it outperformed the other prediction models we fitted (GAM, RF, CatBoost). The test MSE increases from 2.04 to 2.07 when excluding \(\texttt {Sex}\). Cross-validation determined that the number of terms in \(\text {PPR}_\text {all}\) and \(\text {PPR}_\text {cont}\) should be 4 and 7, respectively.

Table 3 The \({\text {MSE}}_v\) scores and CPU times for the methods applied to the real-world data sets in Sect. 5

Table 3 shows that the best approaches for explaining the PPR predictive models are the PPR separate, NN-Olsen surrogate, and \(\texttt {VAEAC}\) methods. For the \(\text {Abalone}_\text {cont}\) data set, the PPR separate and NN-Olsen surrogate methods perform equally well and share first place, but both methods are marginally outperformed by the \(\texttt {VAEAC}\) approach for the \(\text {Abalone}_\text {all}\) data set. However, both the \(\texttt {VAEAC}\) and NN-Olsen surrogate methods are very slow compared to the PPR separate approach. The second-best Monte Carlo-based method for the \(\text {Abalone}_\text {cont}\) data set is the Gaussian copula approach, even though the Abalone data set is far from Gaussian distributed. This is probably because the copula method makes no parametric assumption about the marginal distributions of the data, only about the copula/dependence structure, which makes it more robust than the Gaussian approach.

5.2 Diabetes

The Diabetes data set stems from Efron et al. (2004) and contains \(M=10\) baseline features: Age, Sex, BMI, BP (blood pressure), and six blood serum measurements (S1, S2, S3, S4, S5, S6) obtained from 442 diabetes patients. The response of interest is a quantitative measure of disease progression one year after the baseline. Like Efron et al. (2004), we treat Sex as numerical and standardize all features; hence, we can apply all methods. Many features are strongly correlated, with a mean absolute correlation of 0.35 and a maximum of 0.90. The Age feature is the least correlated with the other features. Most scatter plots and marginal density functions display structures and marginals somewhat similar to the Gaussian distribution, except those related to the S4 feature, which has a multi-modal marginal. We split the data into training and test data sets at a 75–25 ratio, and we let the predictive model be a principal component regression (PCR) model with six principal components. This model outperformed the linear model and cross-validated random forest, XGBoost, CatBoost, PPR, and NN models in prediction error on the test data. The PCR model is not easily interpretable, as it does not directly depend on the features but on their principal components.
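A PCR fit of this kind can be sketched in R (assuming the pls package; df_diabetes and the settings are illustrative, with the features standardized beforehand as described above):

```r
library(pls)

# Principal component regression with six components
fit_pcr <- pcr(y ~ ., ncomp = 6, data = df_diabetes, scale = FALSE)
```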

Table 3 shows that the LM separate, GAM separate, and PPR separate methods obtain the lowest \({\text {MSE}}_v\) scores, with the \(\texttt {VAEAC}\), Gaussian, and copula approaches having nearly as low \({\text {MSE}}_v\) scores. We are not surprised that the latter two methods are competitive due to the Gaussian-like structures in the Diabetes data set. The LM separate method is the fastest approach, with a CPU time of 1.9 seconds.

5.3 Red wine

The Red Wine data set contains information about variants of the Portuguese Vinho Verde wine (Cortez et al. 2009). The response is a quality score between 0 and 10, while the \(M = 11\) continuous features are based on physicochemical tests: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. Most scatter plots and marginal density functions display structures and marginals far from the Gaussian distribution, as most of the marginals are right-skewed. Many of the features have negligible to moderate correlation, with a mean absolute correlation of 0.20, while the largest absolute correlation is 0.683, between pH and fix_acid. The data set contains 1599 wines, and we split it into a training (1349) and a test (250) data set. A cross-validated XGBoost model and a random forest with 500 trees perform equally well on the test data, and we use the latter as the predictive model f.

Table 3 shows that the RF separate approach is the best method by far. Next, we have the CatBoost separate, RF surrogate, empirical, and \(\texttt {VAEAC}\) methods. The RF surrogate and CatBoost surrogate perform well compared to the other surrogate regression methods. The good performance of the non-smooth RF separate and CatBoost separate methods on the non-smooth predictive model f supports our findings from the simulation studies, where we observed that using a separate regression method with the same form as f was beneficial. The generative methods perform better than the GH and copula methods, while the Gaussian method falls behind. This is intuitive as the data distribution of the Red Wine data set is far from the Gaussian distribution.

5.4 Adult

The Adult data set is based on the 1994 Census database, and the goal is to predict whether a person makes over $50,000 a year based on \(M=14\) mixed features: age (cont.), workclass (7 cat.), fnlwgt (cont.), education (16 cat.), education-num (cont.), marital-status (7 cat.), occupation (14 cat.), relationship (6 cat.), race (5 cat.), sex (2 cat.), capital-gain (cont.), capital-loss (cont.), hours-per-week (cont.), and native-country (41 cat.). The pairwise Pearson correlation coefficients for the continuous features are all close to zero, with a mean absolute correlation of 0.06. The data set contains 30,162 individuals, and we split it into a training (30,000) and a test (162) data set. We train a CatBoost model on the training data to predict an individual's probability of making over $50,000 a year and use the test data to compute the evaluation criterion. We used a relatively small test set due to memory constraints, and we chose CatBoost as it outperformed the other prediction models we fitted (LM, GAM, RF, NN).

Table 3 shows that the best method is the CatBoost separate approach, while second place is shared by the RF separate and \(\texttt {VAEAC}\) methods. Note that the differences in the \({\text {MSE}}_v\) scores are very small. As in the previous experiments, we observe that using a separate regression method with the same form as f is beneficial. The ctree approach supports mixed data, but we deemed it infeasible due to a very long computation time. Furthermore, the surrogate regression methods based on (5) ran out of memory, as \(\mathcal {X}_\text {aug}\) consists of 30,000 \(\times \, (2^{14}-2) =\) 491,460,000 training observations.

In Table 4, we decompose the Adult computation times reported in Table 3 in the same manner as in Sect. 4.3. We see that the training times dwarf the predicting times for the regression-based methods, while the opposite holds for the Monte Carlo-based methods when including the generating times. Recall that the training time is a one-time upfront cost. In contrast, the generating and predicting times are more interesting as they express the time needed to compute future Shapley value explanations. In particular, the sum of the generating and predicting times divided by \(N_\text {test} = 162\) yields an estimate of the time needed to explain one new prediction in the future.

The long training time of the RF separate method is due to the hyperparameter tuning, as discussed in Sect. 4.3. In the Supplement, we include a random forest method without hyperparameter tuning but instead with default values and a lower number of trees, denoted by RF-def separate. It obtains an \({\text {MSE}}_v = 0.028\) with a training time of 12:49:17.0, i.e., a \(99.5\%\) reduction in the training time. The competitive \({\text {MSE}}_v\) illustrates that hyperparameter tuning was not essential in this experiment, unlike in the simulation studies where hyperparameter tuning was crucial for obtaining a competitive method. Furthermore, as discussed in Sect. 2.2.3, using an approximation strategy with \(N_{\mathcal {S}}< 2^M-2\) coalitions would also speed up the computations. This could also make the other surrogate regression methods applicable, as the size of the augmented training data would be reduced by a factor of \(N_{\mathcal {S}}/(2^M-2)\).
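The size reduction is easy to quantify; a small worked example (with \(N_{\mathcal {S}} = 1000\) as an arbitrary illustrative choice):

```r
# Augmented training data size for the surrogate methods on the Adult data set
N_train <- 30000; M <- 14
nrow_full <- N_train * (2^M - 2)  # 491,460,000 rows

# Sampling N_S coalitions shrinks it by the factor N_S / (2^M - 2)
N_S <- 1000
nrow_reduced <- N_train * N_S     # 30,000,000 rows
```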

Table 4 Decomposed computation times for the Adult data set experiment with \(M = 14\), \(N_\text {train} =\) 30,000, and \(N_\text {test} = 162\)

6 Recommendations

In this section, we propose a list of recommendations for when to use the different methods and method classes, based on the results of the simulation studies and the real-world data experiments. The list is not exhaustive; hence, it should not be interpreted as definite rules but as guidance and points to consider when using conditional Shapley values for model explanation.

  1. For data sets with no or minuscule feature dependencies, the independence approach is the simplest method to use.

  2. In general, a parametric approach with the correct (or nearly correct) parametric assumption about the data distribution generates the most accurate Shapley values.

    • The copula method makes no assumption about the marginals of the data, only about the copula/dependence structure, which makes it a more robust method.

    • For features that do not fit the assumed distribution in the parametric approach, one can consider transformations, for example, power transformations, to make the data more Gaussian-like distributed.

    • For categorical features, one can use, e.g., encodings or entity embeddings to represent the categorical features as numerical. This is needed, as no directly applicable multivariate distribution exists for mixed data. However, there exist copulas that support mixed data.

    • If the parametric methods are not applicable, the next best option is (often) a generative or separate regression method, where all considered approaches support mixed data sets by default.

  3. For the separate and surrogate regression methods, using a method with the same form as the predictive model f provides more precise Shapley value estimates.

    • For some predictive models, e.g., the linear regression model in Fig. 2, we know that the true conditional model is also a linear model. Thus, using a regression method that can model a linear model (e.g., lm, GAM, PPR) produces more accurate Shapley values. However, the form of the true conditional model is usually unknown for most predictive models.

    • It is important that the regression method used is flexible enough to estimate/model the predictive model f properly.

    • In the numerical simulation studies, the separate regression methods performed relatively better compared to the other method classes for higher feature dependence. In the real-world experiments, the separate regression methods were also (among) the best approaches on data sets with moderate dependence.

    • In general, conducting hyperparameter tuning of the regression methods improves the precision of the produced explanations, but this increases the computation time.

    • In the simulation studies, a \(\texttt {PPR separate}\) approach with fixed \(L = |{\mathcal {S}}|\) (often) provides fast and accurate Shapley value explanations; see the Supplement.

  4. The modeling of the conditional distributions in the Monte Carlo-based methods is independent of the predictive model f.

    • For popular data sets, one can fine-tune an empirical, parametric, or generative method and let other researchers reuse the method to estimate Shapley values for their own predictive models.

    • If a researcher is to explain several predictive models fitted to the same data, then reusing the generated Monte Carlo samples will save computation time.

  5. There is a time-accuracy trade-off between the different method classes and approaches.

    • The simplest separate and surrogate regression methods are rapidly trained, while the complex methods are time-consuming. This is, however, a one-time upfront time cost. In return, all regression-based methods produce the Shapley value explanations almost instantly. Thus, developers can develop the predictive model f simultaneously with a suitable regression-based method and deploy them together. The user of f will then get predictions and explanations almost instantaneously.

    • In contrast, several of the Monte Carlo-based methods are trained considerably faster than many of the regression-based methods but are, in return, substantially slower at producing the Shapley value explanations. Generating Monte Carlo samples and using them to estimate the Shapley values for new predictions are computationally expensive and cannot be done in the development phase. Thus, the Monte Carlo-based methods cannot produce explanations in real-time.

    • If the predictive model f is computationally expensive to call, then the Monte Carlo-based methods will be extra time-consuming due to \(\mathcal {O}(KN2^M)\) calls to f. Here, K, N, and M are the number of Monte Carlo samples, predictions to explain, and features, respectively. In contrast, the separate and surrogate regression methods make only \(\mathcal {O}(N2^M)\) calls to their fitted regression model(s).

    • The regression-based methods can be computationally tractable when the Monte Carlo-based methods are not, for example, when N is large. We can reduce the time by decreasing the number of Monte Carlo samples K, but this results in less stable and accurate Shapley value explanations.

    • If accurate Shapley values are essential, then a suitable parametric or generative method, or a separate regression approach with the same form as f, yields accurate estimates, with the best choice depending on the dependence level. The NN-Olsen surrogate method also provided accurate Shapley values for some real-world data sets. Furthermore, hyperparameter tuning should be conducted for extra accuracy.

    • If coarsely estimated Shapley values are acceptable, then some of the simple separate regression methods can be trained and produce estimates almost immediately, such as LM separate. The PPR separate approach with fixed \(L = |{\mathcal {S}}|\) is often a fair trade-off between time and accuracy, especially for smooth predictive functions.

  6. The number of training observations \(N_\text {train}\) did not significantly affect the method classes' overall ordering in our simulation studies. However, individual approaches, such as the PPR separate, performed even better when trained on more training observations.

  7. All method classes benefit from having access to multiple CPUs when properly implemented. For example, the Monte Carlo-based approaches can generate the samples for different coalitions and test observations on different cores, and likewise when predicting the responses. A separate regression method can train the individual models in parallel, while a surrogate regression method can cross-validate the model's hyperparameters on different cores.

  8. For high-dimensional settings, fitting the \(2^M-2\) models required by the separate regression class becomes infeasible. Then, the surrogate regression methods and the \(\texttt {VAEAC}\) approach with arbitrary conditioning can be useful. However, their accuracy will likely also decrease in higher dimensions. In high-dimensional settings, one can, e.g., group the features into relevant groups (Jullum et al. 2021) or use approximation strategies to simplify the Shapley value computations, as described in Sect. 2.2.3.

7 Conclusion

In this article, we have discussed a broad selection of Monte Carlo-based and regression-based methods for estimating conditional Shapley values for model explanation. In agreement with the literature (Covert et al. 2021; Chen et al. 2022), we have divided the studied methods into six method classes. For each class, we have given an overview of the idea, reviewed earlier proposed methods within the class, and proposed and developed several new approaches for most classes. The existing and novel approaches have been systematically evaluated through a series of simulation studies of increasing complexity, as such an evaluation has until now been lacking in the field of conditional Shapley values (Chen et al. 2022). We also conducted several experiments on real-world data sets from the UCI Machine Learning Repository. The rankings of the method classes and approaches differed slightly between the numerical simulation studies and the real-world experiments.

The most accurate Shapley value explanations in the simulation studies were generally produced by a parametric method with a correctly (or nearly correctly) assumed data distribution. This is intuitive, as making a correct parametric assumption is advantageous throughout statistics. However, the true data distribution is seldom known, e.g., for real-world data sets. In the simulation studies with moderate feature dependence levels, the second-best method class was generally the generative class with the ctree and \(\texttt {VAEAC}\) methods, which outperformed the independence, empirical, separate regression, and surrogate regression methods. For high feature dependence, the separate regression methods improved relative to the other classes, particularly the PPR separate method. Using a separate regression method with the same form as the predictive model proved beneficial.

In the real-world experiments, the parametric methods fell behind the best approaches, except for the simplest data set with Gaussian-like structures. In general, the best approaches in the real-world data set experiments belong to the separate regression method class and have the same form as the predictive model. However, the NN-Olsen surrogate method tied with the best separate regression method in one experiment, and the \(\texttt {VAEAC}\) approach was marginally more precise in another. The second-best method class varied across the data sets, with every method class except the independence and empirical classes taking at least one second place.

In addition to the accuracy of the methods, we also investigated their computation time. The regression-based methods are often slow to train, but they produce the Shapley value explanations almost instantaneously. In contrast, the Monte Carlo-based methods are often faster to train but drastically slower at generating the Shapley value explanations. Finally, we gave some recommendations and considerations for when to use the different method classes and approaches.

In further work, one direction is to extend the investigation into higher dimensions to verify that the tendencies and order of the methods we discovered remain. However, one would then probably need to sample a subset of the coalitions to cope with the exponential complexity of Shapley values. In agreement with Chen et al. (2022), one can also try to determine robust architectures, training procedures, and hyperparameter optimization for the generative and surrogate regression methods, investigate how non-optimal approaches change the estimated conditional Shapley values, and finally evaluate bias in estimated conditional Shapley values for data with known conditional distributions.