A key aspect in the development and testing of psychological theories of cognition is the increasing reliance on formal modeling approaches (for an introduction, see Lewandowsky & Farrell, 2010). Through formal models, researchers can characterize the observed data in terms of latent cognitive processes such as attention, memory retrieval, or response biases. Associated with each of these processes are parameters that determine their expression and role: For example, consider Schweickert’s (1993) model of short-term memory retrieval illustrated in Fig. 1. This model postulates a parameter I quantifying the probability that the representation of a studied word in short-term memory is intact, and a parameter R quantifying the probability that, in the absence of an intact representation, the word can be successfully redintegrated. According to this model, the probability of an item being correctly recalled is I + (1 − I) × R (the item is intact or can be successfully redintegrated), whereas the probability of an incorrect recall is (1 − I) × (1 − R) (the item is not intact and redintegration is not successful).

Fig. 1

Schweickert’s (1993) redintegration model

When specifying the multiple components within a given model, researchers face several challenges. Among them is the need to ensure that model parameters are identifiable (Bamber & van Santen, 1985; Moran, 2016). Broadly speaking, identifiability concerns the notion that each combination of parameter values (e.g., I and R) in a model yields a unique set of expectations. When the parameters of a model are identifiable, the modeler can be sure that there is a unique set of parameters providing the best match between the model’s expectations and the data. The only limitation then is the informativeness of the data. Unfortunately, this is not the case in Schweickert’s model, as its parameters are not identifiable: The same recall probabilities can be obtained by trading off I and R. For example, a correct-recall probability of .90 can be obtained with an infinite range of {I,R} pairs, from {I = .90, R = .00}, which states that items are very likely to have an intact representation but there is no hope for redintegration, to {I = .00, R = .90}, which states that there are no intact memory representations but redintegration is nevertheless highly likely. In order to address the non-identifiability of these parameters, researchers have relied on complex experimental designs through which principled constraints can be imposed (Hulme, Roodenrys, Schweickert, Brown, et al., 1997; Buchner & Erdfelder, 2005; Schweickert, 1993). In general, the identifiability of parameters will be a function of several factors, such as the model class, the number of parameters to be estimated, the parametric assumptions made, the data to which the models are fit, and the method used to fit the models (Bamber & van Santen, 1985; Batchelder & Riefer, 1990; Moran, 2016; Ahn, Krawitz, Kim, Busemeyer, & Brown, 2011; Wetzels, Vandekerckhove, Tuerlinckx, & Wagenmakers, 2010).
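
To make this trade-off concrete, the following minimal sketch (ours, purely for illustration; not part of Schweickert's original treatment) enumerates {I, R} pairs that all produce the same correct-recall probability:

```python
import numpy as np

def p_correct(I, R):
    """Correct-recall probability in Schweickert's (1993) model: I + (1 - I) * R."""
    return I + (1 - I) * R

# Any I between 0 and .90 can be paired with an R that yields p_correct = .90,
# so correct-recall rates alone cannot distinguish between these accounts.
target = 0.90
for I in np.linspace(0.0, target, 4):
    R = (target - I) / (1 - I)  # solve I + (1 - I) * R = target for R
    print(f"I = {I:.2f}, R = {R:.2f}, p(correct) = {p_correct(I, R):.2f}")
```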

The importance of parameter identifiability can hardly be overstated, given that it ensures that the theoretically motivated characterization of the data yielded by a model is unique. In lay terms, we want to make sure that the model tells us a unique story for a given set of data, not a multitude of distinct stories. After all, the parameters of cognitive models are supposed to reflect psychologically meaningful processes. In light of the importance of parameter identifiability, many approaches have been developed for overcoming difficulties with it. The present work is concerned with one specific approach: the use of parametric approximations to the distributions of parameter values obtained from a separate dataset as informative priors in subsequent modeling efforts.

In what follows, we will first provide an overview of the different kinds of identifiability and the notion of sloppiness. We will then discuss general approaches for alleviating the problems associated with sloppiness and non-identifiability and focus on one specific method, the empirical-prior approach (Gershman, 2016). We evaluate the point-estimate variant of the empirical-prior approach as proposed by Gershman (2016) in a well-known class of reinforcement-learning models. We assess the improvements brought about by this approach in comparison to the improvements coming from simple extensions of the experimental design. Finally, we develop a Bayesian variant of the empirical-prior approach and apply it to another well-known class of models, this time concerned with decision-making under risk.

Foreshadowing our results, we found that the empirical-prior approach cannot recover the true parameter population distributions, does not improve parameter recovery, and is fragile to mismatches between the true parameter population distributions and the prior used. Similar results were obtained with a Bayesian variant of the empirical-prior approach. In contrast with these rather disappointing results, small changes in the experimental design showed clear improvements in parameter recovery. We conclude that for researchers interested in inferring the ground truth from empirical observations, the shortcomings of impoverished experimental designs that fail to constrain parameter estimates cannot be compensated for through the use of statistical methods such as the empirical-prior approach. We highlight the differences between this so-called “question of inversion” and the “question of inference” that is concerned with coherent interpretation of the data irrespective of the ground truth that generated them.

Overcoming varieties of non-identifiability and sloppiness

A more nuanced view of identifiability is given by the distinction between global identifiability, which concerns the identifiability of the parameters irrespective of any particular data within a given experimental design, and local identifiability, which is only concerned with the identifiability of a model for a particular set of data (for a detailed discussion and examples, see Schmittmann et al., 2010). A model can be globally identifiable but locally non-identifiable due to sparse data (e.g., several empty cells in a multinomial distribution) and/or extreme performance (e.g., ceiling or floor effects). For example, a model that is constrained to predict either the sequence a-a-a-a or b-b-b-b under different parameter values will be locally non-identifiable when observing the sequence a-b-a-b, as both sequences the model is able to produce account for the data equally well (or, rather, equally badly).

A concept closely related to identifiability is “sloppiness” (Brown & Sethna, 2003). Although identifiable, the parameters of a “sloppy” model can be adjusted to partially compensate for any change in the expectations produced by the variation of another parameter. One consequence of sloppiness is that it becomes difficult to determine the parameter values of the underlying data-generating process. These difficulties can be demonstrated in parameter-recovery simulations, in which artificial data are generated with known parameter values. The very same model is then used to fit the data, so that the original and estimated parameter values can be compared. Note that non-identifiability corresponds to the worst-case scenario in terms of sloppiness, with parameters being able to perfectly compensate for the changes in other parameters.
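
The logic of such a parameter-recovery simulation can be sketched as follows. The two-parameter model and the binomial data-generating scheme below are toy assumptions made only for this illustration; comparing the true and recovered values then reveals how poorly the individual parameters are constrained:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)

def simulate(I, R, n_items=50):
    # Generate artificial data with known parameter values (toy binomial scheme).
    p = I + (1 - I) * R
    return rng.binomial(n_items, p), n_items

def neg_loglik(params, k, n):
    I, R = params
    return -stats.binom.logpmf(k, n, I + (1 - I) * R)

true, recovered = [], []
for _ in range(200):
    I, R = rng.uniform(0.05, 0.95, size=2)      # known data-generating values
    k, n = simulate(I, R)
    fit = optimize.minimize(neg_loglik, x0=[0.5, 0.5], args=(k, n),
                            bounds=[(0.01, 0.99), (0.01, 0.99)])
    true.append([I, R])
    recovered.append(fit.x)

true, recovered = np.array(true), np.array(recovered)
r = np.corrcoef(true[:, 0], recovered[:, 0])[0, 1]
# Well below 1: the estimate tracks only the identifiable combination
# I + (1 - I) * R, not I itself.
print(f"correlation between true and recovered I: {r:.2f}")
```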

Given the importance of parameter estimates in a model-based characterization of behavioral phenomena, it follows that parameter identifiability is only a minimal and insufficient requirement. After all, a model can have identifiable parameters but nevertheless manifest poor parameter recovery under a more realistic setting. In line with this notion, considerable efforts have been made to identify and ameliorate difficulties with parameter recovery. For example, White, Servant, and Logan (2017) showed that some of the parameters in drift-diffusion models for conflict tasks, although identifiable, have poor recoverability. In order to mitigate these issues, White et al. (2017) suggested the use of derived measures that try to make up for the parameter trade-offs they observed. In other domains (e.g., decision-making under risk), alternative solutions such as parameter restrictions have been proposed: Nilsson, Rieskamp, and Wagenmakers (2011) evaluated the value function v(⋅) of prospect theory (Kahneman & Tversky, 1979), which assumes that the subjective representation of monetary gains x > 0 follows \(v(x) = x^{\alpha ^{+}}\), whereas for losses x < 0 it follows \(v(x) = -\lambda |x|^{\alpha ^{-}}\). Nilsson et al. showed that the loss-aversion parameter λ and the diminishing-sensitivity parameter for losses α− are extremely hard to recover, as they tend to serve very similar purposes and can therefore trade off with little to no cost. Their solution to this parameter-recovery problem was to simplify the model by setting α− to be equal to its gain-outcome counterpart α+. But as discussed later on, other issues in parameter recovery remain to be addressed.

A more general modeling approach that can address some of the difficulties in parameter recoverability consists of relying on hierarchical or random-effect implementations of models. A key aspect of these implementations is that the estimation of individual-level parameters is informed by the overall sample, capitalizing on the similarities across individuals by shrinking individual estimates towards a central tendency of the group level and thus preventing extreme, noise-driven parameter estimates. This approach is particularly helpful if the experimental design is informative and the model identifiable, but there are not enough data per individual to estimate their respective parameters with high precision (Broomell & Bhatia, 2014). In the cognitive-modeling literature, hierarchical models are typically fitted using Bayesian parameter estimation (e.g., Katahira, 2016; Steingroever, Wetzels, & Wagenmakers, 2014; Wetzels et al., 2010; Ahn et al., 2011).

In contrast to the conventional maximum likelihood estimation (MLE) methods often used in model fitting, Bayesian approaches require the specification of prior distributions for each of the parameters, representing the (prior) beliefs that parameters will take on certain values. These priors are updated using the information provided by the data, resulting in posterior distributions. MLE does not require such a prior nor does it yield a distribution—only the set of parameter values for which the likelihood of the data is maximal. Whether a model is fitted using MLE or Bayesian parameter estimation affects the assessment of identifiability, sloppiness, and parameter recovery. Non-identifiability and sloppiness will lead to regions of the joint posterior distributions which have equal and near-equal density, respectively. When using MLE, trade-offs between sloppy models’ parameters can be observed in the covariance matrix of parameter estimates (see Li, Lewandowsky, & DeBrunner, 1996). When using a Bayesian parameter-estimation framework, parameter trade-offs are reflected in the covariances of posterior samples and the (multivariate) posterior distributions of parameters. Specifically, they often manifest themselves as ridges in the joint posterior distributions (e.g., Scheibehenne & Pachur, 2015, Fig. 1).

A compromise between MLE and Bayesian estimation is offered by the maximum a-posteriori (MAP) method (Cousineau & Hélie, 2013). MAP introduces prior parameter distributions that are used to weight the model’s likelihood function. This prior-weighted MLE yields the modes of the posterior parameter distributions that would be obtained with a fully Bayesian approach using the same set of priors. Both the fully Bayesian approach and MAP have been alluded to as ways to attenuate parameter-identifiability problems (e.g., Moran, 2016) by using an informative prior. An informative prior carries information about the parameters that goes beyond the data at hand, such as from previous empirical work or from theoretical considerations (for an overview, see Lee & Vanpaemel, 2017). For instance, if one has good reasons to believe that reasonable parameter values should most often lie within a certain range, one can specify a prior that overweights that specific range relative to the rest. This weighting discourages estimates from falling outside the expected range unless the data strongly support such values. Hierarchical parameter estimation constitutes a special case of the use of informative priors in which the parameter estimates of individuals inform each other.
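
As a rough illustration of how MAP relates to MLE, the sketch below maximizes a log-likelihood with and without a log-prior term. The binomial likelihood and the Beta(5, 5) prior are hypothetical choices made only for this example:

```python
import numpy as np
from scipy import optimize, stats

def log_likelihood(eta, data):
    # Placeholder log-likelihood for a single bounded parameter eta in (0, 1);
    # in practice this would be the cognitive model's likelihood function.
    k, n = data
    return stats.binom.logpmf(k, n, eta)

def neg_log_posterior(eta, data):
    # MAP objective: the log-likelihood weighted by an informative prior.
    # A Beta(5, 5) prior (an arbitrary assumption here) concentrates mass
    # around .5 and discourages extreme estimates.
    return -(log_likelihood(eta, data) + stats.beta.logpdf(eta, 5, 5))

data = (2, 20)  # e.g., 2 rewarded trials out of 20
mle = optimize.minimize_scalar(lambda e: -log_likelihood(e, data),
                               bounds=(1e-6, 1 - 1e-6), method="bounded")
map_ = optimize.minimize_scalar(lambda e: neg_log_posterior(e, data),
                                bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"MLE: {mle.x:.3f}, MAP: {map_.x:.3f}")  # the MAP estimate is pulled towards the prior
```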

Gershman’s (2016) empirical-prior approach

Gershman (2016) recently argued that one way to obtain an informative prior is to use the distribution of MLE estimates obtained from a separate dataset in an attempt to approximate the population distribution of parameter values—an empirical prior. In the context of reinforcement-learning models (Sutton & Barto, 1998), Gershman demonstrated that the MAP method, together with empirical priors, could improve model performance in several ways: more reliable parameter estimates, improved characterization of individual differences, and increases in predictive accuracy.

The improvements reported by Gershman (2016) for the empirical-prior approach in the context of reinforcement-learning modeling are quite fortunate, as they directly address some long-standing challenges in this domain. Reinforcement-learning models are regularly adopted as a way to analyze repeated trial-and-error decisions in psychology and neuroscience (e.g., Yechiam & Busemeyer, 2005; Schulze, van Ravenzwaaij, & Newell, 2015; Erev & Barron, 2005; Barron & Erev, 2003; Niv et al., 2015; Dayan & Daw, 2008; Chase, Kumar, Eickhoff, & Dombrovski, 2015; Dayan & Balleine, 2002). Despite their prominence, these models have well-documented cases of parameter non-identifiability and sloppiness (e.g., Humphries, Bruno, Karpievitch, & Wotherspoon, 2015; Wetzels et al., 2010; but see, e.g., Ahn et al., 2011, 2014; Steingroever, Wetzels, & Wagenmakers, 2013, for examples of satisfactory parameter identifiability). One illustrative example was recently given by Humphries et al. (2015), who fitted a popular reinforcement-learning model to choice data obtained with the Iowa gambling task (Bechara, Damasio, Damasio, & Anderson, 1994). Humphries et al. showed that the best fits could often be achieved under very different sets of parameter values that yielded quite distinct accounts of the data. For example, one participant had a set of best-fitting parameters indicating that s/he had good memory and produced impulsive, consistent choices. However, another equally good set of parameter estimates for that same participant suggested that s/he could be characterized as an individual with inconsistent choices and poor memory, and who focused more on losses than wins (p. 24). Reducing the model complexity by restricting the number of free parameters limited the degree to which parameters traded off, but also limited the richness of the characterization provided by the model.

Despite its purported advantages, some aspects of Gershman’s (2016) empirical-prior approach require further scrutiny. First, it is not clear how this approach can mitigate the problems of parameter identifiability and recoverability. Since it relies on marginal prior parameter distributions that are independent of each other, the constraints imposed do not extend to the way parameters can jointly vary to produce equivalent or very similar results. In somewhat broad strokes, it can be said that Gershman’s empirical-prior approach is attempting to tackle a problem of parameter covariance by constraining variances. It could very well be that the improvements reported by Gershman (2016) result from shifting the parameter estimates towards values that are more commonly observed in the population, thus avoiding overfitting.

Second, Gershman’s (2016) approach assumes that the empirical priors obtained are reasonable approximations to the distributions of parameter values in the population. Given that the reinforcement-learning models considered in his application suffer from identifiability and recoverability issues, it seems unlikely that such an approximation is achieved to any reasonable degree. If the empirical priors are themselves based on unreliable parameter estimates that do not match the actual data-generating parameters, it is not clear how they could contribute to mitigating any identifiability or recoverability problems. In short, the approach seems to be stuck with a “chicken-or-the-egg” type of problem.

Evaluating the empirical-prior approach

Given the standing questions regarding Gershman’s (2016) empirical-prior approach, we conducted additional evaluations. These evaluations required knowledge of the true parameter values that generated any given dataset, something that is only possible when the data are artificially generated. Therefore, we implemented a series of simulation studies, through which we tested whether the empirical-prior approach improves the recovery of parameter estimates in the reinforcement-learning models considered by Gershman, and how these estimates are affected by the (mis)match between the empirical priors and the actual distribution of parameters in the population. We will later extend these simulation studies to the domain of decision-making under risk.Footnote 1

Method

Data

We used the data from Gershman (2015) as the basis for our simulations. These data came from a probability-learning task in which participants were presented with two options differing in terms of their probabilities of yielding a reward (0 or 1; see Fig. 2 for an example of such a probability-learning task). Participants were provided with partial feedback, such that they were only informed of the outcome yielded by the chosen option. Each participant completed four blocks of 25 trials each, for a total of 100 trials. The data came from four experiments (166 participants in total), which differed only in terms of the reward probabilities used.Footnote 2

Fig. 2

Example of a typical probability-learning task

Models

We focused our comparisons on the same four reinforcement-learning models that Gershman (2016) considered, all of which conform to the basic structure of a Q-learning model (Sutton & Barto, 1998). According to this model, participants keep subjective expectations Q for each of the options and update them based on the feedback they receive from their choices. The updating mechanism typically used is temporal-difference learning, in which the expectation regarding an option X is updated based on the difference between the observed reward R(X) and an expected reward Q(X). This difference is conventionally referred to as the reward prediction error. Formally, the subjective expectation \(Q_{t+1}(X)\) for option X on trial t + 1 is given by

$$ Q_{t+1}(X) = Q_{t}(X) + \eta \left( R_{t}(X) - Q_{t}(X) \right), $$
(1)

where \(R_{t}(X)\) is the reward coming from option X on trial t, and η is the learning rate. Note that the difference in parentheses is the reward prediction error on trial t. Initial expectations \(Q_{0}(\cdot)\) are set to 0, and on every trial the choice probabilities are governed by a logistic function with scaling parameter β.Footnote 3 The simplest model in our set, \(\mathcal {M}_{1}\), assumes a single learning rate η. Model \(\mathcal {M}_{2}\) extends \(\mathcal {M}_{1}\) by assuming different learning rates for positive and negative reward prediction errors:

$$ \eta = \left\{\begin{array}{ll} \eta^{+} & \text{if} \quad R_{t}(X) - Q_{t}(X) \geq 0\\ \eta^{-} & \text{if} \quad R_{t}(X) - Q_{t}(X) < 0. \end{array}\right. $$
(2)

The third model, \(\mathcal {M}_{3}\), also builds on \(\mathcal {M}_{1}\) by including a stickiness mechanism that attempts to capture the notion that individuals tend to repeat previous choices independently of the outcome obtained in the previous trial (also referred to as “perseveration” or “choice inertia”; Yechiam & Ert, 2007; Worthy, Pang, & Byrne, 2013). Formally, \(c_{t-1}\), the choice made on trial t − 1, introduces an additive constant ω to the expectation of this option on trial t:

$$ Q_{t}(X) = Q_{t-1}(X) + \eta \left( R_{t-1}(X) - Q_{t-1}(X) \right) + \omega, \quad \text{if } c_{t-1} = X. $$
(3)

Note that ω < 0 penalizes repeated choices of the same option, whereas ω > 0 encourages repeated choices of the same option, irrespective of the rewards it yields. The final model, \(\mathcal {M}_{4}\), combines the two learning rates introduced in \(\mathcal {M}_{2}\) with the stickiness parameter included in \(\mathcal {M}_{3}\).

Altogether, the four models have the following parameters:

  • \(\mathcal {M}_{1}\): η and β

  • \(\mathcal {M}_{2}\): η+, η−, and β

  • \(\mathcal {M}_{3}\): η, β, and ω

  • \(\mathcal {M}_{4}\): η+, η−, β, and ω
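
For concreteness, a minimal simulation sketch of \(\mathcal {M}_{1}\) is given below (our illustration, not Gershman's implementation). For two options, the logistic choice rule is written here in its equivalent softmax form:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_m1(eta, beta, reward_probs, n_trials=25):
    """Simulate one block of model M1: Q-learning with a single learning rate
    eta and a logistic (softmax) choice rule with scaling parameter beta."""
    q = np.zeros(len(reward_probs))                 # initial expectations Q_0 = 0
    choices, rewards = [], []
    for _ in range(n_trials):
        p_choice = np.exp(beta * q) / np.exp(beta * q).sum()   # choice probabilities
        choice = rng.choice(len(reward_probs), p=p_choice)
        reward = rng.binomial(1, reward_probs[choice])         # 0/1 reward
        q[choice] += eta * (reward - q[choice])     # temporal-difference update
        choices.append(choice)
        rewards.append(reward)
    return np.array(choices), np.array(rewards)

choices, rewards = simulate_m1(eta=0.3, beta=5.0, reward_probs=[0.2, 0.4])
print(choices.mean(), rewards.mean())  # proportion of option-2 choices and of rewarded trials
```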

Priors

We obtained empirical priors for each of the models by fitting them to the individual datasets using a differential-evolution algorithm as implemented in DEoptim (Mullen, Ardia, Gil, Windover, & Cline, 2011) for R (R Development Core Team, 2008). Following Gershman’s (2016) procedure, we restricted the unbounded parameters β and ω to very broad but not impossible ranges ([0, 50] and [− 5, 5], respectively). We used the algorithm with mostly default settings, but increased the number of population members to 50 and the maximum allowed number of population generations to 100. The population members were initialized randomly within the parameter boundaries.

To facilitate sampling from the empirical prior distribution without committing to too many auxiliary assumptions, we fitted the observed parameter estimates with Gaussian mixture models (GMMs).Footnote 4 These mixtures were obtained by first linearly transforming all parameters to the unit scale [0, 1] and then applying an inverse-probit transformation so that they would be represented along the real line.Footnote 5 We then fitted mixture models with up to ten component or base distributions and selected the best-performing mixture model using the Bayesian information criterion (BIC; Schwarz, 1978). The BICs of the fitted GMMs are reported in Table 4 in the Appendix.
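
A sketch of this prior-construction step for a single parameter might look as follows. The estimates are placeholders rather than the actual MLE fits, and scikit-learn's GaussianMixture is used here simply as one available GMM implementation:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
eta_hat = np.clip(rng.beta(0.5, 0.5, size=166), 1e-3, 1 - 1e-3)  # hypothetical MLE estimates

def to_real_line(x, lower, upper):
    unit = (x - lower) / (upper - lower)   # linear transformation to the unit scale
    return norm.ppf(unit)                  # map the unit scale onto the real line

z = to_real_line(eta_hat, 0, 1).reshape(-1, 1)
fits = [GaussianMixture(n_components=k, random_state=0).fit(z)
        for k in range(1, 11)]             # mixtures with 1 to 10 components
best = min(fits, key=lambda m: m.bic(z))   # lowest BIC wins
print(f"selected mixture with {best.n_components} component(s)")
```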

In addition to the empirical priors, we also considered uniform prior distributions that simply reflect the parameter bounds used in the fitting procedure: [0,1], [0,50], and [− 5,5] for the learning rates, scaling parameter, and stickiness, respectively. The results obtained with these uniform priors provide us with the yardstick with which we can evaluate the benefits associated with the use of empirical priors.Footnote 6

Simulation procedure

We began by generating 1,000 sets of 4 × 25 reward sequences based on the reward probabilities associated with the different choice options. Afterwards, we drew 1,000 independent parameter-set samples for each model × prior combination. These draws were used to create the so-called empirical populations (as they were obtained from the empirical priors) and uniform populations (obtained from the uniform priors). The sampled parameters were then used to simulate responses for all of the generated reward sequences.

Parameter recovery

To assess parameter recovery, we fitted the simulated individual responses twice, once using MLE and once using MAP in conjunction with the empirical priors. As before, we relied on a differential-evolution algorithm. To facilitate model fitting, we initialized the population members of the algorithm with samples from the ground-truth population distributions. Parameter recovery was assessed by regressing the estimated parameter values on the true data-generating parameters. This was done separately for each of the parameters, models, and population distributions. Within this context, we chose the explained-variance statistic r2 as a measure of parameter recovery, because it quantifies how much of the variability in the estimates obtained from the data is captured by the true data-generating parameters.
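
A compact sketch of this recovery metric, using hypothetical true and recovered values in place of the simulation output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_eta = rng.uniform(0, 1, size=1_000)                              # data-generating values
recovered_eta = np.clip(true_eta + rng.normal(0, .25, 1_000), 0, 1)   # noisy estimates

slope, intercept, r, p, se = stats.linregress(true_eta, recovered_eta)
print(f"r^2 = {r**2:.2f}")   # proportion of variance in the estimates explained by the truth
```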

Results

Empirical priors

As previously discussed, we obtained our empirical priors by fitting GMMs to the distributions of parameter estimates. The best-fitting estimates are reported in Table 5 on a probability scale. To obtain values on the real scale, transformations have to be applied: Specifically, for the parameters on the unit [0, 1] scale (learning rates η, η+, and η−), values have to be probit-transformed. For parameters on the [0, 50] scale (sensitivity β), values have to be probit-transformed and then multiplied by 50. For the stickiness parameter ω, values have to be probit-transformed, multiplied by 10, and then reduced by 5.

  • \(\mathcal {M}_{1}\): For the η parameter, the best-fitting solution consists of a mixture of four component distributions. The resulting empirical prior is characterized by a strong bimodality at the edges of the parameter space, with relatively little density in between (see Fig. 3, top row, first column). For the β parameter, the best-fitting solution is also a mixture of four Gaussian distributions. Most of the density is concentrated at the region between 0 and 10 on the real scale, with another peak at the upper boundary of the parameter space (50). Almost no density is found between these peaks (see Fig. 3, top row, third column).

  • \(\mathcal {M}_{2}\): For the η+ and η− parameters, the best-fitting solution is a mixture of three Gaussian distributions. In both cases, the priors are characterized by strong bimodalities at the edges of the parameter spaces, with comparatively little density in between (see Fig. 3, second row, first and second columns). For the β parameter, the best-fitting solution is a mixture of six Gaussian distributions. Most of the density is concentrated at the region between 0 and 10 on the real scale, with another peak at the upper boundary of the parameter space—again, with almost no density in between (see Fig. 3, second row, third column).

  • \(\mathcal {M}_{3}\): For the η and β parameters, the best-fitting solutions correspond to a mixture of three Gaussian distributions. As with models \(\mathcal {M}_{1}\) and \(\mathcal {M}_{2}\), they are characterized by large peaks at the boundaries of their respective parameter spaces and little density in between (see Fig. 3, third row, first and third columns). Parameter ω, on the other hand, has a highly peaked trimodal distribution (captured by a mixture of six Gaussian distributions). The peaks are at the boundaries (−5 and +5) and at 0 (see Fig. 3, third row, last column).

  • \(\mathcal {M}_{4}\): The parameter distributions are comparable to the other models’ distributions. For the learning rates η+ and η−, the best-fitting solutions are mixtures of four Gaussian distributions. The distribution of β estimates is best captured by a mixture of two Gaussian distributions, and the stickiness parameter ω is best described by a mixture of six Gaussian distributions. In all cases, the distributions are multimodal with peaks at the boundaries of the parameter spaces (and at 0 for ω) and very little density in between (see Fig. 3, bottom row).

Fig. 3

Distribution of individual-level maximum likelihood parameter estimates and the fitted Gaussian mixture models for each of the four reinforcement-learning models. Empty cells indicate parameters not present in the respective model. Each cell depicts the distributions of the estimated parameters (shaded area) and the fitted mixture model (black line). See “Models” for model specifications

Simulation results

Recovery of population parameter distributions

We assessed whether the parameter estimates obtained from the simulated data resembled the true population distribution. To do so, we first created 100 equally spaced bins that covered the entire parameter range for each of the parameters. Afterwards, we computed the proportion of the ground-truth parameter values falling within each bin (expected frequencies). We then computed a discrepancy statistic using the sum of squares between the expected frequencies and the binned frequencies of the fitted parameter estimates (observed frequencies). For 10,000 samples of 1,000 parameters from each of the ground-truth parameter distributions, we calculated their discrepancy statistics with respect to the expected frequencies, thus obtaining a distribution of discrepancies. Using the relative rank (RR) of the observed discrepancy within this distribution of discrepancies, we can calculate the probability PR = .50 − |RR − .50| of such a rank being observed when assuming that the recovered parameters stem from the true population distributions.
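
The following sketch reproduces the logic of this discrepancy measure; the Beta distributions used for the population and for the "recovered" estimates are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(6)
population = lambda n: rng.beta(2, 2, size=n)     # ground-truth population (placeholder)
true_values = population(1_000)                   # the 1,000 data-generating values
recovered = rng.beta(2, 5, size=1_000)            # hypothetical fitted estimates

bins = np.linspace(0, 1, 101)                     # 100 equally spaced bins
expected = np.histogram(true_values, bins=bins)[0] / 1_000   # expected frequencies

def discrepancy(sample):
    observed = np.histogram(sample, bins=bins)[0] / len(sample)
    return np.sum((observed - expected) ** 2)     # sum of squared differences

# Reference distribution: 10,000 samples of 1,000 values from the population
reference = np.array([discrepancy(population(1_000)) for _ in range(10_000)])
rr = np.mean(reference <= discrepancy(recovered))  # relative rank of the observed discrepancy
pr = .50 - abs(rr - .50)
print(f"P_R = {pr:.3f}")
```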

Recovery of the parameter distributions was found to be somewhat poor across all models, population distributions, and estimation procedures. For \(\mathcal {M}_{1}\), only one out of eight parameter distributions could be recovered, namely the distribution of β stemming from the empirical-prior population when using MLE (PR = .16, all other PRs ≤ .04; see Fig. 4 for an example illustrating the poor distribution recovery). For \(\mathcal {M}_{2}\), the pattern looks worse than for \(\mathcal {M}_{1}\), as none of the distributions could be successfully recovered (all PRs < .01). For \(\mathcal {M}_{3}\), the pattern changes slightly: When using uniform priors, an empirical parameter population distribution can successfully be recovered for the β and ω parameters (both PRs ≥ .12). Otherwise, no parameter distributions were recovered (all PRs ≤ .05). For \(\mathcal {M}_{4}\), no parameter distributions were recovered (all PRs ≤ .02).

Fig. 4

Parameter and distribution recovery for model \(\mathcal {M}_{1}\) (the best-faring case for both types of recoveries). The axes show the true parameters of the model and the corresponding recovered parameters. The true parameters stem from the empirical-prior distribution, and recovered parameters have been obtained using maximum a-posteriori estimation in conjunction with empirical priors

Individual parameter recovery

Recovery of individual parameters, quantified here in terms of r2, was generally poor (see the “Overall” column in Table 1). The only exception was the simplest model, \(\mathcal {M}_{1}\), which had the best overall parameter recovery. \(\mathcal {M}_{2}\) had the worst parameter recovery for all parameters. The most complex model, \(\mathcal {M}_{4}\), was the second-worst in terms of recoverability, followed by \(\mathcal {M}_{3}\).

Table 1 Parameter recoverability (in r2)

Sensitivity to misspecification

To assess the sensitivity to prior misspecification in MAP, we compared the parameter recoveries when the population distributions matched the empirical priors with the analogous recoveries when the two distributions did not match (e.g., fitting data generated from a uniform parameter population when using the empirical priors). The r2 values for all models, estimation methods, and priors are reported in Table 1.

For model \(\mathcal {M}_{1}\), matching the priors to the true underlying population distribution played an important role. The differences in r2 between the matching and mismatching priors could be as high as .10. For model \(\mathcal {M}_{2}\), failing to match the prior used in MAP to the underlying data-generating parameter distributions led to mixed results. In four cases, recoverability became worse when the two distributions mismatched, but it actually improved in two other cases. Turning to model \(\mathcal {M}_{3}\), we found a pattern similar to the other models: Matching the parameter distributions was important, yet a mismatch also led to substantially improved parameter recovery in one case. Finally, for model \(\mathcal {M}_{4}\), except for the learning rates stemming from the empirical-prior distribution, MLE was in all cases better at recovering the true individual-level parameters.

Interim summary

We explored the merit of using MAP in conjunction with empirical priors as a way to improve parameter recovery in reinforcement-learning models. Using simulations, we found that no method yielded satisfactory results for any of the criteria we used (i.e., distribution recovery and individual-parameter recovery), with no method being consistently superior to the other across all models. Parameter recoverability was generally poor, and alarmingly so in some cases, raising serious questions about the ability to draw conclusions about underlying psychological processes under the experimental design considered by Gershman (2015). In the hopes of following up this rather negative state of affairs with a more positive message, we explored different ways to improve the present experimental design.

Exploring ways to improve recoverability

In an attempt to improve recoverability, we considered different ways in which the experimental design used by Gershman (2015) could be improved. To keep things as simple as possible, we restricted ourselves to model \(\mathcal {M}_{1}\). Also, instead of using either MLE or MAP, we adopted a fully Bayesian approach in which posterior distributions of parameters are obtained. In contrast to the point estimates yielded by MLE and MAP, these posterior distributions can be conveniently used to assess the degree of uncertainty surrounding each parameter estimate. Diffuse posteriors are expected when parameters are non-identifiable or sloppy. Note that in some cases, non-identifiability can lead to multimodalities in the marginal posterior distributions and ridges in the joint posterior distributions (with each ridge reflecting a specific parameter trade-off).

Method and Results

We obtained the posterior distributions using a No-U-turn sampler (Hoffman & Gelman, 2014) as implemented in Stan (Carpenter et al., 2017) via the PyStan interface (Stan Development Team, 2016a). We ran four randomly initialized chains in parallel for 1,000 total iterations, out of which 500 were used as a warm-up period to tune the sampler’s parameters. These warm-up samples were discarded afterwards. The remaining 500 iterations from each chain were concatenated, resulting in a total of 2,000 samples. We restricted β to be between 0 and 50, just as with the point-estimate fits reported above. In these analyses, we focused on the range of each parameter’s 95% central posterior interval, divided by the range of its support. The resulting coverage ratio yields values between 0 and 1, with 0 indicating that all posterior mass lies on a single point, and with values approaching 1 indicating that any permissible value is likely (i.e., the data are not informative for the estimation of a given parameter).
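
A minimal sketch of the coverage-ratio computation, using placeholder posterior samples for β:

```python
import numpy as np

posterior_beta = np.random.default_rng(7).gamma(2.0, 2.0, size=2_000)  # placeholder samples

def coverage_ratio(samples, lower, upper):
    lo, hi = np.percentile(samples, [2.5, 97.5])   # 95% central posterior interval
    return (hi - lo) / (upper - lower)             # relative to the parameter's support

print(f"coverage ratio for beta: {coverage_ratio(posterior_beta, 0, 50):.2f}")
```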

Baseline

To ensure a more comprehensive assessment, we engaged in a systematic exploration of the parameter space using a grid search. As our baseline for quantitative comparisons, we used a probability-learning task with 4 blocks of 25 trials each. Within the blocks, the virtual participants chose between two options with reward probabilities (.1, .3), (.2, .4), (.6, .8), and (.7, .9), respectively. We created a linearly spaced 101×101 grid of sensitivities β and learning rates η between 0 and 50 and 0 and 1, respectively. As both η and β values of 0 lead to completely random choices, we excluded these values from the simulations, resulting in a final 100×100 grid. For each of the parameter combinations, we simulated the outcome sequences and responses of ten virtual participants, for a total of 100,000 response vectors. Results for the baseline design and all variants are reported in Table 2.

Table 2 Recoverability of baseline analysis and different manipulations of experimental design (model \(\mathcal {M}_{1}\))

Compared to the simulation results reported in the Individual parameter recovery section, the baseline design showed generally better parameter recoverability due to the change from MLE to the means of the respective posterior distributions (see Table 2). But as the coverage ratios show, the uncertainty surrounding the estimates was still unsatisfactory: In the case of η, we can only hope to reliably distinguish between extremely low and high learning rates. Parameter β does not even lend itself to such hopes. Note that one critical difference between the two estimation procedures is that in the case of multiple maxima, MLE will only yield one of them, whereas using the mean of the posterior distribution effectively averages across these multiple modes. What these results show is that, if anything, one is better off adopting a fully Bayesian approach with non-informative priors than relying on empirical priors in combination with point estimates.

Variant 1: Increase number of trials

We explored how an increase in the number of trials within blocks improves identifiability (Table 2, Trials). We increased the number of trials per block from 25 to 50, leading to a total of 200 instead of 100 trials. As expected, increasing the number of trials within each block improves recoverability for both parameters, although the coverage ratios still indicate a considerable degree of uncertainty.Footnote 7

Variant 2: Increase number of options

As a second variant, we explored the possibility of increasing the number of options for participants to choose from (while keeping the number of blocks and trials per block constant; Table 2, Options). We formed four blocks of four options with reward probabilities (.1, .2, .3, .4), (.6, .7, .8, .9), (.2, .3, .5, .6), and (.5, .6, .8, .9), respectively. The most notable difference between this variant and the first one is that comparable improvements in recoverability are achieved without an increase of the total number of trials.

Variant 3: Provide full feedback

As the last variant, we explored how providing participants with full feedback (i.e., giving feedback about the forgone outcomes) influenced parameter recovery (Table 2, + Full feedback). We assumed that the learning rate is identical for both the chosen and the non-chosen options. Similar to Variant 2, this change of design does not lead to an increased number of observations. Yet, it descriptively provides the greatest improvement of parameter recovery, although the recoverability of β, in particular the coverage ratio, is still somewhat disappointing, as it covers more than half the parameter range on average.

Generalizing the evaluation of the empirical-prior approach: An application to risky-choice modeling

It is possible that our disappointing results with the empirical-prior approach were due to the reliance on point estimates, together with the specific reinforcement-learning models and experimental designs considered by Gershman (2016). In order to evaluate this possibility, we developed a fully Bayesian implementation of the empirical-prior approach, and applied it to a different model class and experimental paradigm.

The basic idea of the fully Bayesian empirical-prior approach is a straightforward extension of the previously used point-estimate empirical-prior method: Instead of fitting GMMs to the point estimates of an MLE-based procedure, the GMMs are fitted to the pooled individual-level posterior distributions. This extension offers two main advantages: First, uncertainty about the parameter estimates used to obtain the empirical priors is directly reflected in the empirical priors obtained, and second, because there are many more data points available per individual, it becomes feasible to estimate the covariance matrices associated with each of the multivariate component distributions in a mixture.

The uncertainty associated with any parameter estimate under a Bayesian framework is directly expressed in that parameter’s posterior distribution. We can use this feature to establish an alternative way of assessing parameter recovery. In addition to computing r2 and coverage ratios, we now also consider P(95%CI): the proportion of times that the true parameter was included in the 95% credible interval estimated from the generated data. These intervals encompass the central 95% of their respective parameter’s posterior distribution. Ideally, one would expect these credible intervals to include the true parameter values with probability .95.
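
A small sketch of how P(95%CI) can be computed; the true values and posterior samples below are placeholders standing in for the simulation output:

```python
import numpy as np

rng = np.random.default_rng(8)
true_vals = rng.uniform(0, 2, size=1_000)                                  # data-generating values
posteriors = true_vals[:, None] + rng.normal(0, .3, size=(1_000, 2_000))   # per-person posterior samples

lo, hi = np.percentile(posteriors, [2.5, 97.5], axis=1)   # central 95% credible intervals
p_95ci = np.mean((true_vals >= lo) & (true_vals <= hi))
print(f"P(95%CI) = {p_95ci:.2f}")   # ideally close to .95
```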

Prospect theory and the risky-choice paradigm

One of the most widely used paradigms in the decision-making literature is the risky-choice paradigm. In this paradigm, an individual is requested to express her/his preferences between different options that yield monetary outcomes with known probabilities (decision-making under risk), such as the lottery \(\text {A} = \left (\begin {array}{cc} \$100 & -\$20 \\ .50 & .50 \end {array}\right )\) that yields a $100 gain with probability .50, otherwise a $20 loss, and an option \(\text {B} = \left (\begin {array}{cc} \$80 & \$0 \\ .50 & .50 \end {array}\right )\) that yields an $80 gain with probability .50, otherwise nothing. Individuals’ preferences regarding options of this kind are expected to capture their subjective representations of monetary outcomes and probabilities, as well as their integration.

Arguably the most prominent theory for describing human behavior in such situations is prospect theory (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992; see Wakker, 2010, for an overview). According to prospect theory, individuals evaluate a decision between lotteries such as A and B by calculating their utilities U(A) and U(B). The core mechanisms that govern this calculation are (a) a reference point relative to which outcomes are evaluated, (b) diminishing sensitivity to larger deviations from the reference point (i.e., the difference between $10 and $20 is perceived as larger than the difference between $1,010 and $1,020), (c) loss aversion (i.e., losses have a higher impact on utilities relative to gains of the same magnitude), (d) over-weighting of rare events, and (e) under-weighting of probable events. Afterwards, the utilities of A and B are compared with each other and the option with the higher utility is chosen by applying some choice rule.

According to prospect theory, the utility U of a two-outcome mixed lottery \(\text {L} = \left (\begin {array}{cc} x^{+} & x^{-} \\ p^{+} & p^{-} \end {array}\right )\) with gain and loss outcomes x+ and x−, respectively, and their respective probabilities p+ and p−, is given by:

$$ U(\text{L}) = v(x^{+})w(p^{+}) + v(x^{-})w(p^{-}), $$
(4)

where v(⋅) is the (already-mentioned) value function and w(⋅) the probability-weighting function. The value function is typically cast as a piecewise power function with parameters α+ and α− capturing the diminishing sensitivity in the domains of monetary gains and losses, respectively, and a loss-aversion parameter λ that captures the asymmetries in the valuation of gains and losses:

$$ v(x) = \left\{\begin{array}{ll} x^{\alpha^{+}}, & \text{for } x = x^{+}, \\ -\lambda|x|^{\alpha^{-}}, & \text{for } x = x^{-},\\ 0, & \text{for } x = 0. \end{array}\right. $$
(5)

Probabilities are assumed to be weighted by an inversely S-shaped function that overweights small probabilities and underweights large probabilities, such as the function proposed by Kahneman and Tversky (1979):

$$ w(p) = \frac{p^{\gamma}}{\left( p^{\gamma} + (1-p)^{\gamma}\right)^{\frac{1}{\gamma}}}, $$
(6)

where γ is the probability-sensitivity parameter. Different parameters γ can also be assigned to the probabilities associated with gains and losses (i.e., γ+ and γ−).

Preferences such as A being preferred to B (A ≻ B) are translated into choice probabilities via a choice rule such as the logistic choice function:

$$ Pr(\text{A}) = \frac{1}{1 + e^{-\theta\left( U(\text{A}) - U(\text{B})\right)}}, $$
(7)

where the choice-sensitivity parameter 𝜃 modulates how differences in utilities affect choice probabilities. Responses are random when 𝜃 = 0, and become more deterministic as it increases.
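
For reference, the sketch below implements Eqs. 4 to 7 for a two-outcome mixed lottery; the parameter values at the bottom are arbitrary illustrations:

```python
import numpy as np

def value(x, alpha_gain, alpha_loss, lam):
    # Piecewise power value function (Eq. 5)
    if x > 0:
        return x ** alpha_gain
    if x < 0:
        return -lam * abs(x) ** alpha_loss
    return 0.0

def weight(p, gamma):
    # Inversely S-shaped probability-weighting function (Eq. 6)
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)

def utility(x_gain, p_gain, x_loss, p_loss, alpha_gain, alpha_loss, lam, gamma):
    # Utility of a two-outcome mixed lottery (Eq. 4)
    return (value(x_gain, alpha_gain, alpha_loss, lam) * weight(p_gain, gamma) +
            value(x_loss, alpha_gain, alpha_loss, lam) * weight(p_loss, gamma))

def p_choose_a(u_a, u_b, theta):
    # Logistic choice rule (Eq. 7)
    return 1.0 / (1.0 + np.exp(-theta * (u_a - u_b)))

pars = dict(alpha_gain=.8, alpha_loss=.8, lam=2.0, gamma=.7)   # arbitrary illustrative values
u_a = utility(100, .5, -20, .5, **pars)   # lottery A from the example above
u_b = utility(80, .5, 0, .5, **pars)      # lottery B from the example above
print(f"P(A over B) = {p_choose_a(u_a, u_b, theta=.5):.2f}")
```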

Prospect theory has often been used in the study of individual differences and temporal stability, from risk attitudes to the subjective representation of monetary outcomes and probabilities (e.g., Booij, van Praag, & van de Kuilen, 2009; Broomell & Bhatia, 2014; Kellen, Pachur, & Hertwig, 2016; Scheibehenne & Pachur, 2015). But despite its many merits, prospect theory is sloppy to some degree, and its parameters suffer from well-documented parameter trade-offs, most notably between the outcome-sensitivity parameters α and the choice-sensitivity parameter 𝜃.

Constructing priors for prospect theory

To evaluate the Bayesian variant of the empirical-prior approach in the risky-choice paradigm, we used the data from Walasek and Stewart (2015, Experiments 1a and 1b). In these experiments, participants were faced with a single-lottery accept-reject task, in which they were offered a mixed lottery with two equiprobable outcomes, such as \(\text {L} = \left (\begin {array}{cc} \$20 & -\$12 \\ .50 & .50 \end {array}\right )\). Participants were asked to decide whether to accept or reject such a lottery, a decision that is assumed to imply a comparison between the lottery and a status quo (with utility 0). This accept-reject task is often used in neuroscientific investigations (e.g., Tom, Fox, Trepel, & Poldrack, 2007; De Martino, Camerer, & Adolphs, 2010; Canessa et al., 2013; Pighin, Bonini, Savadori, Hadjichristidis, & Schena, 2014). Each participant completed all possible combinations of eight different gains and eight different losses, resulting in a total of 64 trials.

Walasek and Stewart’s (2015) study revolved around four different between-subjects conditions that were designed to specifically affect the loss-aversion parameter λ. We will focus on the two conditions that produced the most extreme median λ estimates. In the 40-20 condition (n = 191), gain outcomes ranged from $12 up to $40 in steps of $4, whereas losses ranged from $6 up to $20 in steps of $2. The 20-40 condition (n = 198) flipped the signs of these outcomes (i.e., gain outcomes became losses and vice versa). Walasek and Stewart (2015) reported that the λ estimates were generally above 1 in the 40-20 condition, indicating loss-averse preferences, and below 1 in the 20-40 condition, indicating gain-seeking preferences.

We modeled the data with a streamlined version of prospect theory, in which we assumed that w(p+) = w(p−) = .50 (see Kellen, Mata, & Davis-Stober, 2017; Levy & Levy, 2002; Quiggin, 1982):

$$ P(\text{Accept L}) = \frac{1}{1 + e^{-\frac{\theta}{2}\left( v(x^{+}) - v(x^{-})\right)}}. $$
(8)

Samples from the parameters’ posterior distributions were obtained using a No-U-turn sampler (Hoffman & Gelman, 2014) as implemented in Stan (Carpenter et al., 2017) via the RStan interface (Stan Development Team, 2016b). We ran four randomly initialized chains in parallel for initially 4,000 total iterations, out of which 2,000 were used as a warm-up period to tune the sampler’s parameters. These warm-up samples were discarded afterwards. The remaining 2,000 iterations from each chain were thinned and then concatenated, resulting in a total of 1,000 samples. To assess convergence, we used the \(\hat {R}\) statistic (Gelman et al., 2013, p. 285) and assumed that convergence was reached if all \(\hat {R} \leq 1.01\). If not, we repeated the sampling procedure with twice as many iterations as before, until all parameters converged or a maximum of 64,000 iterations was reached. To avoid singularities in model expectations, we set an upper limit of 2 for parameters α and 𝜃, and of 4 for parameter λ. Also, to avoid numerical over- and underflows, we restricted likelihoods to be between 10⁻⁷ and 1 − 10⁻⁷. Finally, we used uniform priors that spanned the permitted range of each parameter.

The posterior samples from each individual were then linearly transformed so they would all fall within the [0, 1] range and then inverse-probit transformed into the real space. We used multivariate GMMs (with estimation of the full covariance matrix per multivariate kernel) to approximate the aggregated individual posterior distribution of the parameters, separately for each of the two conditions. To determine the best-performing GMMs (we considered GMMs with up to ten component distributions), we used leave-one-participant-out cross validation (see Vehtari & Lampinen, 2002, for other variants of cross validation). The parameters of the best-performing GMMs are reported in Table 6 in the Appendix.

The simulation procedure was similar to the one used in the first part of the paper, but extended by one additional factor: For each of the two conditions, we generated data from a uniform distribution within the restricted parameter boundaries and from each of the two fitted prior distributions. Afterwards, we obtained samples from the posterior distributions using a uniform prior, the prior obtained from fitting the 40-20 condition, or the prior obtained from fitting the 20-40 condition. These samples were obtained using a differential-evolution sampler (Ter Braak & Vrugt, 2008) as implemented in the BayesianTools package (Hartig, Minunno, & Paul, 2017). Consequently, this resulted in a 2 (condition) × 3 (ground-truth prior distribution) × 3 (used prior) simulation design. Within each cell, we obtained a total of 1,000 observations.

As dependent variables, we used the coverage ratio, the r2 between the true parameters and the modes of the respective posterior distributions across individuals, and the proportion of true parameters included in the 95% credible interval, P(95%CI). A low coverage ratio and a proportion close to .95 of parameters included in the 95% credible interval point towards good parameter identifiability, whereas a high r2 reflects a good recovery of the rank ordering of parameters across individuals.

Results

Empirical prior in the 40-20 condition

The empirical distributions of parameters in the 40-20 condition mostly follow the expectations about prospect-theory parameter distributions reported in the literature (Booij et al., 2009). We observed a slight tendency towards risk aversion (i.e., the posterior distribution of α peaks slightly below 1), a tendency towards loss aversion (i.e., most of the posterior mass of λ is above 1), and choices that are stochastic (i.e., the peak of the posterior distribution of 𝜃 tends towards 0). It is noteworthy that the distribution of λ is multimodal: There appear to be at least two peaks, one around a value slightly below 2 and one at 1 (i.e., loss neutrality). See Fig. 5 (main diagonal, gray line) for fine-grained histograms of the marginal distributions of the parameters.

Fig. 5

Posterior distribution across individuals in the 40-20 condition of the risky choice paradigm and the Gaussian mixture model (GMM) that was fitted to it. The main diagonal depicts the marginal distributions of the parameters of the posterior samples (gray line) and the fitted mixture model (black line). Lower diagonal elements show the pair plots of the posterior samples, upper diagonal elements the pair plots of the fitted mixture model. Parameters α, λ, and 𝜃 reflect risk aversion, loss aversion, and the scaling parameter of the logistic choice function, respectively

The inspection of joint parameter distributions (see Fig. 5, lower diagonal elements) reveals strong dependencies. The negative, curvilinear dependency between α and 𝜃 resembles the dependency reported by Scheibehenne and Pachur (2015). The multimodality of the λ parameter makes it difficult to interpret its interdependencies. Disregarding the peak of λ at 1, at which the parameter has no influence on decisions (and thus should not correlate with any other parameter), λ seems to be positively correlated with α and negatively correlated with 𝜃, a pattern that is not very surprising: Larger values of λ lead to a larger influence of losses on the decision variable, which can be partially compensated for by also increasing the symmetrical scaling of both losses and gains (α). These large values in the decision variable, in turn, would lead to more deterministic choices, which can be scaled down with lower values of 𝜃.

The best-fitting GMM in the 40-20 condition turned out to be a mixture of three components (see Fig. 5, main diagonal, black line for the marginal distributions of the best-fitting GMM and Table 4 in the Appendix for the BICs for all numbers of mixtures). Whereas the distributions of α and 𝜃 were approximated very closely, the multimodality of λ cannot be well accommodated with this solution.Footnote 8 Except for the fan-like correlation of λ with α, the GMM was able to closely approximate the covariations found among other parameter pairings.

Empirical prior in the 20–40 condition

The empirical distributions of parameters in the 20-40 condition reflected the experimental manipulations reported by Walasek and Stewart (2015). We observed risk neutrality (i.e., the posterior distribution of α peaks at around 1), a slight tendency towards gain seeking (i.e., the distribution of λ peaks slightly below 1), and stochasticity of choices. All distributions were found to be unimodal, making it easier for the GMMs to approximate them. See Fig. 6 (main diagonal, gray line) for fine-grained histograms of the marginal distributions of the parameters. The inspection of the parameter distributions (see Fig. 6, lower diagonal elements) revealed only a strong dependency between α and 𝜃. This ridge-like relationship is very similar to the one observed in the distributions of the 40-20 condition. Otherwise, almost no dependencies were observable. Note that this lack of dependencies mainly results from the fact that neither an α of 1 nor a λ of 1 has any influence on the decision variable. Therefore, distributions that are strongly peaked around 1 cannot sensibly covary with other parameters.

Fig. 6

Posterior distribution across individuals in the 20-40 condition of the risky choice paradigm and the Gaussian mixture model that was fitted to it. The main diagonal depicts the marginal distributions of the parameters of the posterior samples (gray line) and the fitted mixture model (black line). Lower diagonal elements show the pair plots of the posterior samples, upper diagonal elements the pair plots of the fitted mixture model. Parameters α, λ, and 𝜃 reflect risk aversion, loss aversion, and the scaling parameter of the logistic choice function, respectively

The best-fitting GMM to the posterior distribution of parameters in the 20-40 condition was a mixture of four Gaussians (see Fig. 6, main diagonal, black line for the marginal distributions of the best-fitting GMM and Table 4 for the BICs for all numbers of mixtures). Apart from the height of the peak of λ, all other aspects of the empirical distributions, including the covariations, were well approximated.

Simulation results

We simulated 1,000 virtual participants from each of 2 (experimental condition) × 3 (ground-truth prior) = 6 factor combinations. We then refitted the data coming from each of these virtual participants under three different conditions: (a) using a uniform prior, (b) the empirical prior obtained for the 40-20 condition, and (c) the empirical prior obtained for the 20-40 condition. Just as in the first part of the paper, we first report global results aggregated across all factors, only then turning to the effects of matching conditions and priors, and the influence of (mis)matches between them.

Results reported in Table 3 show that across all experimental conditions, population distributions, and priors, parameter λ was recovered best, followed by α and 𝜃. This rank order holds for both the coverage ratio and r2. However, these results were far from satisfactory, as r2(λ) = .30, r2(α) = .11, and r2(𝜃) = .02 across conditions. Also, the 95% credible intervals included the true parameter values at much lower rates than expected.

Table 3 Parameter recoverability of the fully Bayesian empirical-prior method using streamlined prospect theory

In cases where the prior used matched the data-generating population and the condition, parameter recovery was on average slightly better. In the case of the 40–20 condition, this led to a significantly lower coverage ratio for both λ (M = .16, Md = .15, SD = .06) and α (M = .20, Md = .20, SD = .04). While the coverage ratio for 𝜃 decreased as well, it remained somewhat unsatisfactory as it still spanned roughly half of the range of possible values (M = .50, Md = .49, SD = .07). The pattern of rank stability shows a somewhat different picture though: Whereas the correlation between the ground-truth values and the posterior modes of λ (r2 = .75) and α (r2 = .42) improved dramatically (compared to the aggregated values), it became worse for 𝜃 (r2 = .07). The proportion of ground-truth parameters included in the 95% credible interval barely changed (Mmin = .57, Mmax = .66). Very similar results were found in the 20-40 condition.

The scenario in which the prior matches both the data-generating population and the condition is the most optimistic one. This matters when considering the rates at which the 95% credible intervals included the true parameter values: Under fully Bayesian parameter estimation, increasing sloppiness of a model should merely widen the credible intervals without lowering that rate. However, as shown in Fig. 7, the credible intervals missed the true values much more often than they should have. This result shows that the challenges created by parameter non-identifiability and sloppiness are not automatically addressed by a fully Bayesian treatment. Crucially, the posterior distributions appear well-behaved and show no signs that anything might be wrong with the model specification (see Fig. 7).

Fig. 7

Violin plots for 20 individual posterior distributions of randomly selected virtual participants in the risky choice paradigm. The ground-truth parameters stemmed from the 40-20 prior in the 40-20 condition and were fitted using the 40-20 empirical prior. Thick bars represent the 95% credible intervals. Vertical red lines represent the ground-truth parameter values. Note that, in expectation, the 95% credible interval should include the red line in .95 of the cases (i.e., only one out of 20 red lines for each parameter should lie outside the credible interval). In this sample, eight, five, and ten ground truths of α, λ, and 𝜃, respectively, lie outside the 95% credible interval. Parameters α, λ, and 𝜃 reflect risk aversion, loss aversion, and the scaling parameter of the logistic choice function, respectively. The violin plots have been scaled vertically to fit their respective cells

Let us now turn to the (perhaps more realistic) cases in which there is a mismatch between the ground truth and the modeling assumptions. Here, we report the results for mismatches between condition and prior used (i.e., virtual participants are generated from the ground-truth prior matching their condition, while the prior used during re-fitting varies). Table 3 reports all dependent variables for all the combinations analyzed here. When using the empirical prior from the 40-20 condition to fit data from the 20-40 condition, we observe lower coverage ratios together with lower proportions of true values included in the 95% credible intervals. Both variables improved when the uniform prior was used instead. Results were somewhat similar when the mismatching data came from the 40-20 condition.

General discussion

The present work evaluated the empirical-prior approach for obtaining informative priors, which has been proposed as a way to deal with problems concerning parameter non-identifiability and model sloppiness (Gershman, 2016). Using the reinforcement-learning data originally reported by Gershman (2015), we first tested how the point-estimate variant of the empirical-prior approach fared in comparison with simple MLE. We found that neither approach provided satisfactory results and that neither consistently outperformed the other. We then considered potential variations of Gershman's experimental design as ways to improve recoverability. Modest but encouraging improvements were observed when increasing the number of trials per block or the number of options made available to the participants. To assess whether the rather poor performance of the point-estimate empirical-prior method was specific to its application to reinforcement-learning models (and to the reliance on point estimates), we developed a fully Bayesian extension of the method and tested it on a streamlined variant of prospect theory (Kahneman & Tversky, 1979). In line with the results obtained so far, we again did not observe a general advantage of the empirical-prior method. Importantly, we found that the true parameter values were often missed by the estimates' respective 95% credible intervals, even under a best-case scenario in which both the model and the priors are "true". This result runs counter to the expectation that parameter non-identifiability and sloppiness issues are well captured by the posterior distributions such that they should simply lead to wider posteriors. Instead, we often found posterior distributions concentrated in regions that do not include the true data-generating values.

Fitting data from an experiment and plugging the resulting parameter distributions as informative priors into a separate model-fitting procedure is an elegant and easy-to-implement approach. Unfortunately, the informativeness of these informative priors is limited, and the method does not solve the problem it was designed for. We showed that even when the priors used for the model-fitting procedure (be it MAP or fully Bayesian estimation) are aligned with the true underlying parameter distributions, there are no systematic advantages of informative over uniform priors. In case of a mismatch, which is likely in empirical settings, the ability to recover parameters can drop dramatically, even for rather simple models. Given that the true underlying parameter distributions cannot be recovered to a satisfactory degree, the ability to compare group-level differences is also compromised.

In the case of Gershman's (2015) baseline experiment, we found that the main culprit for the poor recoverability was the limited informativeness of the data. On average, the parameters' posterior distributions were widely dispersed across the ranges of possible values, making it practically impossible to reasonably interpret any point estimates obtained through model fitting. Although some of these problems could be ameliorated by extending the experimental design, such extensions can also introduce their own set of practical problems. For instance, increasing the number of options increases task demands, which can in turn lead to individual preference profiles that models have trouble accounting for (e.g., Steingroever et al., 2014, for a demonstration in the Iowa gambling task; Bechara et al., 1994). Similarly, full feedback can lead to behavioral phenomena that are either unique to such scenarios, like attention allocation to foregone outcomes (e.g., Ashby & Rakow, 2016), or that at least differ considerably from what is found with partial feedback (e.g., Plonsky, Teodorescu, & Erev, 2015; Plonsky & Erev, 2017; Yechiam, Stout, Busemeyer, Rock, & Finn, 2005).

Inversion versus inference

The concepts of parameter identifiability, recoverability, and model sloppiness discussed here are instrumental when attempting to infer the ground truth from data generated by it. Within the context of this question of inversion, identifiability and recoverability are of utmost importance, as without them it is impossible to draw correct conclusions about the underlying cognitive processes. For example, the β and 𝜃 parameters of the evaluated reinforcement-learning models and prospect theory, respectively, had the lowest recoverability of rank orders, as reflected in r² values close to 0. In light of such poor parameter recovery, a relatively high estimated parameter value was not predictive of whether the "true" choice consistency of the respective virtual participant was high or low.

However, such concerns do not carry over wholesale when, for instance, one frames the problem of parameter estimation as a question of inference. In this context, the ability to recover the ground truth cedes center stage to the coherence of our relative support for the different hypotheses. For example, consider a model-selection scenario in which data are generated from a complex model but turn out to still be somewhat likely under a much simpler candidate model. Greater support for the simpler model is a sensible conclusion here, as it is the model that provides the best trade-off between goodness of fit and parsimony, even if it did not generate the data (Lee, forthcoming, pp. 60–61). After all, models that are "wrong" (e.g., simpler than the generative model) can still be useful in predicting behavior (e.g., Lee & Webb, 2005). Nevertheless, it would be unwise to assume that even under such framing we can completely divorce ourselves from concerns related to identifiability and sloppiness. After all, it is still sensible to carefully evaluate the roles that the different parameters in a model can play, and how these can be ascertained under different experimental designs (Broomell & Bhatia, 2014). And even if one is ultimately not attempting to recover some ground truth, parameter-recovery exercises in which a ground truth is known can be seen as a sandbox that helps us understand current difficulties in disentangling the roles parameters play and develop ways to overcome them.

Conclusions

Computational models are popular tools for developing and testing psychological theories of cognition. For them to also be useful tools, it is important to ensure that the parameters obtained from fitting models to data provide an accurate characterization of the underlying cognitive processes. If the data are not suited to inform us about the model parameters, as in the simple probability-learning task used by Gershman (2015), then this requirement is not fulfilled. Informative priors used during the model-fitting procedure can be helpful for estimating parameters (see Lee & Vanpaemel, 2017); however, they do not constitute a panacea for the identifiability or sloppiness problems that often arise when using non-informative experimental designs. In contrast, simple adjustments to the experimental design can often improve parameter recoverability. Based on these results, we conclude that researchers should invest more of their efforts in assessing and improving the information content of their experimental designs instead of relying on statistical methods after the fact. In the end, whether empirical priors help (and how much) is a shot in the dark, as they only seem to help in some of the rare cases in which the priors match the population distribution.