1 Introduction

Inverse reinforcement learning (IRL) has generally been formulated (Russell 1998; Ng and Russell 2000) as:

Given (1) a Markov decision process (MDP) with reward function \(R(s; \theta )\), where \(\theta \) are unknown parameters; (2) a set of state-action paths \(\varXi = \{\xi _1, \dots , \xi _N\}\) demonstrating optimal behavior given the true \(\theta ^*\), where \(\xi _i = (s^i_0, a^i_0, \dots , a^i_{T_i-1}, s^i_{T_i})\); and optionally (3) a prior \(P(\theta )\).

Determine a point estimate \(\hat{\theta }\) or the posterior \(P(\theta |\varXi )\).

IRL problems arise when we wish to infer the goals of an intelligent agent, or predict its future behavior, based on observations of its past behavior. In many situations humans behave in a complex and adaptive manner that simpler models cannot capture. Examples include driver route modeling (Ziebart et al. 2008), helicopter acrobatics (Abbeel et al. 2010), learning to perform motor tasks (Boularias et al. 2011), dialogue systems (Chandramohan et al. 2011), pedestrian activity prediction (Ziebart et al. 2009; Kitani et al. 2012), and commuting routines (Banovic et al. 2016).

Humans are, in general, able to understand and predict the behavior of other humans in familiar settings, even from rather limited observation data. Developing a similar ability in autonomous agents could thus, for example, enable them to interact more naturally with humans in rich everyday situations. However, a limitation of the traditional problem formulation is the assumption that full paths containing both actions and states have been observed. In many real-world situations such fine-grained observations may not be available, for multiple reasons. For example, it may be too costly to set up sensors that could gather them, or it may be impossible to change the measurement devices if they are owned by a third party. Even when accurate sensors are used, various environmental factors may cause unavoidable occlusion, censoring or distortion of the measurements. Furthermore, existing datasets are unlikely to contain full path data if they have not been collected with IRL in mind. We elaborate on these motivations later.

There have been a few initial approaches for addressing this issue. The earliest was to assume that, instead of the actual paths, we observe only the expected discounted sum of state feature values the agent encounters along the demonstrated paths, known as feature expectations (Abbeel and Ng 2004). Later approaches have relaxed the assumption on the state observations from accurate to probabilistic: instead of observing the states, they assume a probability distribution \(P(s_t)\) over the state-space is given for each timestep (Kitani et al. 2012). However, the existing methods are not applicable in more general situations, where the external observer has partial observability at the path level.

Summary of contributions This paper formulates the IRL from summary data (IRL-SD) problem, which extends the IRL problem to situations where the full paths are not directly available. We assume a summarizing function \(\sigma \) acts as a filter between the external observer and the true paths. We demonstrate that even in the most general case with no prior assumptions about the summarizing function, inference is still possible for this problem class, thus significantly extending the scope of problems where IRL can be performed. We derive the exact likelihood for this problem and two approximations that are significantly faster to evaluate. The first approximation is a Monte-Carlo estimate and the second uses an approximate Bayesian computation (ABC) approach. We demonstrate that both of these approximations are feasible for MDPs for which optimal policies can be estimated in a reasonable time. Using a grid world toy example, we demonstrate that both the exact and approximate methods are able to recover the parameters of the reward function with good accuracy, and that the approximate methods scale significantly better. Using a recent RL model from the cognitive science literature, we demonstrate that a sensible approximate posterior can be inferred based only on the task completion times collected from user experiments.

The methods have additional interesting properties. First, they do not differentiate between different types of MDP parameters, which allows inference to be easily extended to any interesting parameters of the generative process besides the traditional reward function. Second, they also allow non-linear reward functions to be used, which is not the case with many existing methods. Third, the approximate methods can also be used in situations where the transition function is not known, as long as we can generate draws \(s_{t+1} \sim P(s_{t+1} | s_t, a_t)\).

2 Inverse reinforcement learning

We give a brief overview of the standard assumptions existing IRL methods make about the observation data, and mention the main approaches to inference. For a more complete review see, for example, Zhifei and Joo (2012).

2.1 Model assumptions

The standard IRL modeling assumption is that an agent is interacting with an MDP environment, demonstrating optimal behavior over N independent episodes, thus creating paths \(\varXi = (\xi _1, \dots , \xi _N)\). Each path is a sequence of states and actions, denoted as \(\xi _i = (s^i_0, a^i_0, \dots , a^i_{T_i-1}, s^i_{T_i})\), where \(s_t\) and \(a_t\) are the state and action at timestep t, and \(T_i\) is the length of trajectory i.

An MDP M is defined by the tuple \((S, A, T, R, \gamma )\), where S is a set of states, A is a set of actions, \(T = P(s_{t+1} | s_t, a_t)\) is the transition function, R(s) is the reward function, and \(\gamma \) is the discount rate. M is defined in terms of some unknown parameters \(\theta \). An instance of M with fixed parameters \(\theta \) is denoted by \(M_\theta \).
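
To make the notation concrete, the following minimal Python container illustrates one way to represent a parametrized MDP \(M_\theta\). It is an illustrative sketch only (the names and types are not from any particular implementation), but the later code sketches below reuse it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np


@dataclass
class MDP:
    """Minimal parametrized MDP M_theta (illustrative names only)."""
    states: Sequence[int]        # S, assumed to be indices 0..|S|-1
    actions: Sequence[int]       # A
    transition: Callable[[int, int], np.ndarray]  # T(s, a) -> vector P(. | s, a) over S
    reward: Callable[[int, np.ndarray], float]    # R(s; theta)
    gamma: float                 # discount rate
    theta: np.ndarray            # unknown parameters of interest
```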

If the agent has partial observability of the environment state, the situation is defined as a POMDP \((S, A, T, R, \Omega , O, \gamma )\), where \(\Omega \) is the set of possible observations and \(O = P(o_t|s_t, a_t)\) is the observation function.

2.2 Observation assumptions

Regarding the observations the external observer has of the agent’s behavior, four types of settings have been studied:

  1. The policy \(\pi = P(a_t|s_t)\) of the agent is known (Ng and Russell 2000); in other words, we know exactly how the agent will behave in any situation.

  2. Noise-free observations of the states of the environment (belief states in POMDP situations (Choi and Kim 2011)) and actions of the agent are available (Ng and Russell 2000; Ratliff et al. 2006; Neu and Szepesvári 2007; Ramachandran and Amir 2007; Dimitrakakis and Rothkopf 2011; Rothkopf and Dimitrakakis 2011; Klein et al. 2012; Michini and How 2012; Klein et al. 2013; Tossou and Dimitrakakis 2013; Choi and Kim 2015; Nguyen et al. 2015; Herman et al. 2016). This is probably the most common formulation in the literature. A benefit of this assumption is that it allows the likelihood to be factorized per state transition.

  3. Feature expectations of paths traveled by the agent are available (Abbeel and Ng 2004; Ziebart et al. 2008; Boularias et al. 2011; Bloem and Bambos 2014). Feature expectations are computed from the true paths by \(\hat{\mu }_E = \dfrac{1}{N}\sum ^N_{i=1}\sum ^{T_i}_{t=0}\gamma ^t\phi (s^i_t)\), where \(\phi \) is a function yielding a vector of state features (see the code sketch after this list). If the reward function is linear in state features, \(R(s) = \theta ^T \phi (s)\), the inference problem can be formulated as a function of \(\theta ^T \hat{\mu }_E\).

  4. Probabilistic observations of the states of the environment are available (Kitani et al. 2012; Surana 2014). Here it is assumed that instead of observing the state \(s_t\), the external observer only observes a distribution \(u_t = P(s_t)\). This is a natural assumption when, for example, the measurements are noisy. The general approach is to estimate the state visitation frequencies based on the observations and use them in turn to estimate the feature expectations \(\hat{\mu }_E\), after which standard methods can be used. Both feature expectations and probabilistic observations can be seen as specific summaries, or incomplete versions, of the actual paths.
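
As referenced in item 3 above, the empirical feature expectations \(\hat{\mu }_E\) are straightforward to compute when full paths are available. The following sketch assumes each path is given as its state sequence and that \(\phi \) maps a state to a feature vector; both representational choices are illustrative assumptions.

```python
import numpy as np


def feature_expectations(paths, phi, gamma):
    """Empirical discounted feature expectations mu_hat_E from full paths.

    paths : list of state sequences [s_0, ..., s_T] (actions are not needed here)
    phi   : function mapping a state to a feature vector
    gamma : discount rate
    """
    mu = np.zeros_like(np.asarray(phi(paths[0][0]), dtype=float))
    for states in paths:
        for t, s in enumerate(states):
            mu += gamma ** t * np.asarray(phi(s), dtype=float)
    return mu / len(paths)
```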

2.3 Inference approaches

There are two common approaches for solving the IRL problem. MCMC can be applied to obtain samples from the posterior when the unnormalized likelihood can be evaluated in closed form (Ramachandran and Amir 2007). Gradient descent can be applied to obtain point estimates when the gradient of the likelihood can be evaluated in closed form (Ziebart et al. 2008). Point estimation based on linear programming (Ng and Russell 2000) and classification (Klein et al. 2012) has also been considered.

2.4 Relationship to imitation learning

The formulation of the IRL problem is close to that of imitation learning (IL), also known as apprenticeship learning (Abbeel and Ng 2004). While in IRL we are interested in recovering the underlying parameters of the model, in IL being able to replicate the behavior of the expert is sufficient. Thus, the goal is to recover a policy \(\pi = P(a_t|s_t)\) such that the behavior generated by the policy matches that demonstrated by the expert, instead of explicitly recovering the parameters \(\theta ^*\) of the underlying MDP.

In general, IRL is a more complex problem than IL, as the parameter recovery problem is generally under-determined and, depending on the formulation, may also have degenerate solutions (such as a reward function that is 0 everywhere) (Ng and Russell 2000). For this reason, the approach has been to either recover the full posterior that quantifies our uncertainty (Ramachandran and Amir 2007), or to find point estimates that are maximally robust (Ratliff et al. 2006). A solution to the IRL problem generally solves the corresponding IL problem, and often gives a more robust solution, as the reward structure tends to generalize better than a replicated policy. For example, it is not clear how an IL policy should behave in a state that is not covered by the examples, whereas the parameters recovered by IRL can be used to estimate the corresponding Q-values and thus to generate behavior that best follows the values of the expert.

3 IRL from summary data

3.1 Problem definition

Let M be an MDP parametrized by \(\theta \), where \(\theta \) is any finite set of parameters of interest (not limited to the reward function parameters). Let the true parameters be \(\theta ^*\) and assume an agent whose behavior agrees with an optimal policy for \(M_{\theta ^*}\). We do not know \(\theta ^*\), but may have a prior \(P(\theta )\). Assume that the agent has taken paths \((\xi _1, \dots , \xi _N)\) but we have only observed summaries of these paths: \(\varXi _{\sigma } = (\xi _{1\sigma }, \dots , \xi _{N\sigma })\), where \(\xi _{i\sigma } \sim \sigma (\xi _i)\). Here \(\sigma (\xi _i) = P(\xi _{i\sigma }|\xi _i)\) is a stochastic summary function that transforms a path into another type of observation, which generally contains less information than the original path (hence the name summary function). The inverse reinforcement learning from summary data (IRL-SD) problem is:

Given (1) a set of summaries \(\varXi _{\sigma }\) from optimal behavior; (2) a summary function \(\sigma \); (3) an MDP M with \(\theta \) unknown; and optionally (4) a prior \(P(\theta )\).

Determine \(\hat{\theta }\) or the posterior \(P(\theta |\varXi _\sigma )\).

In the traditional IRL setting \(\theta \) would be the parameters of the reward function. Our formulation extends the inference problem to other parameters of the MDP as well. A similar extension in the traditional IRL setting was recently considered by Herman et al. (2016).

3.2 Motivating example

To illustrate the issue with traditional IRL methods, consider the following example: “Alice can travel from home to work using any reasonable route. The different routes go through different kinds of scenery, and Alice has specific preferences for what kind of scenery she prefers to look at when commuting. If we know the duration of the commute, can we say anything about Alice’s preferences regarding scenery?”

This is clearly an IRL-type problem, as the reward function of a rational agent should be estimated based on observation data. However, all the existing methods for IRL fail to solve the problem, as no state-action trajectories or feature expectations are available. In comparison, humans are generally able to perform inference in similar settings based on mental simulation (Gallese and Goldman 1998). This suggests that problems such as this are regularly encountered in realistic settings and that they can be solved at least approximately in reasonable time.

However, the above example precisely corresponds to the IRL-SD problem, with \(\sigma \) extracting the duration of the path. Thus, methods that are able to solve the IRL-SD problem will both extend the scope of problems which can be solved with IRL-type approaches and be a step towards being able to imitate human reasoning more closely.
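
For concreteness, a summary function matching this example could simply report the (possibly noise-corrupted) duration of a path. The sketch below is hypothetical and assumes a path is represented by its state sequence \((s_0, \dots , s_T)\).

```python
import numpy as np


def sigma_duration(path_states, noise_sd=0.0, rng=np.random.default_rng()):
    """Hypothetical stochastic summary function: observe only the commute
    duration (number of steps), optionally with additive measurement noise."""
    duration = len(path_states) - 1          # number of actions taken
    return duration + rng.normal(0.0, noise_sd)
```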

3.3 Reasons for summarized observation data

There are multiple concrete reasons why full paths may be unavailable when modeling strategic behavior.

First, environmental and physical restrictions, such as physical occlusion or sensor saturation may prevent us from observing the full paths.

Second, coarse-grained or noisy observations are generally cheaper to acquire compared to accurate path observations. For example, it is significantly easier to log keyboard and mouse clicks from computer users compared to eye-tracking or think-aloud observations.

Third, full path data takes up more space than summaries, which makes it more likely that only the most relevant features of the data are stored for later analysis. Also bandwidth restrictions might prevent transmitting full path data if observations are done remotely.

Fourth, when modeling an adversary, she will likely prevent us from observing the full paths. For example in games of incomplete information, such as poker or Starcraft, the opponent hides the details of her states and actions when possible.

Fifth, privacy guarantees result in data being released only as non-identifying summaries. This is complementary to the previous; here the data is summarized to prevent a possible adversary from identifying specific types of information.

4 Inference methods for IRL-SD

We first derive the observation likelihood for the IRL-SD problem. However, as evaluating the likelihood function can be very expensive, we also propose approximations that are faster to evaluate.

4.1 Exact likelihood

To derive a computable likelihood, we assume both |S| and |A| are finite (e.g. through discretization) and that the maximum number of actions within an observed episode is \(T_{max}\). We denote the finite set of all plausible trajectories (that have non-zero contribution to the likelihood) by \(\varXi _{ap} \subseteq S^{T_{max}+1} \times A^{T_{max}}\).

The likelihood for \(\theta \) given summary observations \(\varXi _\sigma \) is

$$\begin{aligned} L(\theta |\varXi _\sigma ) = \prod _{i=1}^N P(\xi _{i\sigma }|\theta ) = \prod _{i=1}^N \sum _{\xi _i \in \varXi _{ap}} P(\xi _{i\sigma }|\xi _i) P(\xi _i | \theta ), \end{aligned}$$

where \(P(\xi _{i\sigma }|\xi _i)\) is determined by the summary function \(\sigma \), which is assumed to be known, and

$$\begin{aligned} P(\xi _i | \theta ) = P(s^i_0) \prod _{t=0}^{T_i-1} \pi ^*_\theta (s^i_t, a^i_t) P(s^i_{t+1} | s^i_t, a^i_t). \end{aligned}$$

The main difficulty with the exact likelihood is finding the set \(\varXi _{ap}\) and evaluating the sum over it. If \(\sigma \) has a known finite support, this might be used to constrain the set \(\varXi _{ap}\) as paths outside the support can be immediately ruled out.
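
To illustrate how expensive the exact likelihood is, the following brute-force sketch enumerates every state-action sequence up to length \(T_{max}\) and accumulates the two factors above. It assumes the small MDP container sketched earlier and an evaluable summary density; the helper names are illustrative, and the enumeration is only feasible for very small problems.

```python
import itertools


def exact_likelihood(summaries, mdp, policy, sigma_pdf, p0, t_max):
    """Brute-force evaluation of L(theta | Xi_sigma).

    summaries : observed summaries xi_sigma
    policy    : pi*_theta(s, a) -> probability of action a in state s
    sigma_pdf : P(xi_sigma | xi), the known summary density
    p0        : P(s_0), initial-state probabilities indexed by state
    """
    likelihood = 1.0
    for xi_sigma in summaries:
        p_obs = 0.0
        for T in range(1, t_max + 1):
            for states in itertools.product(mdp.states, repeat=T + 1):
                for actions in itertools.product(mdp.actions, repeat=T):
                    # P(xi | theta) = P(s_0) prod_t pi(s_t, a_t) P(s_{t+1} | s_t, a_t)
                    p_path = p0[states[0]]
                    for t in range(T):
                        p_path *= policy(states[t], actions[t])
                        p_path *= mdp.transition(states[t], actions[t])[states[t + 1]]
                        if p_path == 0.0:
                            break
                    if p_path > 0.0:
                        p_obs += sigma_pdf(xi_sigma, (states, actions)) * p_path
        likelihood *= p_obs
    return likelihood
```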

4.2 Monte-Carlo estimate of likelihood

One possibility to deal with the sum over \(\varXi _{ap}\) is to use a Monte-Carlo estimate. In this approach, paths \(\varXi _{MC}\) (set of size \(N_{MC} \ll |\varXi _{ap}|\)) are simulated using an optimal policy \(\pi ^*_\theta \), so that each path is drawn with probability \(P(\xi | \theta )\). The likelihood of each individual observation can be estimated by a Monte-Carlo sum:

$$\begin{aligned} \hat{L}(\theta |\varXi _\sigma ) =&\prod _{i=1}^N \dfrac{1}{N_{MC}} \sum _{\xi _n \in \varXi _{MC}} \dfrac{P(\xi _{i\sigma }|\xi _n) P(\xi _n | \theta )}{P(\xi _n | \theta )}\\ =&\prod _{i=1}^N \dfrac{1}{N_{MC}} \sum _{\xi _n \in \varXi _{MC}} P(\xi _{i\sigma }|\xi _n). \end{aligned}$$

Since each sample \(\xi _n\) is drawn with probability \(P(\xi _n | \theta )\), this term cancels against the corresponding factor in the summand, leaving a simple average of the observation probabilities.

A benefit of this approach is that the transition probabilities \(P(s_{t+1} | s_t, a_t)\) no longer need to be defined in closed form: for generating the Monte-Carlo samples it is enough that we can draw from them. We also need not assume that A or S are finite in size.

One issue with this approach is that there might not be any paths in the Monte-Carlo sample that have a non-zero observation probability for a certain observation in the dataset (that is, \(P(\xi _{i\sigma }|\xi _n) = 0\) for all n). This is common when \(\sigma \) has a negligible support in \(\varXi _{ap}\), or when the path distribution has a “fat tail” which is not sufficiently covered by the finite sample. One way to alleviate this problem is to add a small constant value to the likelihood of each observation as an a-priori estimate. For example, \(1/N_{MC}\) might be a sensible heuristic, as it vanishes with a large enough sample.
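
A minimal sketch of this estimator is given below. It assumes that the summary density \(P(\xi _\sigma |\xi )\) can be evaluated (sigma_pdf) and that the paths in \(\varXi _{MC}\) have already been simulated with \(\pi ^*_\theta \); the log-likelihood is returned for numerical stability, and the \(1/N_{MC}\) pseudo-count mentioned above is included.

```python
import numpy as np


def mc_log_likelihood(summaries, sampled_paths, sigma_pdf):
    """Monte-Carlo estimate of log L(theta | Xi_sigma).

    sampled_paths : N_MC paths simulated with an optimal policy pi*_theta,
                    i.e. each drawn with probability P(xi | theta)
    sigma_pdf     : evaluable summary density P(xi_sigma | xi)
    """
    n_mc = len(sampled_paths)
    log_l = 0.0
    for xi_sigma in summaries:
        p = np.mean([sigma_pdf(xi_sigma, xi) for xi in sampled_paths])
        # small pseudo-count avoids a zero likelihood when no sampled path
        # has non-zero observation probability for this summary
        log_l += np.log(p + 1.0 / n_mc)
    return log_l
```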

4.3 ABC estimate of likelihood

A third alternative is to avoid evaluating the likelihood function entirely, and use an approximate Bayesian computation (ABC) approach (Sunnåker et al. 2013) instead. ABC also uses Monte-Carlo samples for estimating the likelihood, but does it by comparing the samples directly to the observation data using a discrepancy function, which is often chosen to be similar to the prediction error function. Essentially this means that the Monte-Carlo sample is transformed into simulated summary observations using \(\sigma \), after which the discrepancy to the observation data is computed.

The discrepancy function is denoted by

$$\begin{aligned} \delta (\varXi ^{A}_\sigma , \varXi ^{B}_\sigma ) \in [0, \infty ). \end{aligned}$$

As we make no assumptions about the type of the summary observations, the choice of \(\delta \) is not fixed here. In the ABC literature, \(\delta \) is often a norm between general features of the summary datasets; alternatively, the prediction error function or its logarithm can be used.

Using \(\delta \) we can define a stochastic variable

$$\begin{aligned} d_\theta \sim \delta (\varXi ^{sim}_\sigma , \varXi _\sigma ), \end{aligned}$$

where \(\varXi ^{sim}_\sigma = \{\sigma (\varXi _{MC,n})\}_{n=1\dots |\varXi _\sigma |}\). The ability of \(\theta \) to generate data similar to the observation data is quantified by the distribution of \(d_\theta \).

The likelihood can be retrieved exactly using a \(\delta \) with the property \(\delta (\varXi ^{A}_\sigma , \varXi ^B_\sigma ) = 0 \Leftrightarrow \varXi ^{A}_\sigma = \varXi ^B_\sigma \). In this case the likelihood can be written as

$$\begin{aligned} L(\theta | \varXi _\sigma )&= P(\varXi _\sigma | \theta ) = P(\varXi ^{sim}_\sigma = \varXi _\sigma | \theta )\\&= P(d_\theta = 0 | \theta ), \end{aligned}$$

which follows from the fact that the process for generating \(\varXi ^{sim}_\sigma \) is precisely our assumed generative model.

However, estimating \(P(d_\theta = 0 | \theta )\) from a finite Monte-Carlo sample is challenging as most realizations lead to \(d_\theta \gg 0\). For this reason, we do an ABC approximation:

$$\begin{aligned} \tilde{L}_\varepsilon (\theta | \varXi _\sigma ) = P(d_\theta \le \varepsilon | \theta ), \end{aligned}$$

with an approximation threshold \(\varepsilon \in [0, \infty )\). This approximate likelihood is easier to estimate when \(\varepsilon \) is similar to the observed values of \(d_\theta \). The choice of \(\varepsilon \) is often done adaptively.

This approach can be seen as “IRL through imitation learning”, as we are estimating the parameter likelihood through behavior similarity. This is an extension to matching feature expectations (Abbeel and Ng 2004), but generalized to the global features of the behavior available through \(\sigma \). A further benefit of this approach is that the observation probabilities \(P(\xi _{\sigma }|\xi )\) do not need to be available in closed form, as long as we can draw samples from \(\sigma \).
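
The simplest (rejection-style) way to estimate \(\tilde{L}_\varepsilon (\theta )\) is to repeatedly simulate summary datasets with \(\pi ^*_\theta \) and count how often the discrepancy falls below \(\varepsilon \); Sect. 4.4 replaces this repeated simulation with a GP surrogate over the discrepancy. The sketch below assumes hypothetical helper functions for simulating a summary dataset and computing \(\delta \).

```python
import numpy as np


def abc_likelihood(summaries, simulate_summaries, discrepancy, theta,
                   epsilon, n_rep=100, rng=np.random.default_rng()):
    """Rejection-style ABC estimate of L_tilde_eps(theta) = P(d_theta <= epsilon).

    simulate_summaries : (theta, n, rng) -> simulated summary dataset of size n
                         (simulate paths with pi*_theta, then pass them through sigma)
    discrepancy        : delta(simulated_summaries, observed_summaries) >= 0
    """
    hits = 0
    for _ in range(n_rep):
        sim = simulate_summaries(theta, len(summaries), rng)
        if discrepancy(sim, summaries) <= epsilon:
            hits += 1
    return hits / n_rep
```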

4.4 Inference

Recent work has shown the feasibility of Gaussian process (GP) (Rasmussen 2004) surrogates for expensive likelihoods (Rasmussen 2003), also in the ABC setting (Gutmann and Corander 2016). We use this approach as well, as the likelihoods we work with are expensive to evaluate. The Bayesian optimization (BO) (Brochu et al. 2009) sampling strategy is used for concentrating the samples so that high likelihood regions are well estimated.

Algorithm 1 summarizes the estimation of the likelihood surface based on both the exact and approximate methods. As we are performing global non-convex optimization, we make the additional assumption that the likelihood mass is mainly contained within a bounded region \(\varTheta \). We utilize two generic subroutines: RL(M) is a function that, given MDP M, finds an optimal policy \(\pi ^*\), and SIM\((M, \pi )\) is a function that, given an MDP M and policy \(\pi \), simulates a path \(\xi \) using the policy. For a GP fit with data D and hyperparameters H, we denote the predicted mean at \(\theta \) by \(G_\mu (\theta | D, H)\) and the standard deviation by \(G_s(\theta | D, H)\); the full GP posterior is denoted by \(G(\theta | D, H)\). We denote the number of samples for estimating the surrogate by \(N_{opt}\) and the BO acquisition function value at \(\theta \) by \(Acq(\theta | D, H)\) (the maximum of Acq defines the next sample location in BO). \(\Phi (\varepsilon |\mu , s)\) denotes the CDF of \(N(\mu , s)\) at \(\varepsilon \). The threshold \(\varepsilon \) was set to the minimum predicted value of the discrepancy, as it represents the “best that the model can do” given the available information.

Algorithm 1 Estimation of the likelihood surface (exact, Monte-Carlo and ABC variants) using a GP surrogate with Bayesian optimization
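
The following sketch conveys the spirit of Algorithm 1 for the ABC variant, using scikit-learn's GP regressor and a simple lower-confidence-bound acquisition evaluated on random candidates; the actual algorithm, kernel and acquisition function may differ, and all helper names are illustrative.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor


def fit_discrepancy_surrogate(observed, simulate_summaries, discrepancy,
                              bounds, n_init=10, n_opt=100,
                              rng=np.random.default_rng()):
    """Fit a GP surrogate to the discrepancy surface with a simple BO loop.

    observed           : the observed summaries Xi_sigma
    simulate_summaries : (theta, n, rng) -> simulated summary dataset of size n
                         (internally: RL(M_theta), then SIM, then sigma)
    bounds             : array of shape (dim, 2) defining the bounded region Theta
    """
    bounds = np.asarray(bounds, dtype=float)
    dim = len(bounds)

    def evaluate(theta):
        # one expensive evaluation of d_theta
        sim = simulate_summaries(theta, len(observed), rng)
        return discrepancy(sim, observed)

    def random_points(n):
        return bounds[:, 0] + rng.random((n, dim)) * (bounds[:, 1] - bounds[:, 0])

    thetas = list(random_points(n_init))              # initial space-filling sample
    ds = [evaluate(t) for t in thetas]

    gp = GaussianProcessRegressor(normalize_y=True)
    for _ in range(n_opt - n_init):
        gp.fit(np.array(thetas), np.array(ds))
        cand = random_points(512)
        mu, sd = gp.predict(cand, return_std=True)
        theta = cand[np.argmin(mu - 2.0 * sd)]        # lower-confidence-bound acquisition
        thetas.append(theta)
        ds.append(evaluate(theta))

    gp.fit(np.array(thetas), np.array(ds))
    return gp, np.array(ds)


def abc_likelihood_from_gp(gp, theta, epsilon):
    """Approximate ABC likelihood Phi(epsilon | G_mu(theta), G_s(theta))."""
    mu, sd = gp.predict(np.atleast_2d(theta), return_std=True)
    return norm.cdf(epsilon, loc=mu[0], scale=sd[0])
```

Setting \(\varepsilon \) to the minimum of the evaluated discrepancies mirrors the threshold choice described above.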

For posterior inference, the log-likelihood in Algorithm 1 can be replaced with the log-posterior. With ABC, the likelihood can be multiplied by the prior after estimation.

5 Experiments

To study the performance of the proposed inference methods, we start with a well-known toy MDP, but change the observation assumptions to match the IRL-SD problem. Through this example, we demonstrate that we are able to infer the parameters of the agent’s reward function based only on summarized path observations. With this MDP the approximate methods are able to recover the reward function parameters with comparable quality to the exact method, but considerably faster.

We also demonstrate that our approach scales to realistic modeling cases as well. We show that the ABC approximation is able to infer a reasonable approximate posterior for a RL-based cognitive model from the HCI literature, based on measurements of real user behavior. The details of the experiments are given in “Appendix A”.

5.1 Grid world

Grid world is a well-known problem type in the IRL literature (Ng and Russell 2000; Abbeel and Ng 2004; Neu and Szepesvári 2007; Boularias et al. 2011; Herman et al. 2016). In this problem, an agent is located on a cell in a discrete two-dimensional grid of \(w \times w\) cells. When the agent enters a cell, it receives a reward based on the features of the cell \(\phi (s)\) and the features of the agent’s reward function \(\theta \), according to \(R(s) = \theta ^T \phi (s) + r_{step}\).

In our case, the agent is initially located on a random cell at the edge of the grid. The cell at the center of the grid is the goal, and entering the goal gives the agent a large positive reward and ends the episode. Each grid cell has \(N_f\) binary features, which have been generated by placing w walls for each feature at random on the grid (the seed value used for generating the grid is part of the MDP definition). An example of a grid with three features is shown in Fig. 1.

Fig. 1 Visualization of a \(13\times 13\) grid with three features, generated by placing 13 random walls per feature. Each feature is shown individually, with black squares denoting the presence of the feature. Each feature can be thought of as a different type of terrain (e.g. mountains, swamp, forest)

The summary function is defined as \(\sigma (\xi ) = (s_0, |\xi |)\), yielding the initial state at the edge and the number of steps it took to reach the goal at the center (i.e. we do not know what the intermediate states or actions were). Our problem is to infer likely values for \(\theta \in [-1, 0]^{N_f}\), such that the simulated behavior with these values matches the observations, given a set of summary observations \(\varXi _\sigma \) and the MDP definition. This setting corresponds to the motivating example of Sect. 3.2, related to Alice’s scenery preferences while commuting.
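
A sketch of the grid world reward and summary function under the definitions above; the step cost and goal reward values are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np


def grid_reward(cell_features, theta, at_goal, r_step=-0.1, r_goal=10.0):
    """R(s) = theta^T phi(s) + r_step, with a large positive reward at the goal.
    r_step and r_goal are illustrative values."""
    if at_goal:
        return r_goal
    return float(np.dot(theta, cell_features)) + r_step


def sigma_gridworld(path_states):
    """Summary function sigma(xi) = (s_0, |xi|): the initial state and the
    number of steps taken to reach the goal."""
    return path_states[0], len(path_states) - 1
```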

5.2 Experiment 1: algorithm run-time

First, we compared the empirical run-times of the exact and approximate methods. For the approximate methods we used a Monte-Carlo sample of size 1000.

We simulated observation sets with \(N =\) 200 from grids of various sizes. We used grids with no features (\(N_f\) = 0) to avoid long paths that would make the exact method infeasible to evaluate. We computed the first iteration step for all algorithms and recorded the elapsed wall-clock time. The algorithms were implemented with Python and executed on an Intel Xeon X5650 2.67 GHz processor restricted to 300 MB of memory.

The empirical run-time of the exact algorithm grows rapidly as the size of the grid increases (Fig. 2). This is expected, as \(|\varXi _{ap}|\) grows exponentially as the length of the path grows linearly. On the other hand, the run-times of the approximate algorithms scale comparatively much better. ABC and Monte-Carlo (MC) are equally expensive, as expected.

Fig. 2 Mean duration (log10 scale) of the first step of the exact and approximate methods as a function of the size of the problem (N = 5). Smaller is better

5.3 Experiment 2: inference quality

We compared the quality of inference between the exact and approximate methods on small grids. We also investigated the performance of the approximate methods on larger grids, where the exact method is computationally infeasible. The experiments were performed with \(N_f =\) 2 and 3. When comparing to the exact method (w being 9 and 11), we limited the length of paths in the observation dataset to at most 12 to keep the computation time feasible (leaving on average 97% and 93% of the observations, respectively). We also used a random baseline, which is a uniform random draw from the parameter space.

We measure inference quality both by the accuracy of the parameter recovery, which quantifies IRL performance, and by prediction accuracy, which quantifies imitation learning performance. The accuracy of the parameter recovery was measured with the RMSE between the likelihood mean (computed using MCMC) and the ground truth. The mean was used instead of the maximum-likelihood estimate because the likelihoods were sometimes broad, which made the mean a more robust estimate in initial trials.

Prediction error was measured with the MAE in path length per individual starting location, measured on a separate dataset generated with the same ground truth parameters. As the discrepancy \(\delta \) we used the logarithm of the prediction error computed on the observation dataset (as the errors appeared to be log-normally distributed).
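
One way to implement this discrepancy, assuming summaries of the form \((s_0, |\xi |)\) as above, is to average path lengths per starting location and take the logarithm of the MAE over locations present in both datasets; the exact bookkeeping here is an assumption of this sketch.

```python
from collections import defaultdict

import numpy as np


def mean_length_by_start(summaries):
    """Average path length per starting location from (s_0, length) summaries."""
    acc = defaultdict(list)
    for s0, length in summaries:
        acc[s0].append(length)
    return {s0: np.mean(v) for s0, v in acc.items()}


def gridworld_discrepancy(sim_summaries, obs_summaries):
    """delta = log of the MAE in mean path length per starting location,
    over locations present in both datasets."""
    sim = mean_length_by_start(sim_summaries)
    obs = mean_length_by_start(obs_summaries)
    common = set(sim) & set(obs)
    mae = np.mean([abs(sim[s0] - obs[s0]) for s0 in common])
    return np.log(mae + 1e-12)   # small constant guards against log(0)
```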

We observe that the approximate methods perform well compared to the exact method. The approximate methods are able to recover the reward function parameters with accuracy comparable to the exact method, as shown in Fig. 3. This demonstrates that Monte-Carlo sampling is a feasible approach for estimating the true likelihood, as is directly matching the global features of the predicted behavior with ABC. Also, the discrepancy of the predicted behavior is relatively low with all methods, suggesting that the policies recovered by the methods are good approximations of the true policy. There were no statistically significant differences in ground truth errors or prediction errors between any of the methods, except for the random baseline, which was worse (N = 30).

The approximate methods are able to perform well on larger grids where the exact method is computationally infeasible. They are able to recover the parameter values reliably (Fig. 3) and the discrepancy also increases predictably with the grid size (Fig. 4).

We also observe that the approximate likelihood densities are sensible estimates of the true likelihood, as shown in Fig. 5. In this particular example it can be seen that the ratio of the rewards is well identified, but there is still uncertainty left in the scale of the rewards. It would not have been possible to infer this insight from just a point estimate, which demonstrates the benefit of estimating the full likelihood surface.

Fig. 3 RMSE to ground truth (mean and standard deviation, N = 30), smaller is better

Fig. 4 Prediction error on test data (mean and standard deviation, N = 30), smaller is better

Fig. 5 Representative example of likelihood densities estimated with different methods (2 features). Both Monte-Carlo and ABC are able to produce a reasonable approximation of the exact likelihood. Left: exact. Center: Monte-Carlo. Right: ABC. The color maps are chosen so that the maxima of the functions are white and minima are black. The likelihood mean is marked with a square and the ground truth parameters with a star. More examples are found in “Appendix A” (Color figure online)

5.4 Experiment 3: modeling computer users

In the final experiment we infer the full posterior of a recent RL-based cognitive model using realistic observation data.

The task is to estimate the parameters of an MDP modeling the oculomotor system of a user who is searching for a specific item in a computer drop-down menu (Chen et al. 2015; Kangasrääsiö et al. 2017). With large computer screens, traditional IRL methods have been applicable because detailed actions can be measured with eye-tracking (Mohammed and Staadt 2015); with small menus, however, the accuracy of eye-tracking is often poor. In contrast, simple summary statistics, such as the time between opening a menu and clicking the target item, are easy to measure accurately, but using them requires solving the IRL-SD problem.

Recently Kangasrääsiö et al. (2017) found MAP parameter estimates for the model using summary observations from a user study by Bailly et al. (2014). The summary observation included the task completion time in milliseconds (TCT, sum of the durations of all actions in an episode) and whether the target was present or absent in the menu. We extend their analysis by showing that full posteriors can be estimated based on the same dataset and a similar model (see “Appendix A” for details of the model).

Although the state transition function is only defined as a computable algorithm, and the summary function \(\sigma \) is a delta distribution, the ABC method is still applicable.

Getting the average TCT predicted correctly is the primary goal of the model, and getting the variation correct as well is the secondary goal. For this reason, the discrepancy function \(\delta \) was chosen to be the logarithm of the sum, over both menu conditions, of the squared difference in TCT means plus the absolute difference in TCT standard deviations.
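
A sketch of this discrepancy, assuming the simulated and observed data are given as arrays of task completion times per menu condition; the condition names and data layout are illustrative.

```python
import numpy as np


def menu_discrepancy(sim_data, obs_data):
    """delta for the menu model: log of the sum, over the target-present and
    target-absent conditions, of the squared difference in TCT means plus the
    absolute difference in TCT standard deviations.

    sim_data / obs_data : dict mapping condition ('present', 'absent') to an
                          array of task completion times in ms (illustrative layout).
    """
    d = 0.0
    for cond in ('present', 'absent'):
        sim = np.asarray(sim_data[cond], dtype=float)
        obs = np.asarray(obs_data[cond], dtype=float)
        d += (sim.mean() - obs.mean()) ** 2
        d += abs(sim.std() - obs.std())
    return np.log(d + 1e-12)
```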

We infer the posteriors of three parameters of the MDP: (1) the duration of eye fixations \(f_{dur}\) (units of 100 ms); (2) the duration of moving the mouse to select an item \(d_{sel}\) (units of 1 s); and (3) the probability of recalling the full menu layout from memory \(p_{rec}\).

The reward function is such that the agent receives a penalty equal to the number of milliseconds spent on performing the action. The duration of an action is the sum of the saccade duration (based on the distance between two consecutive fixation locations), \(f_{dur}\) and \(d_{sel}\). From this perspective, \(f_{dur}\) and \(d_{sel}\) can also be seen as parameters of the reward function. Finding the correct item leads to a reward of 10k, as does quitting when there is no target item in the menu. Quitting when the target is present results in a penalty of 10k.
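
A sketch of the per-action reward just described. The argument names are illustrative, \(f_{dur}\) and \(d_{sel}\) are assumed to be given in milliseconds here (i.e. already converted from the units used for inference), and applying the selection delay only to selection actions is an interpretation made for this sketch.

```python
def menu_step_reward(saccade_ms, f_dur_ms, d_sel_ms, selection=False,
                     target_found=False, quit_no_target=False,
                     quit_with_target=False):
    """Per-action reward of the menu model (illustrative sketch).

    The penalty equals the action duration in milliseconds: saccade time plus
    fixation duration, plus the selection delay for selection actions.
    Terminal outcomes add +10k (correct item found, or quitting an absent-target
    menu) or -10k (quitting while the target is present).
    """
    duration = saccade_ms + f_dur_ms + (d_sel_ms if selection else 0.0)
    reward = -duration
    if target_found or quit_no_target:
        reward += 10_000
    if quit_with_target:
        reward -= 10_000
    return reward
```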

The posterior is visualized in Fig. 6 using 2D slices at the MAP location (\(d_{sel} =\) 0.05, \(p_{rec} =\) 0.80, \(f_{dur} =\) 2.6). We observe that a posteriori there is a correlation between \(f_{dur}\) and \(p_{rec}\), and similarly between \(f_{dur}\) and \(d_{sel}\). Both of these are understandable, as increasing \(f_{dur}\) would increase the predicted TCT, as would decreasing \(p_{rec}\) or increasing \(d_{sel}\). The posterior of \(f_{dur}\) is centered around 260 ms, but there is still uncertainty left in \(d_{sel}\) and \(p_{rec}\). The uncertainty in \(d_{sel}\) is explained by the difficulty of pointing precisely to the target item with the cursor, which causes variation in its duration. The uncertainty in \(p_{rec}\) is explained by the fact that the menus encountered early in the experiments were completely new to the subjects, but as the experiment progressed the subjects were more and more likely to recall the menus. We also observe that there is no significant posterior correlation between \(p_{rec}\) and \(d_{sel}\). This indicates that although they both affect the TCT, their effects are orthogonal; increasing the probability of recalling a menu cannot be fully compensated for just by increasing the selection duration.

Fig. 6 The approximate posterior inferred with ABC demonstrates that the parameters can be identified and that the remaining uncertainty is well characterized. Left: fixation duration \(f_{dur}\) and menu recall probability \(p_{rec}\). Center: fixation duration \(f_{dur}\) and selection delay \(d_{sel}\). Right: menu recall probability \(p_{rec}\) and selection delay \(d_{sel}\). The color map is chosen so that the maximum of the posterior is white and minimum is black (Color figure online)

The simulated data at the MAP location was able to reproduce the general features of the observation data. A comparison of key features is shown in Table 1.

Table 1 Comparison of menu model prediction means (MAP estimate) and observation data means

6 Discussion

The experiments demonstrate that the proposed approximate methods are applicable for inferring RL-based models from aggregate observation data, when it is acceptable that the inference takes some time. For example, many off-line scientific modeling scenarios fall into this setting. However, there are still multiple complementary options for improving the speed and scalability of the proposed methods. One option for scaling up to higher-dimensional parameter spaces is to find a lower-dimensional subspace where the most interesting variation takes place (Wang et al. 2016). One option for speeding up the solution of the RL problems is RL transfer learning, as it is generally faster to find a good policy starting from an existing policy for nearby parameter values (Ramachandran and Amir 2007) than to learn one from scratch.

An interesting feature of both of the proposed approximations is that they do not explicitly depend on the path likelihood. With the MC approximation, this is due to a term cancellation, and with the ABC approximation it is due to the likelihood-free modeling approach. This means that the limitations on performance are different from usual: instead of being limited by the ability to evaluate the path likelihood function, the methods are limited by the ability to generate reasonable behavior with given parameter values. Although generating samples from the model is often a less efficient inference method than evaluating the likelihood function directly, the situation is different when one does not have the luxury of choosing the observation data to precisely match the model assumptions, such that the likelihood would have a convenient form. Furthermore, the fact that the generative model is now “decoupled” from the inference method might open up new avenues of research in modeling strategic behavior, as this decoupling enables greater flexibility in the design of the generative model, instead of being limited strictly to the MDP assumptions.

With full path observations the summary function \(\sigma \) becomes the identity and the exact likelihood becomes the same as in most traditional IRL methods. Thus the proposed exact method should in principle yield a similar posterior as existing Bayesian IRL methods (e.g. Ramachandran and Amir 2007). The two proposed approximations have been designed specifically for the situation where the observations are available only in summarized form. If MC is used with full path observations, the probability of sampling exactly the same paths as in the observation data might be arbitrarily small, which causes practical problems for this method. The ABC approximation can be used with full path observations as long as the discrepancy function \(\delta \) and threshold \(\varepsilon \) are reasonable. However, due to the likelihood-free approach, the ABC approximation will likely be slower than more specialized methods when the full paths are available and the likelihood gradient is computable.

The need to have some knowledge of the summarizing function \(\sigma \) is, in general, an unavoidable requirement for performing inference. In this work we assumed that \(\sigma \) was known in advance. If \(\sigma \) is unknown, it might be estimated from data, provided that full path observations are available for at least a subset of the data.

Also, it is clear that the amount of information available about the model parameter values depends on \(\sigma \). Thus, not all possible \(\sigma \) lead to a feasible setting for inference. As it is challenging to define requirements for \(\sigma \) without considering the specific application, the feasibility of inference needs to be evaluated based on expert knowledge or empirical experiments. However, a key benefit of the proposed Bayesian approach is that the full posterior allows the remaining uncertainty to be directly estimated.

The need to choose the discrepancy function \(\delta \) and threshold \(\varepsilon \) is unavoidable in ABC; a recent summary of different methods is provided by Lintusaari et al. (2017). The most promising choices are to either use domain knowledge, which is naturally task-specific, or more generally to learn from data a classifier which can be used to form the discrepancy function (Gutmann et al. 2018).

7 Summary

In this paper we defined the IRL-SD problem, where the task is to do inverse reinforcement learning based on summarized observations of the agent’s behavior. We proposed exact and approximate methods for inference. The Monte-Carlo approximation can be used when the summary density \(P(\xi _\sigma |\xi )\) can be evaluated and has non-negligible support, and the ABC approximation even when \(\sigma \) can only be sampled from. We demonstrated that all proposed methods are able to produce feasible results, but the exact method is computationally expensive. The approximate methods, in contrast, can be used even for full posterior inference with realistic MDPs and real observation data. The presented methods are feasible baselines for more specialized inference algorithms that may take advantage of further assumptions, and are state-of-the-art in situations that are currently out of reach for existing, more specific methods.

Overall, regarding partial observability in IRL, there have been two cases for which methods exist:

  • If the agent has partial observability of the environment state, a POMDP model can be used (Choi and Kim 2011).

  • If the external observer has partial observability of the environment state, traditional IRL methods can be extended (Kitani et al. 2012).

This work extends this list by a third item:

  • If the external observer has partial observability of the complete path, then the presented methods for IRL-SD can be applied.