
1 Introduction

We consider forecasting a vehicle’s trajectory (i.e., predicting future paths). Forecasts can be used to foresee and avoid dangerous scenarios, plan safe paths, and model driver behavior. Context from the environment informs prediction, e.g. a map populated with features from imagery and LIDAR. We would like to learn a context-conditioned distribution over spatiotemporal trajectories to represent the many possible outcomes of the vehicle’s future. With this distribution, we can perform inference tasks such as sampling a set of plausible paths, or assigning a likelihood to a particular observed path. Sampling suggests routes and visualizes the model; assigning likelihood helps measure the model’s quality.

Fig. 1.

Left: Natural image input. Middle: generated trajectories (red circles) and the true, expert future (blue squares) overlaid on a LIDAR map. Right: Generated trajectories respect an approximate prior, here a “cost function,” overlaid as a heatmap. Making the expert paths likely corresponds to \(\min _\pi H(p,q_\pi )\). Only producing likely paths corresponds to steering the trajectories away from unlikely territory via \(\min _\pi H(q_\pi ,\tilde{p})\). Doing both, i.e. producing most of the likely paths while mostly producing likely paths, corresponds to \(\min _\pi H(p,q_\pi ) + \beta H(q_\pi ,\tilde{p})\). (Color figure online)

Our key motivation is to learn a trajectory forecasting model that is simultaneously “diverse”—covering all the modes of the data distribution—and “precise” in the sense that it rarely generates bad trajectories, such as trajectories that intersect obstacles. Covering the modes ensures the model can generate samples similar to human behavior. High “precision” ensures the model rarely generates samples very different from human behavior, which is important when samples are used for a downstream task. Figure 1 contrasts a model trained only to cover modes with a model trained to both cover modes and generate good samples; the latter generates fewer samples that hit perceived obstacles. To these ends, we define our model \(q_\pi \) as the trajectory distribution induced by rolling out (simulating) a stochastic one-step policy \(\pi \) for T steps to produce a trajectory sample x, and we propose choosing \(\pi \) to minimize the following symmetrized cross-entropy objective, where \(\phi \) denotes the scene context:

$$\begin{aligned} \min _\pi \; H(p, q_\pi ) + \beta H(q_\pi , \tilde{p}). \end{aligned}$$
(1)

The \(H(p, q_\pi )\) term encourages the model \(q_\pi \) to cover all the modes of the distribution of true driver behavior p, by heavily penalizing \(q_\pi \) for assigning a low density to any observed example from p. However, \(H(p, q_\pi )\) is insensitive to samples from \(q_\pi \), so optimizing it alone can yield a model that generates some “low-quality” samples. The \(H(q_\pi , \tilde{p})\) term penalizes \(q_\pi \) for generating “low-quality” samples (where an approximate data density \(\tilde{p}\) is low). However, \(H(q_\pi , \tilde{p})\) is insensitive to whether \(q_\pi \) covers the modes of \(\tilde{p}\). Therefore, we optimize both terms simultaneously to collect their complementary benefits and mitigate their complementary shortcomings. This motivation is illustrated in Fig. 2. As the true density function p is unavailable, we cannot evaluate \(H(q_\pi , p)\). Instead, we substitute a learned approximation, \(\tilde{p}\), that is simple and visually interpretable as a “cost map.”

In this work, we advocate using the symmetrized cross-entropy metrics for both training and evaluation of trajectory forecasting methods. This is made feasible by viewing the distribution \(q_\pi \) as the pushforward of a base distribution under the function \(g_\pi \) that rolls out (simulates) a stochastic policy \(\pi \) (see Fig. 3b). This idea (also known as the reparameterization trick, [9, 22]) enables optimization of model-sample quality metrics such as \(H(q_\pi , \tilde{p})\) with SGD. Our representation also admits efficient, exact computation of \(H(p, q_\pi )\), even when the policy is a very complex function of context and past state, such as a CNN.

Fig. 2.

Illustration of the complementarity of cross-entropies \(H(p, q_\pi )\) (top) and \(H(q_\pi , p)\) (bottom). Dashed lines show past vehicle path. Light blue lines delineate samples from the data (expert) distribution p. Samples from the model \(q_\pi \) are depicted as red lines. Green areas represent obstacles (areas with low p). The left figure shows cross-entropy values for a reference model. Other figures show poor models and their effects on each metric. \(\epsilon \) is a very small nonnegative number. (Color figure online)

We present the following novel contributions: (1) we identify the diversity-precision trade-off of generative forecasting models and formulate a symmetrized cross-entropy training objective to address it; (2) we propose training a policy to induce a roll-out distribution that minimizes this objective; (3) we use the pushforward parameterization to render inference and learning in this model efficient; (4) we refine an existing deep imitation learning method (GAIL) based on our parameterization; (5) we illuminate deficiencies of previously used trajectory forecasting metrics; (6) we outperform state-of-the-art forecasting and imitation learning methods, including our improvements to GAIL; (7) we present CaliForecasting, a novel large-scale dataset designed specifically for vehicle ego-motion forecasting.

2 Related Work

Trajectory Forecasting work spans two primary domains: trajectories of vehicles and trajectories of people. The method of [26] predicts future trajectories of wide receivers from surveillance video. In [5, 23, 28, 50], future pedestrian trajectories are predicted from surveillance video. Deterministic vehicle predictions are produced in [18], and deterministic pedestrian trajectories in [3, 30, 34]. However, non-determinism is a key aspect of forecasting: the future is generally uncertain, with many plausible outcomes. While several approaches forecast distributions over trajectories [12, 25], global sample quality and likelihood have not been considered or measured, hindering performance evaluation.

Activity Forecasting is distinct from trajectory forecasting, as it predicts categorical activities. In [17, 24, 35, 36], future activities are predicted via classification-based approaches. In [33], a first-person camera wearer’s future goals are forecasted with Inverse Reinforcement Learning (IRL). IRL has been applied to predict and control robot, taxi, and pedestrian behavior [23, 31, 52].

Imitation Learning can be used to frame our problem: learn a model to mimic an agent’s behavior from a set of demonstrations [2]. One subtle difference is that in forecasting, we are not required to actually execute our plans in the real world. IRL is a form of imitation learning in which a reward function is learned to model demonstrated behavior. In the IRL method of [49], a cost map representation is used to plan vehicle trajectories. However, no time-profile is represented in the predictions, preventing use of time-profiled metrics and modeling. GAIL [16, 27] is also a form of IRL, yet its adversarial framework and policy optimization are difficult to tune and lead to slow convergence. By adding the assumption of model dynamics, we derive a new differentiable GAIL training approach, supplanting the noisy, inefficient policy gradient search procedure. We show this easier-to-train approach achieves better performance in our domain.

Image Forecasting methods generate full image or video representations of predictions, endowing their samples with interpretability. In [43,44,45], unsupervised models are learned to generate sequences and representations of future images. In [46], surveillance image predictions of vehicles are formed by smoothing a patch across the image. [42, 47] also predict future video frames with an intermediate pose prediction. In [10], predictions inform a robot’s behavior, and in [40], policy representations for imitation and reinforcement learning are guided by a future observation forecasting objective. In [7], image boundaries are predicted. One drawback of image-based forecasting methods is the difficulty of measurement, a drawback shared by many popular generative models.

Generative models have surged in popularity [9, 13, 14, 16, 25, 44, 51]. However, one major difficulty is performance evaluation. Most popular models are quantified through heuristics that attempt to measure the “quality” of model samples [25]. In image generation, the Inception score is a popular heuristic [38]. These heuristics fail to measure the learned distribution’s likelihood, the gold standard for evaluating probabilistic models. Notable exceptions include [9, 20], which also leverage invertible pushforward models to perform exact likelihood inference.

Fig. 3.

(a) Consider making trajectories inside the yellow region on the road likelier by increasing \(\log q_\pi (x)\) for the demonstration \(x \sim p\) inside the region. This is achieved by making an infinitesimal region around \(g_\pi ^{-1}(x)\) more likely under \(q_0\) by moving the region (yellow parallelogram, size proportional to \(\vert \mathrm {det} J_{g_\pi } \vert ^{-1}\)) towards a mode of \(q_0\) (here, the center of a Gaussian), and making the region bigger. Increasing \(\log \tilde{p}(x)\) for some sample \(x \sim q_\pi \) is equivalent to sampling a (red) point z from \(q_0\) and adjusting \(\pi \) so as to increase \(\log \tilde{p}(g_\pi (z; \phi ))\). (b) Pushing forward a base distribution to a trajectory distribution. (Color figure online)

3 Approach

We approach the forecasting problem from an imitation learning perspective, learning a policy (state-to-action mapping) \(\pi \) that mimics the actions of an expert in varying contexts. We are given a set of training episodes \(\{ (x,\phi )_n\}_{n=1}^{N}\), each consisting of a short vehicle trajectory and its context. In each episode \((x,\phi )_n\), \( x \in \mathbb R^{T \times 2}\) is a sequence of T two-dimensional future vehicle locations, and \(\phi \) is an associated set of side information. In our implementation, \(\phi \) contains the past path of the car and a feature grid derived from LIDAR and semantic segmentation class scores. The grid is centered on the vehicle’s position at \(t=0\) and is aligned with its heading.

Repeatedly applying the policy \(\pi \) from a start state with the context \(\phi \) results in a distribution \(q_\pi (x|\phi )\) over trajectories x, since our policy is stochastic. Similarly, the training set is drawn from a data distribution \(p(x|\phi )\). We therefore train \(\pi \) so as to minimize a divergence between \(q_\pi \) and p. This divergence consists of a weighted combination of the cross-entropies \(H(p, q_\pi )\) and \(H(q_\pi , \tilde{p})\). We describe forms of \(\tilde{p}\) precisely in Sect. 3.1; for now, conceptualize it as a distribution that assigns low likelihood to trajectories passing through obstacles. In the following, \(\varPhi \) denotes the distribution of ground-truth features:

$$\begin{aligned} \min _\pi \mathbb {E}_{\phi \sim \varPhi } \left[ - \mathbb {E}_{x \sim p(\cdot | \phi )} \log q_\pi (x | \phi ) - \beta \mathbb {E}_{x \sim q_\pi (\cdot | \phi )} \log \tilde{p}(x | \phi ) \right] . \end{aligned}$$
(2)

The motivation for this objective is illustrated in Fig. 2. The two terms are complementary. \(H(p, q_\pi )\) is intuitively similar to recall in binary classification, in that it is very sensitive to the model’s ability to produce all of the examples in the dataset, but is relatively insensitive to whether the model produces examples that are unlikely under the data. \(H(q_\pi , \tilde{p})\) is intuitively similar to precision, in that it is very sensitive to whether the model produces samples likely under \(\tilde{p}\), but is insensitive to whether \(q_\pi \) produces all of the samples in the dataset.

3.1 Pushforward Distribution Modeling

Optimizing Eq. (2) presents at least two challenges: we must be able to evaluate \(q_\pi (x|\phi )\) at arbitrary x in order to compute \(H(p,q_\pi )\), and we must be able to differentiate the expression \(\mathbb E_{x \sim q_\pi (\cdot | \phi )} \log \tilde{p}(x|\phi )\). We address these issues by constructing a learnable bijection \(g_\pi \) between samples from \(q_\pi \) and samples from a simple noise distribution \(q_0\), as illustrated in Fig. 3b; in our construction, the bijection is interpreted as a simulator mapping noise to simulated outcomes. This construction allows us to evaluate the required expressions and derivatives via the change-of-variables formula and the reparameterization trick.

Specifically, let \(g_\pi (z; \phi ) : \mathbb R^{T \times 2} \rightarrow \mathbb R^{T \times 2}\) be a simulator mapping noise sequences \(z \sim q_0\) and scene context \(\phi \) to forecasted outcomes x. Then the distribution of forecasted outcomes \(q_\pi (x|\phi )\) is fully determined by \(q_0\) and \(g_\pi \): this distribution, \(q_\pi \), is known as the pushforward of \(q_0\) under \(g_\pi \) in measure theory. If \(g_\pi \) is differentiable and invertible (\(z=g^{-1}_\pi (x; \phi )\)), then \(q_\pi \) is obtained by the change-of-variables formula for multivariate integration:

$$\begin{aligned} q_\pi (x | \phi ) = q_0\big (g_\pi ^{-1}(x; \phi )\big )\big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |^{-1}, \end{aligned}$$
(3)

where \(J_{g_\pi }(g_\pi ^{-1}(x;\phi ))\) is the Jacobian of \(g_\pi \) evaluated at \(g_\pi ^{-1}(x; \phi )\). This resolves both of the aforementioned issues: we can evaluate \(q_\pi \) and we can rewrite \(\mathbb E_{x \sim q_\pi } \log \tilde{p}(x)\) as \(\mathbb E_{z \sim q_0} \log \tilde{p}(g_\pi (z; \phi ))\), since \(g_\pi (z; \phi ) \sim q_\pi \). The latter allows us to move derivatives w.r.t. \(\pi \) inside the expectation, as \(q_0\) does not depend on \(\pi \). Figure 3a illustrates how this aids learning. Equation (2) can then be rewritten as:

$$\begin{aligned} \min _\pi - \mathop {{}\mathbb {E}}_{\phi \sim \varPhi }&\mathop {{}\mathbb {E}}_{x \sim p(\cdot | \phi )} \log \frac{q_0(g_\pi ^{-1}(x; \phi ))}{\big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |} - \beta \mathop {{}\mathbb {E}}_{z \sim q_0} \log \tilde{p}(g_\pi (z; \phi ) | \phi ). \end{aligned}$$
(4)

We note ours is not the only way to represent \(q_\pi \) and optimize Eq. (2). As long as \(q_\pi \) is analytically differentiable in the parameters, we may also apply REINFORCE [48] to obtain the required parameter derivatives. However, empirical evidence and some theoretical analysis suggest that the reparameterization-based gradient estimator typically yields lower-variance gradient estimates than REINFORCE [11]. This is consistent with the results we obtained in Sect. 4.
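To make the variance comparison concrete, the following toy sketch (our own illustration, not part of the paper's implementation) estimates \(\nabla _\theta \mathbb E_{x \sim q_\theta } f(x)\) for a one-dimensional Gaussian \(q_\theta = \mathcal N(\theta , 1)\) with both estimators; the function f stands in for \(\log \tilde{p}\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # stand-in for log p~(x): peaked at x = 2
    return -(x - 2.0) ** 2

def grad_reparam(theta, n=1000):
    # Reparameterize x = theta + z with z ~ N(0, 1), then differentiate f directly:
    # d/dtheta f(theta + z) = f'(theta + z).
    x = theta + rng.standard_normal(n)
    return np.mean(-2.0 * (x - 2.0))

def grad_reinforce(theta, n=1000):
    # REINFORCE: d/dtheta E[f(x)] = E[f(x) * d/dtheta log q_theta(x)], score = (x - theta).
    x = theta + rng.standard_normal(n)
    return np.mean(f(x) * (x - theta))

# Both estimate the true gradient (4.0 at theta = 0); the reparameterized estimate
# typically exhibits much lower variance across repeated runs.
print(grad_reparam(0.0), grad_reinforce(0.0))
```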

An Invertible, Differentiable Simulator. In order to exploit the pushforward density formula (3), we must ensure \(g_\pi \) is invertible and differentiable. Inspired by [9, 21], we define \(g_\pi \) as an autoregressive map, representing the evolution of a controlled, discrete-time stochastic dynamical system with additive noise. Denoting \([x_1,\dots ,x_{t-1}]\) as \(x_{1:t-1}\), and \([x_{1:t-1}, \phi ]\) as \(\psi _t\), the system is:

$$\begin{aligned} x_t \triangleq \mu ^\pi _t(\psi _t; \theta ) + \sigma ^\pi _t(\psi _t; \theta ) z_t, \end{aligned}$$
(5)

where \(\mu ^\pi _t(\psi _t; \theta ) \in \mathbb R^2\) and \(\sigma ^\pi _t(\psi _t; \theta ) \in \mathbb R^{2\times 2}\) represent the stochastic one-step policy, and \(\theta \) its parameters. The context, \(\phi \), is given in the form of a past trajectory \(x_\mathrm {past}=x_{-H_\mathrm {past}+1:0} \in \mathbb R^{2H_\mathrm {past}}\), and overhead feature map \(M \in \mathbb R^{H_\mathrm {map} \times W_\mathrm {map} \times C}\): \(\phi = (x_\mathrm {past}, M)\). Note that the case \(\sigma ^\pi = 0\) would correspond to simply evolving the state by repeatedly applying \(\mu ^\pi \)—though this case is not allowed, as then \(g_\pi \) would not be invertible. However, as long as \(\sigma ^\pi _t\) is invertible for all x, then \(g_\pi \) is invertible, and it is differentiable in x as long as \(\mu ^\pi \) and \(\sigma ^\pi \) are differentiable in x. Since \(x_{\tau _1}\) is not a function of \(x_{\tau _2}\) for \(\tau _1 < \tau _2\), the determinant of the Jacobian of this map is easily computed, because it is triangular (see supplement). Thus, we can easily compute terms in Eq. 4 via the following:

$$\begin{aligned}{}[g_\pi ^{-1}(x)]_t = z_t = \sigma ^\pi _t(\psi _t; \theta )^{-1}(x_t - \mu ^\pi _t(\psi _t; \theta )), \end{aligned}$$
(6)
$$\begin{aligned} \log \big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |= \sum _t \log \big \vert \mathrm {det} \big (\sigma ^\pi _t(\psi _t; \theta )\big )\big \vert . \end{aligned}$$
(7)

We note that \(q_\pi \) can also be computed via the chain rule of probability. For instance, if \(z_t \sim \mathcal N(0, I)\) is standard normal, then the per-step conditional distributions are

$$\begin{aligned} q_\pi (x_t| \psi _t) = \mathcal N(x_t; \mu = \mu ^\pi _t(\psi _t; \theta ), \varSigma = \sigma ^\pi _t(\psi _t; \theta )\sigma ^\pi _t(\psi _t; \theta )^\top ). \end{aligned}$$
(8)

However, since it is still necessary to compute \(g_\pi \) in order to optimize \(H(q_\pi , \tilde{p})\), we find it simplifies the implementation to compute \(q_\pi \) in terms of \(g_\pi \).
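As a concrete illustration of this parameterization, the sketch below implements the pushforward rollout (Eq. 5), exact likelihood evaluation via Eqs. (3), (6), and (7), and a one-episode estimate of the objective (Eq. 4). It is a minimal PyTorch sketch under our own simplifying assumptions (the paper's implementation is in TensorFlow, Sect. 4.2): `policy` is any callable returning \(\mu ^\pi _t\) and \(\sigma ^\pi _t\) from the history and the map M, and `log_p_tilde` is an assumed callable evaluating \(\log \tilde{p}\).

```python
import math
import torch

def rollout(policy, x_past, M, z):
    """g_pi: push noise z in R^{T x 2} forward to a trajectory x in R^{T x 2} (Eq. 5)."""
    hist, out = list(x_past), []
    for t in range(z.shape[0]):
        mu_t, sigma_t = policy(hist, M)               # one-step policy: mu_t (2,), sigma_t (2, 2)
        x_t = mu_t + sigma_t @ z[t]
        hist.append(x_t)
        out.append(x_t)
    return torch.stack(out)

def log_q(policy, x_past, M, x):
    """Exact log q_pi(x | phi) via the change-of-variables formula (Eq. 3), with z_t ~ N(0, I_2)."""
    hist, log_q0, log_det = list(x_past), 0.0, 0.0
    for t in range(x.shape[0]):
        mu_t, sigma_t = policy(hist, M)
        z_t = torch.linalg.solve(sigma_t, x[t] - mu_t)                  # Eq. (6)
        log_q0 = log_q0 - 0.5 * (z_t @ z_t) - math.log(2 * math.pi)     # log q0(z_t), standard normal
        log_det = log_det + torch.linalg.slogdet(sigma_t).logabsdet     # Eq. (7)
        hist.append(x[t])
    return log_q0 - log_det

def symmetrized_loss(policy, x_past, M, x_expert, log_p_tilde, beta):
    """One-episode, one-sample estimate of the objective in Eq. (4)."""
    z = torch.randn_like(x_expert)
    x_sample = rollout(policy, x_past, M, z)          # x ~ q_pi via the pushforward
    return -log_q(policy, x_past, M, x_expert) - beta * log_p_tilde(x_sample, M)
```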

Prior Approximation of the Data Distribution. Evaluating \(H(q_\pi , p)\) directly is unfortunately impossible, since we cannot evaluate the data distribution p’s PDF. We therefore propose approximating it with a very simple density estimator \(\tilde{p} \approx p\) trained independently and then fixed while training \(q_\pi \). Simplicity reduces sample-induced variance in fitting \(\tilde{p}\)—crucial, because if \(\tilde{p}\) severely underestimates p in some region R due to sampling error, then \(H(q_\pi , \tilde{p})\) will erroneously assign a large penalty to samples from \(q_\pi \) landing in R.

We consider two options for \(\tilde{p}\). The first is simply a kernel density estimator with a relatively large bandwidth. Since we have only one training sample per episode, this reduces to a single-kernel model. Choosing an isotropic Gaussian kernel, \(H(q_\pi , \tilde{p})\) becomes, up to constants, \(\mathbb E_{\hat{x} \sim q_\pi (\cdot | \phi )} \Vert x - \hat{x} \Vert ^2/\sigma ^2\), where \((x, \phi )\) constitutes an episode from the data. The net objective (2) in this case corresponds to \(H(p, q_\pi )\) plus a mean squared distance penalty between model samples and data samples.

The second possibility is making an i.i.d. approximation; i.e., parameterizing \(\tilde{p}\) as \(\tilde{p}(x \mid \phi ) = \prod _t \tilde{p}_c(x_t \mid \phi )\). We proceed by discretizing \(x_t\) in a large finite region centered at the vehicle’s start location; \(\tilde{p}_c\) then corresponds to a categorical distribution with L classes representing the L possible locations. Training the i.i.d. model can then be reduced to training \(\tilde{p}_c\) via logistic regression:

$$\begin{aligned} \min _{\tilde{p}} - \mathbb E_{x \sim p} \log \tilde{p}(x) = \max _{\theta } \mathbb E_{x \sim p} \sum _t -C_\theta (x_t, \phi ) - \log \sum _{y=1}^L \exp -C_\theta (y, \phi ), \end{aligned}$$
(9)

where \(C_\theta = -\log \tilde{p}_c\) can be thought of as a spatial cost function with parameters \(\theta \). We found it useful to decompose \(C_\theta (y)\) as a sum \(C^0_\theta (y) + C^1_\theta (y, \phi )\), where \(C^0_\theta \in \mathbb R^L\) is thought of as a non-contextual location prior, and \(C^1_\theta (y, \phi )\) has the form of a convolutional neural network acting on the spatial feature grid in \(\phi \) and producing a grid of scores \(\in \mathbb R^L\). Figure 4 shows example learned \(C^1_\theta (\cdot , \phi )\).
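As a sketch of this training procedure (illustrative only; `cost_net` is an assumed CNN mapping the feature grid to an L-channel score map, and the grid-cell indices of the expert positions are assumed precomputed), the per-episode loss of Eq. (9) can be written as:

```python
import torch
import torch.nn.functional as F

def prior_nll(C0, cost_net, M, x_cells):
    """Per-episode negative log-likelihood of the i.i.d. prior (Eq. 9).

    C0:       (L,) learnable non-contextual location prior C^0_theta
    cost_net: CNN mapping the feature grid M to contextual costs C^1_theta over the L cells
    x_cells:  (T,) integer grid-cell index of each future position x_t
    """
    cost = C0 + cost_net(M).reshape(-1)         # C_theta(y, phi) = C^0(y) + C^1(y, phi) for all L cells
    log_p_c = F.log_softmax(-cost, dim=0)       # log p~_c(y | phi) = -C(y) - log sum_y' exp(-C(y'))
    return -log_p_c[x_cells].sum()              # sum over timesteps (i.i.d. factorization)
```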

3.2 Policy Modeling

We turn to designing learnable functions \(\mu ^\pi _t\) and \(\sigma ^\pi _t\). Across our three models, we use the following expansion: \(\mu ^\pi _t(\psi _t) = 2x_{t-1} - x_{t-2} + \hat{\mu }^\pi _t(\psi _t)\). The first terms correspond to a constant-velocity step (\(x_{t-1} + (x_{t-1} - x_{t-2})\)), and let us interpret \(\hat{\mu }^\pi _t\) as a deterministic acceleration. Altogether, the update equation (Eq. 5) mimics Verlet integration [41], used to integrate Newton’s equations of motion.

“Linear”: The simplest model uses \(\hat{\mu }^\pi _t, S_t\) linear in \(\psi _t\):

$$\begin{aligned} \hat{\mu }^\pi _t(\psi _t)&= Ah_{t} + b_0, \quad S_t(\psi _t) = Bh_{t} + b_1, \end{aligned}$$
(10)

with \(A \in \mathbb R^{2 \times 2H}\), \(h_{t} = x_{t-H:t-1} \in \mathbb R^{2H}\), \(B \in \mathbb R^{4 \times 2H}\), \(b_0 \in \mathbb R^{2}\), \(b_1 \in \mathbb R^{4}\), and \(S_t(\psi _t) \in \mathbb R^{2 \times 2}\) obtained by reshaping \(Bh_t + b_1\). To produce a positive-definite \(\sigma ^\pi _t\), we use the matrix exponential [29]: \(\sigma ^\pi _t = \text {expm}(S_t + S_t^\top )\), which we found to optimize more efficiently than \(\sigma ^\pi _t = S_tS_t^\top \).
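A minimal sketch of this Verlet-style Linear policy follows (our own illustrative shapes and initialization; it assumes the history already contains at least \(H \ge 2\) past positions, seeded from \(x_\mathrm {past}\)):

```python
import torch

class LinearPolicy(torch.nn.Module):
    """One-step "Linear" policy (Eq. 10) with the Verlet-style mean expansion of Sect. 3.2."""
    def __init__(self, H=3):
        super().__init__()
        self.H = H
        self.A  = torch.nn.Parameter(1e-3 * torch.randn(2, 2 * H))
        self.b0 = torch.nn.Parameter(torch.zeros(2))
        self.B  = torch.nn.Parameter(1e-3 * torch.randn(4, 2 * H))
        self.b1 = torch.nn.Parameter(torch.zeros(4))

    def forward(self, hist, M=None):             # M is ignored: the Linear model has no perception
        h = torch.stack(list(hist[-self.H:])).reshape(-1)   # h_t = x_{t-H:t-1}, flattened to (2H,)
        mu_hat = self.A @ h + self.b0                        # deterministic "acceleration"
        mu = 2 * hist[-1] - hist[-2] + mu_hat                # constant-velocity (Verlet) step
        S = (self.B @ h + self.b1).reshape(2, 2)
        sigma = torch.matrix_exp(S + S.T)                    # symmetric positive-definite sigma_t
        return mu, sigma
```

This one-step policy has the same calling convention as the rollout sketch of Sect. 3.1, so it can be dropped into that loop directly.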

Fig. 4.

The prior penalizes positions corresponding to obstacles (white: high cost, black: low cost). The demonstrated expert trajectory is shown in each scene.

“Field”: The Linear model ignores M: it has no environment perception. We designed a CNN model that takes in M and outputs \(O \in \mathbb R^{H_\mathrm {map} \times W_\mathrm {map} \times 6}\). The 6 channels in O are used to form the 6 components of \(\mu ^\pi _t\) and \(S_t\) in the following way. To ensure differentiability, the values in O are bilinearly interpolated in the spatial dimensions (\(H_\mathrm {map}\) and \(W_\mathrm {map}\)) of O at the most recent rollout position \(x_{t-1}\); a sketch of this read-out follows.
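A sketch of the differentiable bilinear read-out (illustrative; it assumes the rollout position has already been mapped into continuous map coordinates, and omits boundary handling):

```python
import torch

def bilinear_read(O, pos):
    """Differentiably interpolate the CNN output map O (H_map x W_map x 6) at a continuous
    (row, col) position; gradients flow to both O and pos."""
    r0, c0 = int(pos[0].floor()), int(pos[1].floor())
    dr, dc = pos[0] - r0, pos[1] - c0
    return ((1 - dr) * (1 - dc) * O[r0, c0] + (1 - dr) * dc * O[r0, c0 + 1]
            + dr * (1 - dc) * O[r0 + 1, c0] + dr * dc * O[r0 + 1, c0 + 1])
```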

“RNN”: The Linear and Field models reason with different contextual inputs: Linear uses the past, and Field uses the feature map M. We developed a joint model to reason with both. M is passed through a CNN similar to Field’s. The past is encoded with a GRU-RNN. Both featurizations inform a GRU-RNN that produces \(\mu ^\pi _t\), \(S_t\). See Fig. 5 and the supplementary material for details.

3.3 GAIL and Differentiable GAIL

As a deep generative approach to imitation learning, our method is comparable to Generative Adversarial Imitation Learning (GAIL [16]). GAIL is model-free: it is agnostic to model dynamics. However, this flexibility requires an expensive model-free policy gradient method, whereas the approach we have proposed is fully differentiable. The model-free approach is significantly disadvantaged in sample complexity [19, 32] in theory and practice. By assuming the dynamics are known and differentiable, as described in Sect. 3.1, we can also derive a version of GAIL that does not require model-free RL, since we can apply the reparameterization trick to differentiate the generator objective with respect to the policy parameters. A similar idea was explored for general imitation learning in [6]. We refer to this method as R2P2 GAIL. As our experiments show, R2P2 GAIL significantly outperforms standard GAIL, and our main model (R2P2) significantly outperforms and is easier to train than both GAIL and R2P2 GAIL.

Fig. 5.

RNN and CNN policy models. The Field model produces a map of values from which \(\mu ^\pi ,\sigma ^\pi \) are obtained through interpolation. The RNN model uses the same base as the Field model, together with information from the past trajectory, decoding the featurized context representation and previous state into the next \(\mu ^\pi ,\sigma ^\pi \).

4 Experiments

We implemented R2P2 and baselines with the primary aim of testing the following hypotheses. (1) The ability to exactly evaluate the model PDF should help R2P2 obtain better solutions than methods that do not use exact PDF inference (which includes GAIL). (2) The optimization of \(H(p, q_\theta )\) should be correlated with the model’s ability to cover the training data, in analogy to recall in binary classification. (3) Including \(H(q_\theta , \tilde{p})\) in our objective should improve sample quality relative to methods without this term, as it serves a purpose analogous to precision in binary classification. (4) R2P2 GAIL will outperform GAIL through its more efficient optimization scheme.

4.1 The CaliForecasting Dataset

Current public datasets such as Kitti are suboptimal for validating these hypotheses. Kitti is relatively small and was not designed with forecasting in mind. It contains relatively few episodes of subjectively interesting, nonlinear behavior. For this reason, we collected a novel dataset specifically designed for the ego-motion forecasting task, which we make public. The data is similar to Kitti in sensor modalities, but it was collected so as to maximize the number of intersections, turns, and other subjectively interesting episodes. The data was collected with a sensor platform consisting of a Ford Transit Connect van with two Point Grey Flea3 cameras mounted on the roof in a wide-baseline configuration, in addition to a roof-mounted Velodyne VLP16 LIDAR unit and an IMU. The initial version of the dataset consists of three continuous driving sequences, each about one hour long, collected in mostly suburban areas of northern California (USA). The data was post-processed to produce a collection of episodes in the previously described format. The overhead feature map was populated by pretraining a semantic segmentation network [39], evaluating it on the sequences, correlating the results with the LIDAR point cloud, and binning the resulting semantic segmentation scores along with a height-above-ground-plane feature. With a subsampling scheme of 2 Hz, CaliForecasting consists of over 10,000 training, 1,200 validation, and 1,200 testing examples. The Kitti splits, in comparison, are about 3,100 training, 140 validation, and slightly fewer than 500 test examples with a subsampling scheme of 1 Hz.

Fig. 6.

Possible objectives and their attributes. \(\min _\theta H(p,q_\theta )\) encourages data coverage, \(\min _\theta H(q_\theta ,\tilde{p})\) penalizes bad samples. Measuring mean squared error is misleading when the data is multimodal, and measuring mean squared error of the best sample fails to measure quality of samples far from the demonstrations.

4.2 Metrics and Baselines

Metrics. Our primary metrics are the cross-entropy distribution metrics \(H(p, q_\theta )\) and \(H(q_\theta , \tilde{p})\). Note that \(H(p, q_\theta )\) is lower-bounded by the entropy of p, H(p), by Gibbs’ inequality. Subtracting this quantity (computing the KL divergence) would be ideal; unfortunately, since H(p) is unknown, we simply report \(H(p, q_\theta )\). We also note that cross-entropy is not coordinate-invariant: we use path coordinates in an ego-centric frame that is a rotation and translation away from UTM coordinates (in meters) and report cross-entropy values for path distributions in this frame.

A subtle related issue is that \(H(p, q_\theta )\) may be unbounded below, since H(p) may be arbitrarily negative. This phenomenon arises when the support of p is restricted to a submanifold: for example, if \(x_1 - x_2 = b\) for every \(x \sim p\), then a model can achieve arbitrarily low values of \(H(p, q_\theta )\) by concentrating its density around that subspace. We resolve this by slightly perturbing training and testing samples from p: i.e., instead of computing \(H(p, q_\theta )\), we compute \(-\mathbb E_{\eta \sim \mathcal N(0,\epsilon I)} \mathbb E_{x \sim p} \log q(x + \eta )\) for \(\epsilon =0.001\). This is lower-bounded by \(H(\mathcal N(0, \epsilon I))\), which resolves the issue.
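A small sketch of this perturbed evaluation (illustrative; `log_q(x, x_past, M)` is an assumed model log-density evaluator, e.g. a partial application of the Sect. 3.1 sketch):

```python
import torch

def perturbed_nll(log_q, episodes, eps=0.001, n_noise=1):
    """Estimate -E_{eta ~ N(0, eps I)} E_{x ~ p} log q(x + eta) over a set of test episodes."""
    total, count = 0.0, 0
    for x, x_past, M in episodes:
        for _ in range(n_noise):
            eta = eps ** 0.5 * torch.randn_like(x)    # eta ~ N(0, eps I): std = sqrt(eps)
            total = total - log_q(x + eta, x_past, M)
            count += 1
    return total / count
```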

We include two commonly used sample metrics [3, 8, 15, 25, 37], despite the shortcomings illustrated in Fig. 6. We measure the quality of the “best” of K samples \(\hat{X} = \{\hat{x}_1, \dots , \hat{x}_K\}\) drawn from \(q_\theta \), relative to the demonstrated sample x, via \(\mathbb E_{\hat{X} \sim q_\theta }\min _{\hat{x} \in \hat{X}}\Vert x - \hat{x} \Vert ^2\) (known as “minMSD”). This metric fails to measure the quality of all of the samples, and thus can be exploited by an approach whose samples are mostly poor. Additionally, we measure the mean distance to the demonstration over all samples in \(\hat{X}\): \(\frac{1}{K}\sum _{k=1}^K\Vert x - \hat{x}_k \Vert ^2\) (known as “meanMSD”). This metric is misleading if the data is multimodal, as it rewards predicting the mean, as opposed to covering multiple outcomes. Due to the deficiencies of these common sample-based metrics for measuring the quality of multimodal predictions, we advocate supplementing them with the complementary cross-entropy metrics used in this work.
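Both sample metrics can be computed as in the following sketch (we assume the per-sample squared error is averaged over timesteps, one common convention):

```python
import numpy as np

def min_and_mean_msd(samples, x):
    """minMSD and meanMSD for K trajectory samples (K x T x 2) vs. the demonstration x (T x 2)."""
    sq = ((samples - x[None]) ** 2).sum(axis=-1).mean(axis=-1)   # per-sample MSD over the T steps
    return sq.min(), sq.mean()
```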

Baselines. We construct a simple unimodal baseline: given the context, the distribution of trajectories is given as a sequence of Gaussian distributions. We call this the Gaussian Direct Cross-Entropy (DCE-G). As discussed in Sect. 3.3, we apply Generative Adversarial Imitation Learning (GAIL), along with our modified GAIL framework, R2P2 GAIL. We constructed several variants of GAIL: with and without the (improved) Wasserstein-GAN [4, 14] parameterization, with and without our novel R2P2 GAIL formulation, and using the standard MLP discriminator versus a CNN-based discriminator with an architecture similar to the Field model (details in the supplementary). Conditional Variational Autoencoders (CVAEs) are a popular approach for modeling generative distributions conditioned on context. We follow the CVAE construction of [25] in our implementation. One key distinguishing factor is that CVAEs cannot perform exact inference by construction: given an arbitrary sample, a CVAE cannot produce a PDF value. Quantification of CVAE performance is thus required to be approximation-based or sample-based. Our approaches are implemented in TensorFlow [1]. Architectural details are given in the supplement.

Fig. 7.

Cross Trimodal Evaluation. Top: Qualitative results. Bottom: Quantitative results. A \(^*\) indicates R2P2, and a \(^\dagger \) indicates using a WGAN Discriminator.

4.3 Cross Trimodal Experiments

Our first set of experiments is designed to test the multimodal modeling capability of each approach in an easy domain. The contextual information is fixed: a single four-way intersection, along with three demonstrated outcomes: turning left, turning right, and going straight. Figure 7 shows qualitative and quantitative results. We see that several approaches fail to model multimodality well in this scenario. The models that can perform exact inference (all except CVAE) cover the modes with differing success, as measured by Test \(-H(p, q_\theta )\). We observe that the models minimizing \(H(p, q_\theta )\) cover the data well, supporting hypothesis 2 (coverage hypothesis), and outperform both GAIL approaches, supporting hypothesis 1 (exact inference hypothesis). We observe that R2P2 GAIL outperforms GAIL in this scenario, supporting hypothesis 4 (optimization hypothesis). We also note the failure of DCE-G: its unimodal model is too restrictive to cover the diverse demonstrated behavior.

Table 1. CaliForecasting and Kitti evaluation, \(K=12\)
Fig. 8.

CaliForecasting Results. Comparison of R2P2 RNN (middle-left), CVAE (middle-right), and R2P2 GAIL (right). Trajectory samples are overlaid on overhead LIDAR map, colored by height. Bottom two rows: Comparison of \(\beta =0\) (top) and \(\beta =0.1\) (bottom), overlaid on \(\tilde{p}\) cost map. The cost map improves sample quality. (Color figure online)

Fig. 9.

Comparison of using \(\beta \) on CaliForecasting test data. Top row: With \(\beta =0\), some trajectories are forecasted into obvious obstacles. Bottom row: With \(\beta \ne 0\), many forecasted trajectories do not hit obstacles.

4.4 CaliForecasting Experiments and KITTI Experiments

We conducted larger-scale experiments designed to test our hypotheses. First, we trained \(\tilde{p}\) on each dataset by the procedure described in Sect. 3.1. As discussed, our goal was to develop a simple model to minimize overfitting: we used a 3-layer fully convolutional NN. In the resulting spatial “cost” maps, we observe the model’s ability to perceive obstacles in its assignment of low cost to on-road regions and high cost to clearly visible obstacles (e.g., Fig. 4). We performed hyperparameter search for each method, and report the mean and standard error of the test-set metrics corresponding to each method’s best validation loss in Table 1. These results provide us with a rich set of observations. Of the three baselines, none failed catastrophically, with CVAE most often generating the cleanest samples. Across datasets and metrics, our approach achieves performance superior to the three baselines and our improved GAIL approach. By minimizing \(H(p, q_\theta )\), our approach attains higher Test \(-H(p, q_\theta )\) than all GAIL approaches, supporting the coverage and optimization hypotheses. We find that incorporating our prior with nonzero \(\beta \) supports hypothesis 3: our model architectures improve the quality of their samples as measured by Test \(-H(q_\theta , \tilde{p})\). We observe that our GAIL optimization approach yields higher Test \(-H(p, q_\theta )\), supporting hypothesis 4. We plot the means and standard errors of the minMSD metric as a function of K in Fig. 10 for all three datasets.

Fig. 10.

Test \(\min _k\) MSD vs. K on Cross, CaliForecasting, and Kitti.

We also find that qualitatively, our approach usually generates the best samples with diversity along multiple paths and precision in its tendency to avoid obstacles. Figure 8 illustrates results on our dataset for our method, CVAE, and our improved GAIL approach. Figure 9 illustrates qualitative examples for how incorporating nonzero \(\beta \) can improve sample quality.

5 Conclusions

This work has raised the previously under-appreciated issue of balancing diversity and precision in probabilistic trajectory forecasting. We have proposed training a policy to induce a simulated-outcome distribution that minimizes a symmetrized cross-entropy objective. The key technical step that made this possible was parameterizing the model distribution as the pushforward of a simple base distribution under the simulation operator. We noted the relationship of this method to deep generative models, and showed that part of our full model enhances an existing deep imitation learning method. Empirically, we demonstrated that the pushforward parameterization enables reliable optimization of the objective, and that the optimized model has the desired characteristics of both covering the training data and generating high-quality samples. Finally, we introduced a novel large-scale, real-world dataset designed specifically for the vehicle ego-motion forecasting problem.