
1 Introduction

We consider forecasting a vehicle’s trajectory (i.e., predicting future paths). Forecasts can be used to foresee and avoid dangerous scenarios, plan safe paths, and model driver behavior. Context from the environment informs prediction, e.g. a map populated with features from imagery and LIDAR. We would like to learn a context-conditioned distribution over spatiotemporal trajectories to represent the many possible outcomes of the vehicle’s future. With this distribution, we can perform inference tasks such as sampling a set of plausible paths, or assigning a likelihood to a particular observed path. Sampling suggests routes and visualizes the model; assigning likelihood helps measure the model’s quality.

Fig. 1.

Left: Natural image input. Middle: generated trajectories (red circles) and the true, expert future (blue squares) overlaid on a LIDAR map. Right: Generated trajectories respect an approximate prior, here a “cost function,” overlaid as a heatmap. Making the expert paths likely corresponds to \(\min _\pi H(p,q_\pi )\). Only producing likely paths corresponds to steering the trajectories away from unlikely territory via \(\min _\pi H(q_\pi ,\tilde{p})\). Doing both, i.e. producing most of the likely paths while mostly producing likely paths, corresponds to \(\min _\pi H(p,q_\pi ) + \beta H(q_\pi ,\tilde{p})\). (Color figure online)

Our key motivation is to learn a trajectory forecasting model that is simultaneously “diverse”—covering all the modes of the data distribution—and “precise” in the sense that it rarely generates bad trajectories, such as trajectories that intersect obstacles. Covering the modes ensures the model can generate samples similar to human behavior. High “precision” ensures the model rarely generates samples very different from human behavior, which is important when samples are used for a downstream task. Figure 1 contrasts a model trained only to cover modes with a model trained to both cover modes and generate good samples; the latter generates fewer samples that hit perceived obstacles. To these ends, we define our model \(q_\pi \) as the trajectory distribution induced by rolling out (simulating) a stochastic one-step policy \(\pi \) for T steps to produce a trajectory sample x, and we propose choosing \(\pi \) to minimize the following symmetrized cross-entropy objective, where \(\phi \) denotes the scene context:

$$\begin{aligned} \min _\pi \; H(p, q_\pi ) + \beta H(q_\pi , \tilde{p}). \end{aligned}$$
(1)

The \(H(p, q_\pi )\) term encourages the model \(q_\pi \) to cover all the modes of the distribution of true driver behavior p, by heavily penalizing \(q_\pi \) for assigning a low density to any observed example from p. However, \(H(p, q_\pi )\) is insensitive to samples from \(q_\pi \), so optimizing it alone can yield a model that generates some “low-quality” samples. The \(H(q_\pi , \tilde{p})\) term penalizes \(q_\pi \) for generating “low-quality” samples (where an approximate data density \(\tilde{p}\) is low). However, \(H(q_\pi , \tilde{p})\) is insensitive to whether \(q_\pi \) covers the modes of \(\tilde{p}\). Therefore, we optimize both terms simultaneously to collect their complementary benefits and mitigate their complementary shortcomings. This motivation is illustrated in Fig. 2. As the true density function p is unavailable, we cannot evaluate \(H(q_\pi , p)\). Instead, we substitute a learned approximation, \(\tilde{p}\), that is simple and visually interpretable as a “cost map.”

In this work, we advocate using the symmetrized cross-entropy metrics for both training and evaluation of trajectory forecasting methods. This is made feasible by viewing the distribution \(q_\pi \) as the pushforward of a base distribution under the function \(g_\pi \) that rolls out (simulates) a stochastic policy \(\pi \) (see Fig. 3b). This idea (also known as the reparameterization trick, [9, 22]) enables optimization of model-sample quality metrics such as \(H(q_\pi , \tilde{p})\) with SGD. Our representation also admits efficient, exact computation of \(H(p, q_\pi )\), even when the policy is a very complex function of context and past state, such as a CNN.

Fig. 2.

Illustration of the complementarity of cross-entropies \(H(p, q_\pi )\) (top) and \(H(q_\pi , p)\) (bottom). Dashed lines show past vehicle path. Light blue lines delineate samples from the data (expert) distribution p. Samples from the model \(q_\pi \) are depicted as red lines. Green areas represent obstacles (areas with low p). The left figure shows cross-entropy values for a reference model. Other figures show poor models and their effects on each metric. \(\epsilon \) is a very small nonnegative number. (Color figure online)

We present the following novel contributions: (1) we identify the diversity-precision trade-off of generative forecasting models and formulate a symmetrized cross-entropy training objective to address it; (2) we propose training a policy to induce a roll-out distribution that minimizes this objective; (3) we use the pushforward parameterization to render inference and learning in this model efficient; (4) we refine an existing deep imitation learning method (GAIL) based on our parameterization; (5) we illuminate deficiencies of previously used trajectory forecasting metrics; (6) we outperform state-of-the-art forecasting and imitation learning methods, including our improvements to GAIL; (7) we present CaliForecasting, a novel large-scale dataset designed specifically for vehicle ego-motion forecasting.

2 Related Work

Trajectory Forecasting work spans two primary domains: trajectories of vehicles and trajectories of people. The method of [26] predicts future trajectories of wide receivers from surveillance video. In [5, 23, 28, 50], future pedestrian trajectories are predicted from surveillance video. Deterministic vehicle predictions are produced in [18], and deterministic pedestrian trajectories in [3, 30, 34]. However, non-determinism is a key aspect of forecasting: the future is generally uncertain, with many plausible outcomes. While several approaches forecast distributions over trajectories [12, 25], global sample quality and likelihood have not been considered or measured, hindering performance evaluation.

Activity Forecasting is distinct from trajectory forecasting, as it predicts categorical activities. In [17, 24, 35, 36], future activities are predicted via classification-based approaches. In [33], a first-person camera wearer’s future goals are forecasted with Inverse Reinforcement Learning (IRL). IRL has been applied to predict and control robot, taxi, and pedestrian behavior [23, 31, 52].

Imitation Learning can be used to frame our problem: learn a model to mimic an agent’s behavior from a set of demonstrations [2]. One subtle difference is that in forecasting, we are not required to actually execute our plans in the real world. IRL is a form of imitation learning in which a reward function is learned to model demonstrated behavior. In the IRL method of [49], a cost map representation is used to plan vehicle trajectories. However, no time-profile is represented in the predictions, preventing use of time-profiled metrics and modeling. GAIL [16, 27] is also a form of IRL, yet its adversarial framework and policy optimization are difficult to tune and lead to slow convergence. By adding the assumption of model dynamics, we derive a new differentiable GAIL training approach, supplanting the noisy, inefficient policy gradient search procedure. We show this easier-to-train approach achieves better performance in our domain.

Image Forecasting methods generate full image or video representations of predictions, endowing their samples with interpretability. In [43,44,45], unsupervised models are learned to generate sequences and representations of future images. In [46], surveillance image predictions of vehicles are formed by smoothing a patch across the image. [42, 47] also predict future video frames with an intermediate pose prediction. In [10], predictions inform a robot’s behavior, and in [40], policy representations for imitation and reinforcement learning are guided by a future observation forecasting objective. In [7], image boundaries are predicted. One drawback of image-based forecasting methods is the difficulty of measurement, a drawback shared by many popular generative models.

Generative models have surged in popularity [9, 13, 14, 16, 25, 44, 51]. However, one major difficulty is performance evaluation. Most popular models are quantified through heuristics that attempt to measure the “quality” of model samples [25]. In image generation, the Inception score is a popular heuristic [38]. These heuristics fail to measure the learned distribution’s likelihood, the gold standard for evaluating probabilistic models. Notable exceptions include [9, 20], which also leverage invertible pushforward models to perform exact likelihood inference.

Fig. 3.

(a) Consider making trajectories inside the yellow region on the road likelier by increasing \(\log q_\pi (x)\) for the demonstration \(x \sim p\) inside the region. This is achieved by making an infinitesimal region around \(g_\pi ^{-1}(x)\) more likely under \(q_0\) by moving the region (yellow parallelogram, size proportional to \(\vert \mathrm {det} J_{g_\pi } \vert ^{-1}\)) towards a mode of \(q_0\) (here, the center of a Gaussian), and making the region bigger. Increasing \(\log \tilde{p}(x)\) for some sample \(x \sim q_\pi \) is equivalent to sampling a (red) point z from \(q_0\) and adjusting \(\pi \) so as to increase \(\log \tilde{p}(g_\pi (z; \phi ))\). (b) Pushing forward a base distribution to a trajectory distribution. (Color figure online)

3 Approach

We approach the forecasting problem from an imitation learning perspective, learning a policy (state-to-action mapping) \(\pi \) that mimics the actions of an expert in varying contexts. We are given a set of training episodes \(\{ (x,\phi )_n\}_{n=1}^{N}\), each consisting of a short vehicle trajectory and its context. In each episode \((x,\phi )_n\), \( x \in \mathbb R^{T \times 2}\) is a sequence of T two-dimensional future vehicle locations, and \(\phi \) is an associated set of side information. In our implementation, \(\phi \) contains the past path of the car and a feature grid derived from LIDAR and semantic segmentation class scores. The grid is centered on the vehicle’s position at \(t=0\) and is aligned with its heading.

Repeatedly applying the policy \(\pi \) from a start state with the context \(\phi \) results in a distribution \(q_\pi (x|\phi )\) over trajectories x, since our policy is stochastic. Similarly, the training set is drawn from a data distribution \(p(x|\phi )\). We therefore train \(\pi \) so as to minimize a divergence between \(q_\pi \) and p. This divergence consists of a weighted combination of the cross-entropies \(H(p, q_\pi )\) and \(H(q_\pi , \tilde{p})\). We describe forms of \(\tilde{p}\) precisely in Sect. 3.1; for now, conceptualize it as a distribution that assigns low likelihood to trajectories passing through obstacles. In the following, \(\varPhi \) denotes the distribution of ground-truth features:

$$\begin{aligned} \min _\pi \mathbb {E}_{\phi \sim \varPhi } \left[ - \mathbb {E}_{x \sim p(\cdot | \phi )} \log q_\pi (x | \phi ) - \beta \mathbb {E}_{x \sim q_\pi (\cdot | \phi )} \log \tilde{p}(x | \phi ) \right] . \end{aligned}$$
(2)

The motivation for this objective is illustrated in Fig. 2. The two terms are complementary. \(H(p, q_\pi )\) is intuitively similar to recall in binary classification, in that it is very sensitive to the model’s ability to produce all of the examples in the dataset, but is relatively insensitive to whether the model produces examples that are unlikely under the data. \(H(q_\pi , \tilde{p})\) is intuitively similar to precision, in that it is very sensitive to whether the model produces samples likely under \(\tilde{p}\), but is insensitive to whether \(q_\pi \) produces all of the samples in the dataset.

3.1 Pushforward Distribution Modeling

Optimizing Eq. (2) presents at least two challenges: we must be able to evaluate \(q_\pi (x|\phi )\) at arbitrary x in order to compute \(H(p,q_\pi )\), and we must be able to differentiate the expression \(\mathbb E_{x \sim q_\pi (\cdot | \phi )} \log \tilde{p}(x|\phi )\). We address these issues by constructing a learnable bijection \(g_\pi \) between samples from \(q_\pi \) and samples from a simple noise distribution \(q_0\), as illustrated in Fig. 3b; in our construction, the bijection is interpreted as a simulator mapping noise to simulated outcomes. This construction allows us to evaluate the required expressions and derivatives via the change-of-variables formula and the reparameterization trick.

Specifically, let \(g_\pi (z; \phi ) : \mathbb R^{T \times 2} \rightarrow \mathbb R^{T \times 2}\) be a simulator mapping noise sequences \(z \sim q_0\) and scene context \(\phi \) to forecasted outcomes x. Then the distribution of forecasted outcomes \(q_\pi (x|\phi )\) is fully determined by \(q_0\) and \(g_\pi \): this distribution, \(q_\pi \), is known as the pushforward of \(q_0\) under \(g_\pi \) in measure theory. If \(g_\pi \) is differentiable and invertible (\(z=g^{-1}_\pi (x; \phi )\)), then \(q_\pi \) is obtained by the change-of-variables formula for multivariate integration:

$$\begin{aligned} q_\pi (x | \phi ) = q_0\big (g_\pi ^{-1}(x; \phi )\big )\big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |^{-1}, \end{aligned}$$
(3)

where \(J_{g_\pi }(g_\pi ^{-1}(x;\phi ))\) is the Jacobian of \(g_\pi \) evaluated at \(g_\pi ^{-1}(x; \phi )\). This resolves both of the aforementioned issues: we can evaluate \(q_\pi \) and we can rewrite \(\mathbb E_{x \sim q_\pi } \log \tilde{p}(x)\) as \(\mathbb E_{z \sim q_0} \log \tilde{p}(g_\pi (z; \phi ))\), since \(g_\pi (z; \phi ) \sim q_\pi \). The latter allows us to move derivatives w.r.t. \(\pi \) inside the expectation, as \(q_0\) does not depend on \(\pi \). Figure 3a illustrates how this aids learning. Equation (2) can then be rewritten as:

$$\begin{aligned} \min _\pi - \mathop {{}\mathbb {E}}_{\phi \sim \varPhi }&\mathop {{}\mathbb {E}}_{x \sim p(\cdot | \phi )} \log \frac{q_0(g_\pi ^{-1}(x; \phi ))}{\big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |} - \beta \mathop {{}\mathbb {E}}_{z \sim q_0} \log \tilde{p}(g_\pi (z; \phi ) | \phi ). \end{aligned}$$
(4)

We note ours is not the only way to represent \(q_\pi \) and optimize Eq. (2). As long as \(q_\pi \) is analytically differentiable in the parameters, we may also apply REINFORCE [48] to obtain the required parameter derivatives. However, empirical evidence and some theoretical analysis suggest that the reparameterization-based gradient estimator typically yields lower-variance gradient estimates than REINFORCE [11]. This is consistent with the results we obtained in Sect. 4.
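To make the variance comparison concrete, the following toy sketch (our own illustration, not part of the paper's implementation) estimates \(\nabla _\theta \mathbb E_{x \sim q_\theta } f(x)\) for a one-dimensional Gaussian \(q_\theta = \mathcal N(\theta , 1)\) with both estimators; the function f stands in for \(\log \tilde{p}\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                          # stand-in for log p~(x): peaked at x = 2
    return -(x - 2.0) ** 2

def grad_reparam(theta, n=1000):
    # Reparameterize x = theta + z with z ~ N(0, 1), then differentiate f directly:
    # d/dtheta f(theta + z) = f'(theta + z).
    x = theta + rng.standard_normal(n)
    return np.mean(-2.0 * (x - 2.0))

def grad_reinforce(theta, n=1000):
    # REINFORCE: d/dtheta E[f(x)] = E[f(x) * d/dtheta log q_theta(x)], score = (x - theta).
    x = theta + rng.standard_normal(n)
    return np.mean(f(x) * (x - theta))

# Both estimate the true gradient (4.0 at theta = 0); the reparameterized estimate
# typically exhibits much lower variance across repeated runs.
print(grad_reparam(0.0), grad_reinforce(0.0))
```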

An Invertible, Differentiable Simulator. In order to exploit the pushforward density formula (3), we must ensure \(g_\pi \) is invertible and differentiable. Inspired by [9, 21], we define \(g_\pi \) as an autoregressive map, representing the evolution of a controlled, discrete-time stochastic dynamical system with additive noise. Denoting \([x_1,\dots ,x_{t-1}]\) as \(x_{1:t-1}\), and \([x_{1:t-1}, \phi ]\) as \(\psi _t\), the system is:

$$\begin{aligned} x_t \triangleq \mu ^\pi _t(\psi _t; \theta ) + \sigma ^\pi _t(\psi _t; \theta ) z_t, \end{aligned}$$
(5)

where \(\mu ^\pi _t(\psi _t; \theta ) \in \mathbb R^2\) and \(\sigma ^\pi _t(\psi _t; \theta ) \in \mathbb R^{2\times 2}\) represent the stochastic one-step policy, and \(\theta \) its parameters. The context, \(\phi \), is given in the form of a past trajectory \(x_\mathrm {past}=x_{-H_\mathrm {past}+1:0} \in \mathbb R^{2H_\mathrm {past}}\), and overhead feature map \(M \in \mathbb R^{H_\mathrm {map} \times W_\mathrm {map} \times C}\): \(\phi = (x_\mathrm {past}, M)\). Note that the case \(\sigma ^\pi = 0\) would correspond to simply evolving the state by repeatedly applying \(\mu ^\pi \)—though this case is not allowed, as then \(g_\pi \) would not be invertible. However, as long as \(\sigma ^\pi _t\) is invertible for all x, then \(g_\pi \) is invertible, and it is differentiable in x as long as \(\mu ^\pi \) and \(\sigma ^\pi \) are differentiable in x. Since \(x_{\tau _1}\) is not a function of \(x_{\tau _2}\) for \(\tau _1 < \tau _2\), the determinant of the Jacobian of this map is easily computed, because it is triangular (see supplement). Thus, we can easily compute terms in Eq. 4 via the following:

$$\begin{aligned}{}[g_\pi ^{-1}(x)]_t = z_t = \sigma ^\pi _t(\psi _t; \theta )^{-1}(x_t - \mu ^\pi _t(\psi _t; \theta )), \end{aligned}$$
(6)
$$\begin{aligned} \log \big |\mathrm {det}~ J_{g_\pi }(g_\pi ^{-1}(x;\phi )) \big |= \sum _t \log \big \vert \mathrm {det} \big (\sigma ^\pi _t(\psi _t; \theta )\big )\big \vert . \end{aligned}$$
(7)

We note that \(q_\pi \) can also be computed via the chain rule of probability. For instance, if \(z_t \sim \mathcal N(0, I)\) is standard normal, then the per-step conditional distributions are

$$\begin{aligned} q_\pi (x_t| \psi _t) = \mathcal N(x_t; \mu = \mu ^\pi _t(\psi _t; \theta ), \varSigma = \sigma ^\pi _t(\psi _t; \theta )\sigma ^\pi _t(\psi _t; \theta )^\top ). \end{aligned}$$
(8)

However, since it is still necessary to compute \(g_\pi \) in order to optimize \(H(q_\pi , \tilde{p})\), we find it simplifies the implementation to compute \(q_\pi \) in terms of \(g_\pi \).
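As a concrete illustration of this parameterization, the sketch below implements the pushforward rollout (Eq. 5), exact likelihood evaluation via Eqs. (3), (6), and (7), and a one-episode estimate of the objective (Eq. 4). It is a minimal PyTorch sketch under our own simplifying assumptions (the paper's implementation is in TensorFlow, Sect. 4.2): `policy` is any callable returning \(\mu ^\pi _t\) and \(\sigma ^\pi _t\) from the history and the map M, and `log_p_tilde` is an assumed callable evaluating \(\log \tilde{p}\).

```python
import math
import torch

def rollout(policy, x_past, M, z):
    """g_pi: push noise z in R^{T x 2} forward to a trajectory x in R^{T x 2} (Eq. 5)."""
    hist, out = list(x_past), []
    for t in range(z.shape[0]):
        mu_t, sigma_t = policy(hist, M)               # one-step policy: mu_t (2,), sigma_t (2, 2)
        x_t = mu_t + sigma_t @ z[t]
        hist.append(x_t)
        out.append(x_t)
    return torch.stack(out)

def log_q(policy, x_past, M, x):
    """Exact log q_pi(x | phi) via the change-of-variables formula (Eq. 3), with z_t ~ N(0, I_2)."""
    hist, log_q0, log_det = list(x_past), 0.0, 0.0
    for t in range(x.shape[0]):
        mu_t, sigma_t = policy(hist, M)
        z_t = torch.linalg.solve(sigma_t, x[t] - mu_t)                  # Eq. (6)
        log_q0 = log_q0 - 0.5 * (z_t @ z_t) - math.log(2 * math.pi)     # log q0(z_t), standard normal
        log_det = log_det + torch.linalg.slogdet(sigma_t).logabsdet     # Eq. (7)
        hist.append(x[t])
    return log_q0 - log_det

def symmetrized_loss(policy, x_past, M, x_expert, log_p_tilde, beta):
    """One-episode, one-sample estimate of the objective in Eq. (4)."""
    z = torch.randn_like(x_expert)
    x_sample = rollout(policy, x_past, M, z)          # x ~ q_pi via the pushforward
    return -log_q(policy, x_past, M, x_expert) - beta * log_p_tilde(x_sample, M)
```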

Prior Approximation of the Data Distribution. Evaluating \(H(q_\pi , p)\) directly is unfortunately impossible, since we cannot evaluate the data distribution p’s PDF. We therefore propose approximating it with a very simple density estimator \(\tilde{p} \approx p\) trained independently and then fixed while training \(q_\pi \). Simplicity reduces sample-induced variance in fitting \(\tilde{p}\)—crucial, because if \(\tilde{p}\) severely underestimates p in some region R due to sampling error, then \(H(q_\pi , \tilde{p})\) will erroneously assign a large penalty to samples from \(q_\pi \) landing in R.

We consider two options for \(\tilde{p}\). The first is simply a kernel density estimator with a relatively large bandwidth. Since we have only one training sample per episode, this reduces to a single-kernel model. Choosing an isotropic Gaussian kernel, \(H(q_\pi , \tilde{p})\) becomes, up to constants, \(\mathbb E_{\hat{x} \sim q_\pi (\cdot | \phi )} \Vert x - \hat{x} \Vert ^2/\sigma ^2\), where \((x, \phi )\) constitutes an episode from the data. The net objective (2) in this case corresponds to \(H(p, q_\pi )\) plus a mean squared distance penalty between model samples and data samples.

The second possibility is making an i.i.d. approximation; i.e., parameterizing \(\tilde{p}\) as \(\tilde{p}(x \mid \phi ) = \prod _t \tilde{p}_c(x_t \mid \phi )\). We proceed by discretizing \(x_t\) in a large finite region centered at the vehicle’s start location; \(\tilde{p}_c\) then corresponds to a categorical distribution with L classes representing the L possible locations. Training the i.i.d. model can then be reduced to training \(\tilde{p}_c\) via logistic regression:

$$\begin{aligned} \min _{\tilde{p}} - \mathbb E_{x \sim p} \log \tilde{p}(x) = \max _{\theta } \mathbb E_{x \sim p} \sum _t -C_\theta (x_t, \phi ) - \log \sum _{y=1}^L \exp -C_\theta (y, \phi ), \end{aligned}$$
(9)

where \(C_\theta = -\log \tilde{p}_c\) can be thought of as a spatial cost function with parameters \(\theta \). We found it useful to decompose \(C_\theta (y)\) as a sum \(C^0_\theta (y) + C^1_\theta (y, \phi )\), where \(C^0_\theta \in \mathbb R^L\) is thought of as a non-contextual location prior, and \(C^1_\theta (y, \phi )\) has the form of a convolutional neural network acting on the spatial feature grid in \(\phi \) and producing a grid of scores \(\in \mathbb R^L\). Figure 4 shows example learned \(C^1_\theta (\cdot , \phi )\).
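As a sketch of this training procedure (illustrative only; `cost_net` is an assumed CNN mapping the feature grid to an L-channel score map, and the grid-cell indices of the expert positions are assumed precomputed), the per-episode loss of Eq. (9) can be written as:

```python
import torch
import torch.nn.functional as F

def prior_nll(C0, cost_net, M, x_cells):
    """Per-episode negative log-likelihood of the i.i.d. prior (Eq. 9).

    C0:       (L,) learnable non-contextual location prior C^0_theta
    cost_net: CNN mapping the feature grid M to contextual costs C^1_theta over the L cells
    x_cells:  (T,) integer grid-cell index of each future position x_t
    """
    cost = C0 + cost_net(M).reshape(-1)         # C_theta(y, phi) = C^0(y) + C^1(y, phi) for all L cells
    log_p_c = F.log_softmax(-cost, dim=0)       # log p~_c(y | phi) = -C(y) - log sum_y' exp(-C(y'))
    return -log_p_c[x_cells].sum()              # sum over timesteps (i.i.d. factorization)
```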

3.2 Policy Modeling

We turn to designing learnable functions \(\mu ^\pi _t\) and \(\sigma ^\pi _t\). Across our three models, we use the following expansion: \(\mu ^\pi _t(\psi _t) = 2x_{t-1} - x_{t-2} + \hat{\mu }^\pi _t(\psi _t)\). The first terms correspond to a constant-velocity step (\(x_{t-1} + (x_{t-1} - x_{t-2})\)), and let us interpret \(\hat{\mu }^\pi _t\) as a deterministic acceleration. Altogether, the update equation (Eq. 5) mimics Verlet integration [41], used to integrate Newton’s equations of motion.

“Linear”: The simplest model uses \(\hat{\mu }^\pi _t, S_t\) linear in \(\psi _t\):

$$\begin{aligned} \hat{\mu }^\pi _t(\psi _t)&= Ah_{t} + b_0, \quad S_t(\psi _t) = Bh_{t} + b_1, \end{aligned}$$
(10)

with \(A \in \mathbb R^{2 \times 2H}\), \(h_{t} = x_{t-H:t-1} \in \mathbb R^{2H}\), \(B \in \mathbb R^{4 \times 2H}\), \(b_0 \in \mathbb R^{2}\), \(b_1 \in \mathbb R^{4}\), and \(S_t(\psi _t) \in \mathbb R^{2 \times 2}\) obtained by reshaping \(Bh_t + b_1\). To produce a positive-definite \(\sigma ^\pi _t\), we use the matrix exponential [29]: \(\sigma ^\pi _t = \text {expm}(S_t + S_t^\top )\), which we found to optimize more efficiently than \(\sigma ^\pi _t = S_tS_t^\top \).
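A minimal sketch of this Verlet-style Linear policy follows (our own illustrative shapes and initialization; it assumes the history already contains at least \(H \ge 2\) past positions, seeded from \(x_\mathrm {past}\)):

```python
import torch

class LinearPolicy(torch.nn.Module):
    """One-step "Linear" policy (Eq. 10) with the Verlet-style mean expansion of Sect. 3.2."""
    def __init__(self, H=3):
        super().__init__()
        self.H = H
        self.A  = torch.nn.Parameter(1e-3 * torch.randn(2, 2 * H))
        self.b0 = torch.nn.Parameter(torch.zeros(2))
        self.B  = torch.nn.Parameter(1e-3 * torch.randn(4, 2 * H))
        self.b1 = torch.nn.Parameter(torch.zeros(4))

    def forward(self, hist, M=None):             # M is ignored: the Linear model has no perception
        h = torch.stack(list(hist[-self.H:])).reshape(-1)   # h_t = x_{t-H:t-1}, flattened to (2H,)
        mu_hat = self.A @ h + self.b0                        # deterministic "acceleration"
        mu = 2 * hist[-1] - hist[-2] + mu_hat                # constant-velocity (Verlet) step
        S = (self.B @ h + self.b1).reshape(2, 2)
        sigma = torch.matrix_exp(S + S.T)                    # symmetric positive-definite sigma_t
        return mu, sigma
```

This one-step policy has the same calling convention as the rollout sketch of Sect. 3.1, so it can be dropped into that loop directly.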

Fig. 4.

The prior penalizes positions corresponding to obstacles (white: high cost, black: low cost). The demonstrated expert trajectory is shown in each scene.

“Field”: The Linear model ignores M: it has no environment perception. We designed a CNN model that takes in M and outputs \(O \in \mathbb R^{H_\mathrm {map} \times W_\mathrm {map} \times 6}\). The 6 channels in O are used to form the 6 components of \(\mu ^\pi _t\) and \(S_t\) in the following way. To ensure differentiability, the values in O are bilinearly interpolated in the spatial dimensions (\(H_\mathrm {map}\) and \(W_\mathrm {map}\)) of O at the most recent rollout position \(x_{t-1}\); a sketch of this read-out follows.
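A sketch of the differentiable bilinear read-out (illustrative; it assumes the rollout position has already been mapped into continuous map coordinates, and omits boundary handling):

```python
import torch

def bilinear_read(O, pos):
    """Differentiably interpolate the CNN output map O (H_map x W_map x 6) at a continuous
    (row, col) position; gradients flow to both O and pos."""
    r0, c0 = int(pos[0].floor()), int(pos[1].floor())
    dr, dc = pos[0] - r0, pos[1] - c0
    return ((1 - dr) * (1 - dc) * O[r0, c0] + (1 - dr) * dc * O[r0, c0 + 1]
            + dr * (1 - dc) * O[r0 + 1, c0] + dr * dc * O[r0 + 1, c0 + 1])
```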

“RNN”: The Linear and Field models reason with different contextual inputs: Linear uses the past, and Field uses the feature map M. We developed a joint model to reason with both. M is passed through a CNN similar to Field’s. The past is encoded with a GRU-RNN. Both featurizations inform a GRU-RNN that produces \(\mu ^\pi _t\), \(S_t\). See Fig. 5 and the supplementary material for details.

3.3 GAIL and Differentiable GAIL

As a deep generative approach to imitation learning, our method is comparable to Generative Adversarial Imitation Learning (GAIL [16]). GAIL is model-free: it is agnostic to model dynamics. However, this flexibility requires an expensive model-free policy gradient method, whereas the approach we have proposed is fully differentiable. The model-free approach is significantly disadvantaged in sample complexity [19, 32] in theory and practice. By assuming the dynamics are known and differentiable, as described in Sect. 3.1, we can also derive a version of GAIL that does not require model-free RL, since we can apply the reparameterization trick to differentiate the generator objective with respect to the policy parameters. A similar idea was explored for general imitation learning in [6]. We refer to this method as R2P2 GAIL. As our experiments show, R2P2 GAIL significantly outperforms standard GAIL, and our main model (R2P2) significantly outperforms and is easier to train than both GAIL and R2P2 GAIL.

Fig. 5.

RNN and CNN policy models. The Field model produces a map of values from which \(\mu ^\pi ,\sigma ^\pi \) are obtained through interpolation. The RNN model uses the same base as the Field model, together with information from the past trajectory, decoding the featurized context representation and previous state into the next \(\mu ^\pi ,\sigma ^\pi \).

4 Experiments

We implemented R2P2 and baselines with the primary aim of testing the following hypotheses. (1) The ability to exactly evaluate the model PDF should help R2P2 obtain better solutions than methods that do not use exact PDF inference (which includes GAIL). (2) The optimization of \(H(p, q_\theta )\) should be correlated with the model’s ability to cover the training data, in analogy to recall in binary classification. (3) Including \(H(q_\theta , \tilde{p})\) in our objective should improve sample quality relative to methods without this term, as it serves a purpose analogous to precision in binary classification. (4) R2P2 GAIL will outperform GAIL through its more efficient optimization scheme.

4.1 The CaliForecasting Dataset

Current public datasets such as Kitti are suboptimal for validating these hypotheses. Kitti is relatively small and was not designed with forecasting in mind. It contains relatively few episodes of subjectively interesting, nonlinear behavior. For this reason, we collected a novel dataset specifically designed for the ego-motion forecasting task, which we make public. The data is similar to Kitti in sensor modalities, but it was collected so as to maximize the number of intersections, turns, and other subjectively interesting episodes. The data was collected with a sensor platform consisting of a Ford Transit Connect van with two Point Grey Flea3 cameras mounted on the roof in a wide-baseline configuration, in addition to a roof-mounted Velodyne VLP16 LIDAR unit and an IMU. The initial version of the dataset consists of three continuous driving sequences, each about one hour long, collected in mostly suburban areas of northern California (USA). The data was post-processed to produce a collection of episodes in the previously described format. The overhead feature map was populated by pretraining a semantic segmentation network [39], evaluating it on the sequences, correlating the results with the LIDAR point cloud, and binning the resulting semantic segmentation scores along with a height-above-ground-plane feature. With a subsampling scheme of 2 Hz, CaliForecasting consists of over 10,000 training, 1,200 validation, and 1,200 testing examples. The Kitti splits, in comparison, are about 3,100 training, 140 validation, and slightly fewer than 500 test examples with a subsampling scheme of 1 Hz.

Fig. 6.

Possible objectives and their attributes. \(\min _\theta H(p,q_\theta )\) encourages data coverage, \(\min _\theta H(q_\theta ,\tilde{p})\) penalizes bad samples. Measuring mean squared error is misleading when the data is multimodal, and measuring mean squared error of the best sample fails to measure quality of samples far from the demonstrations.

4.2 Metrics and Baselines

Metrics. Our primary metrics are the cross-entropy distribution metrics \(H(p, q_\theta )\) and \(H(q_\theta , \tilde{p})\). Note that \(H(p, q_\theta )\) is lower-bounded by the entropy of p, H(p), by Gibbs’ inequality. Subtracting this quantity (computing the KL divergence) would be ideal; unfortunately, since H(p) is unknown, we simply report \(H(p, q_\theta )\). We also note that cross-entropy is not coordinate-invariant: we use path coordinates in an ego-centric frame that is a rotation and translation away from UTM coordinates (in meters) and report cross-entropy values for path distributions in this frame.

A subtle related issue is that \(H(p, q_\theta )\) may be unbounded below, since H(p) may be arbitrarily negative. This phenomenon arises when the support of p is restricted to a submanifold: for example, if \(x_1 - x_2 = b\) for every \(x \sim p\), then a model can achieve arbitrarily low values of \(H(p, q_\theta )\) by concentrating its density around that subspace. We resolve this by slightly perturbing training and testing samples from p: i.e., instead of computing \(H(p, q_\theta )\), we compute \(-\mathbb E_{\eta \sim \mathcal N(0,\epsilon I)} \mathbb E_{x \sim p} \log q(x + \eta )\) for \(\epsilon =0.001\). This is lower-bounded by \(H(\mathcal N(0, \epsilon I))\), which resolves the issue.
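A small sketch of this perturbed evaluation (illustrative; `log_q(x, x_past, M)` is an assumed model log-density evaluator, e.g. a partial application of the Sect. 3.1 sketch):

```python
import torch

def perturbed_nll(log_q, episodes, eps=0.001, n_noise=1):
    """Estimate -E_{eta ~ N(0, eps I)} E_{x ~ p} log q(x + eta) over a set of test episodes."""
    total, count = 0.0, 0
    for x, x_past, M in episodes:
        for _ in range(n_noise):
            eta = eps ** 0.5 * torch.randn_like(x)    # eta ~ N(0, eps I): std = sqrt(eps)
            total = total - log_q(x + eta, x_past, M)
            count += 1
    return total / count
```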

We include two commonly used sample metrics [3, 8, 15, 25, 37], despite the shortcomings illustrated in Fig. 6. We measure the quality of the “best” of K samples \(\hat{X} = \{\hat{x}_1, \dots , \hat{x}_K\}\) drawn from \(q_\theta \), relative to the demonstrated sample x, via \(\mathbb E_{\hat{X} \sim q_\theta }\min _{\hat{x} \in \hat{X}}\Vert x - \hat{x} \Vert ^2\) (known as “minMSD”). This metric fails to measure the quality of all of the samples, and thus can be exploited by an approach whose samples are mostly poor. Additionally, we measure the mean distance to the demonstration over all samples in \(\hat{X}\): \(\frac{1}{K}\sum _{k=1}^K\Vert x - \hat{x}_k \Vert ^2\) (known as “meanMSD”). This metric is misleading if the data is multimodal, as it rewards predicting the mean, as opposed to covering multiple outcomes. Due to the deficiencies of these common sample-based metrics for measuring the quality of multimodal predictions, we advocate supplementing them with the complementary cross-entropy metrics used in this work.
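Both sample metrics can be computed as in the following sketch (we assume the per-sample squared error is averaged over timesteps, one common convention):

```python
import numpy as np

def min_and_mean_msd(samples, x):
    """minMSD and meanMSD for K trajectory samples (K x T x 2) vs. the demonstration x (T x 2)."""
    sq = ((samples - x[None]) ** 2).sum(axis=-1).mean(axis=-1)   # per-sample MSD over the T steps
    return sq.min(), sq.mean()
```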

Baselines. We construct a simple unimodal baseline: given the context, the distribution of trajectories is given as a sequence of Gaussian distributions. We call this the Gaussian Direct Cross-Entropy (DCE-G). As discussed in Sect. 3.3, we apply Generative Adversarial Imitation Learning (GAIL), along with our modified GAIL framework, R2P2 GAIL. We constructed several variants of GAIL: with and without the (improved) Wasserstein-GAN [4, 14] parameterization, with and without our novel R2P2 GAIL formulation, and using the standard MLP discriminator versus a CNN-based discriminator with an architecture similar to the Field model (details in the supplementary). Conditional Variational Autoencoders (CVAEs) are a popular approach for modeling generative distributions conditioned on context. We follow the CVAE construction of [25] in our implementation. One key distinguishing factor is that CVAEs cannot perform exact inference by construction: given an arbitrary sample, a CVAE cannot produce a PDF value. Quantification of CVAE performance is thus required to be approximation-based or sample-based. Our approaches are implemented in TensorFlow [1]. Architectural details are given in the supplement.

Fig. 7.

Cross Trimodal Evaluation. Top: Qualitative results. Bottom: Quantitative results. A \(^*\) indicates R2P2, and a \(^\dagger \) indicates using a WGAN Discriminator.

4.3 Cross Trimodal Experiments

Our first set of experiments is designed to test the multimodal modeling capability of each approach in an easy domain. The contextual information is fixed: a single four-way intersection, along with three demonstrated outcomes: turning left, turning right, and going straight. Figure 7 shows qualitative and quantitative results. We see that several approaches fail to model multimodality well in this scenario. The models that can perform exact inference (all except CVAE) cover the modes with differing success, as measured by Test \(-H(p, q_\theta )\). We observe that the models minimizing \(H(p, q_\theta )\) cover the data well, supporting hypothesis 2 (coverage hypothesis), and outperform both GAIL approaches, supporting hypothesis 1 (exact inference hypothesis). We observe that R2P2 GAIL outperforms GAIL in this scenario, supporting hypothesis 4 (optimization hypothesis). We also note the failure of DCE-G: its unimodal model is too restrictive to cover the diverse demonstrated behavior.

Table 1. CaliForecasting and Kitti evaluation, \(K=12\)
Fig. 8.

CaliForecasting Results. Comparison of R2P2 RNN (middle-left), CVAE (middle-right), and R2P2 GAIL (right). Trajectory samples are overlaid on overhead LIDAR map, colored by height. Bottom two rows: Comparison of \(\beta =0\) (top) and \(\beta =0.1\) (bottom), overlaid on \(\tilde{p}\) cost map. The cost map improves sample quality. (Color figure online)

Fig. 9.

Comparison of using \(\beta \) on CaliForecasting test data. Top row: With \(\beta =0\), some trajectories are forecasted into obvious obstacles. Bottom row: With \(\beta \ne 0\), many forecasted trajectories do not hit obstacles.

4.4 CaliForecasting Experiments and KITTI Experiments

We conducted larger-scale experiments designed to test our hypotheses. First, we trained \(\tilde{p}\) on each dataset by the procedure described in Sect. 3.1. As discussed, our goal was to develop a simple model to minimize overfitting: we used a 3-layer fully convolutional NN. In the resulting spatial “cost” maps, we observe the model’s ability to perceive obstacles in its assignment of low cost to on-road regions and high cost to clearly visible obstacles (e.g., Fig. 4). We performed hyperparameter search for each method, and report the mean and standard error of the test-set metrics corresponding to each method’s best validation loss in Table 1. These results provide us with a rich set of observations. Of the three baselines, none failed catastrophically, with CVAE most often generating the cleanest samples. Across datasets and metrics, our approach achieves performance superior to the three baselines and our improved GAIL approach. By minimizing \(H(p, q_\theta )\), our approach attains higher Test \(-H(p, q_\theta )\) than all GAIL approaches, supporting the coverage and optimization hypotheses. We find that incorporating our prior with nonzero \(\beta \) supports hypothesis 3: our model architectures improve the quality of their samples as measured by Test \(-H(q_\theta , \tilde{p})\). We observe that our GAIL optimization approach yields higher Test \(-H(p, q_\theta )\), supporting hypothesis 4. We plot the means and standard errors of the minMSD metric as a function of K in Fig. 10 for all three datasets.

Fig. 10.

Test \(\min _k\) MSD vs. K on Cross, CaliForecasting, and Kitti.

We also find that qualitatively, our approach usually generates the best samples with diversity along multiple paths and precision in its tendency to avoid obstacles. Figure 8 illustrates results on our dataset for our method, CVAE, and our improved GAIL approach. Figure 9 illustrates qualitative examples for how incorporating nonzero \(\beta \) can improve sample quality.

5 Conclusions

This work has raised the previously under-appreciated issue of balancing diversity and precision in probabilistic trajectory forecasting. We have proposed training a policy to induce a simulated-outcome distribution that minimizes a symmetrized cross-entropy objective. The key technical step that made this possible was parameterizing the model distribution as the pushforward of a simple base distribution under the simulation operator. We noted the relationship of this method to deep generative models, and showed that part of our full model enhances an existing deep imitation learning method. Empirically, we demonstrated that the pushforward parameterization enables reliable optimization of the objective, and that the optimized model has the desired characteristics of both covering the training data and generating high-quality samples. Finally, we introduced a novel large-scale, real-world dataset designed specifically for the vehicle ego-motion forecasting problem.