1 Introduction

Path prediction is the task of estimating the path, or trajectory, along which a target (e.g., a pedestrian or vehicle) will move. Predicting paths from video is an important task that has received much attention because of its many potential applications, such as surveillance camera analysis, self-driving cars, and autonomous robot navigation.

Path prediction requires estimating much more information, such as the surrounding environment and the moving direction and status of the prediction target, than simpler image recognition tasks. As a result, prediction methods are often built on top of other computer vision tasks, such as pedestrian detection [1, 2], pedestrian attribute recognition [3], and semantic segmentation [4]. Moreover, in the prediction task, future observations of predicted paths are not available. In pedestrian detection and tracking, observations from the past to the present are used to locate and track the target in the current frame of the video. In contrast, the prediction task localizes the target in future frames of the video, using observations made up to the present time, prior information on the surrounding environment, and knowledge of the target's motion.

Path prediction has been studied for decades in the field of robotics. At stations and airports, robots need to move without interfering with the many people present [5] and to plan an efficient path of motion through the environment. Path prediction is necessary to achieve such tasks. However, in addition to information from cameras, robots are able to use information from many types of sensor, such as LIDAR sensors, to obtain the three-dimensional (3D) geometry of the scene. The environment in which the robot can move around is sometimes explicitly given as an environment map. The present survey covers path prediction methods that take video only as input, treating the problem as a computer vision task.

A related task, called early recognition, predicts future human actions in video. It is excluded from this survey because the predicted categories are discrete, whereas predicted paths are sequences of continuous locations.

Fig. 1. Overview of path prediction, modified from [6].

As path prediction in the field of computer vision is a difficult and challenging task, various methods have been proposed. A common approach is shown in Fig. 1. As input, a video (or a frame of video) is given in addition to the location of the target in the current frame or a sequence of locations over the past several seconds. Features useful for prediction are then extracted from the video (or frames) to predict the path in future frames. There are two important parts in Fig. 1: (b) feature extraction, in which many features are extracted to understand the environment and the target; and (c) path prediction, for which a variety of methods have been proposed, categorized here into four types.

In this paper, we survey path prediction methods taking video as input and systematically summarize feature extraction and prediction approaches and datasets used for evaluation. We explain feature extraction methods in Sect. 2 and categorize prediction methods in Sect. 3. In Sect. 4, we review datasets used in evaluating the performance of path prediction. We conclude the survey in Sect. 5.

Table 1. Categories of feature extraction for path prediction.

2 Feature Extraction from a Video

This section introduces methods of feature extraction from video for path prediction. The path that a pedestrian takes is implicitly affected by many factors of the surrounding environment and by the status of the pedestrian himself or herself. The performance of path prediction is expected to improve when using information that largely determines how the pedestrian decides which way to go. Given the video, such information is extracted prior to the prediction. Table 1 presents the information extracted from video for path prediction. Such information can be broadly categorized into that of (1) the environment and (2) the target.

2.1 Environmental Features

A pedestrian decides which way to go and walks along a path while being affected by the surrounding environment. For example, we usually walk along the sidewalk while avoiding obstacles on the way (e.g., parked cars and trash cans) and drive a car on the roadway, as is common social practice. The movement of the target is dynamically affected by the environment, and environmental features are therefore extracted from the video.

Semantic segmentation [18,19,20,21], the task of assigning an object class to each pixel, is the most common approach to understanding the environment in the field of computer vision. It can be used to estimate where obstacles exist in the scene and which regions are available for walking. Kitani et al. [18] assumed that pedestrian paths are mainly affected by the physical environment, such as sidewalks, roadways, flower beds, and buildings, and predicted the posterior probability of each label using hierarchical segmentation [7], as shown in Fig. 2. These probabilities are used as feature vectors to form scene feature maps, which are in turn used for path prediction. Rehder et al. [20] used segmentation results obtained with a fully convolutional network [9, 10] for prediction.
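As a rough illustration of this idea, scene feature maps can be thought of as per-class posterior probability maps stacked into a tensor, so that each pixel carries a feature vector of class likelihoods. The minimal sketch below assumes hypothetical class names and a generic segmenter producing the posteriors; it is not the hierarchical segmentation pipeline of [7].

```python
import numpy as np

def scene_feature_maps(class_posteriors):
    """Stack per-class posterior maps into a scene feature tensor.

    class_posteriors: dict mapping a class name (e.g., 'sidewalk')
    to an (H, W) array of per-pixel posterior probabilities.
    Returns an (H, W, C) tensor; the feature vector at each pixel
    can then be fed to the path predictor.
    """
    names = sorted(class_posteriors)
    return np.stack([class_posteriors[c] for c in names], axis=-1)

# Example with hypothetical posteriors for a 4-class scene.
H, W = 120, 160
posteriors = {c: np.random.rand(H, W) for c in
              ['sidewalk', 'roadway', 'flower_bed', 'building']}
features = scene_feature_maps(posteriors)  # shape (120, 160, 4)
```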

Fig. 2. Examples of environmental attributes [18].

Alternative approaches do not explicitly use environmental features affecting paths but implicitly represent the probabilities of paths as cost (or reward) functions [11, 22]. These methods create cost maps of the entire scene from cost functions estimated independently for each superpixel. Walker et al. [22] retrieved training patches with similar texture using a nearest-neighbor approach, and assigned the costs of the training samples to superpixels to generate cost maps of the scene. Huang et al. [11] proposed a convolutional neural network (CNN) called the spatial matching network, which estimates the reward of each local region by comparing the similarity between the patch of the target and surrounding superpixel patches.

Yet another approach represents the scene as a single feature vector, whereas the above approaches extract local features from superpixels. Assuming that similar scenes prompt similar paths, this approach retrieves similar scenes from a training dataset using the feature vectors and predicts paths using the paths of the retrieved scenes. To this end, CNNs are usually used to extract scene feature vectors efficiently, owing to the recent success of deep learning architectures. In predicting paths in first-person video, Park et al. [23] used AlexNet [12] to extract features for scene retrieval, and transferred the paths of the retrieved scenes for prediction. Su et al. [24] used an AlexNet-based Siamese network [13] to extract features for retrieval.

Fig. 3. Estimation of head orientation [25]. (a) Detection of heads and bodies of pedestrians. (b) Estimation of the orientation of the head in eight directions.

2.2 Target Features

While environmental features strongly affect the target in terms of the path decision, internal factors of the target are also important. Specifically, attributes of the target, such as age, gender, and internal demand, affect the path decision. We herein introduce methods for extracting target features.

The most common target feature is the orientation of the target [11, 16, 25] because the estimated orientation can be used to predict in which direction the target is going. In other words, the orientation constrains the moving direction of the target and thus reduces errors of prediction. Kooij et al. [25] detected pedestrians using a histogram of oriented gradients (HOG) and a support vector machine (SVM) [14] and estimated the head orientation [15] to predict the path of a pedestrian in front of a car on which a camera was mounted, focusing on whether the pedestrian will stop before stepping forward onto the roadway, as shown in Fig. 3. If the head faces the camera, then the pedestrian is assumed to have noticed the car and is predicted to slow down or stop before the roadway.
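As a concrete, hedged example of the detection step, OpenCV ships a standard HOG descriptor with a pretrained linear-SVM people detector. The sketch below uses that stock pipeline rather than the exact detector of [25]; the input file name is hypothetical.

```python
import cv2

# OpenCV's built-in HOG descriptor with its default people detector,
# a linear SVM trained on pedestrian images.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread('frame.jpg')  # hypothetical video frame
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8),
                                      padding=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    # Draw each detected pedestrian's bounding box.
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```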

Physical attributes, such as age and gender, are also important to prediction. When walking in places where there are many people, pedestrians take actions to avoid colliding with each other. Aspects of such avoidance, namely when and where pedestrians start to avoid others, differ for pedestrians of different age and gender; e.g., a younger person walks faster and responds more rapidly to others than an older person. Wei et al. [16] used AlexNet to estimate the orientation, age, and gender of pedestrians via multi-task learning. The estimated attributes are used in deciding the walking speed of pedestrians.

Walker et al. [22] proposed unsupervised path prediction by extracting mid-level feature vectors directly from patches containing the target, instead of direct attributes.

Table 2. Categories of path prediction methods.

3 Prediction Methods

Path prediction follows feature extraction from video. Table 2 summarizes methods of prediction, categorized according to their approach. This section describes each category and its properties.

3.1 Bayesian Models

The first approach uses online Bayes filters, such as Kalman filters (KFs) and particle filters, and infers the model to predict paths. Such modeling introduces internal states and observations as variables, and defines probabilistic models by assuming that the observations are the internal states contaminated by noise. This approach iterates a prediction step, which computes the current internal states from the previous states, and an update step, which corrects the current states with the observations. In a common setup, internal states are actual coordinates of pedestrians, and observations are coordinates obtained by pedestrian detection. This amounts to person tracking if we apply the approach from the past to the present, and to path prediction if we repeat only the prediction step, without the update step, to obtain the sequence of future coordinates of the pedestrian; i.e., there are no future observations.
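To make the predict/update distinction concrete, the following sketch implements a standard constant-velocity Kalman filter: tracking alternates both steps over past detections, while prediction simply repeats the predict step. The noise covariances and the toy detections are illustrative assumptions, not values from any surveyed method.

```python
import numpy as np

dt = 1.0  # time step (frames)
# Constant-velocity model: state x = [px, py, vx, vy].
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # we observe position only
Q = 0.01 * np.eye(4)   # process noise (assumed)
R = 0.5 * np.eye(2)    # observation noise (assumed)

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Tracking: alternate predict and update with past detections;
# prediction: repeat the predict step alone (no future observations).
x, P = np.array([0., 0., 1., 0.5]), np.eye(4)
for z in [np.array([1.0, 0.6]), np.array([2.1, 1.0])]:  # past detections
    x, P = predict(x, P)
    x, P = update(x, P, z)
future = []
for _ in range(5):  # predict 5 frames ahead
    x, P = predict(x, P)
    future.append(x[:2].copy())
```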

Fig. 4. Graphical model of a DBN with an SLDS [25].

Schneider et al. [26] used the extended KF to update the internal state of a pedestrian in front of a car. This was early work on path prediction and showed what kind of primitive information (e.g., walking speed and acceleration) is useful for path prediction.

Instead of using online Bayes filters, some works have used a dynamic Bayesian network (DBN) [19, 25]. Kooij et al. [25] considered a more restricted case: estimating whether the pedestrian will walk across a roadway in front of a car on which a camera is mounted. They defined a DBN model with a switching linear dynamical system (SLDS), shown in Fig. 4, that uses features extracted from the video, such as the pedestrian's head orientation, the distance to the car, and the distance between the pedestrian and the roadway. This method performs better than using the coordinates of pedestrian detection alone.

3.2 Energy Minimization

The Bayesian approach described above is online: it estimates the coordinates of the pedestrian frame by frame in the video. Another (offline, or batch) approach is energy minimization, which estimates the entire sequence of coordinates at once. This approach constructs a two-dimensional grid graph of the scene, assigns moving costs to the edges of the graph, and then finds the combination of edges that gives the minimum energy. This is formulated as a shortest-path problem solved with Dijkstra's algorithm. The prediction accuracy is therefore largely determined by how the cost is defined.
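The sketch below shows the generic shortest-path machinery that these methods share: Dijkstra's algorithm on a 2D cost grid. The toy cost map is a placeholder for whichever learned cost function a particular method produces.

```python
import heapq

def shortest_path(cost, start, goal):
    """Dijkstra's algorithm on a 2D cost grid.

    cost[i][j] is the (non-negative) cost of entering cell (i, j);
    how these costs are defined is what differentiates the methods.
    """
    H, W = len(cost), len(cost[0])
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float('inf')):
            continue  # stale heap entry
        i, j = u
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            v = (i + di, j + dj)
            if 0 <= v[0] < H and 0 <= v[1] < W:
                nd = d + cost[v[0]][v[1]]
                if nd < dist.get(v, float('inf')):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
    # Reconstruct the predicted path from goal back to start.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return path[::-1]

grid = [[1, 1, 9], [1, 9, 1], [1, 1, 1]]  # toy cost map
print(shortest_path(grid, (0, 0), (2, 2)))
```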

Huang et al. [11] proposed a path prediction method using a single image. First, a patch containing the target is extracted to estimate the orientation of the target. Next, the cost of moving across the location of the patch is estimated by comparing the textures of surrounding patches. In addition to this cost, the estimated orientation of the target is used as a constraint and added to the edge weights. Walker et al. [22] compared the textures of superpixels using patches along the path traced by the target, without involving any training procedure.

Appearance information (texture) of the scene can be used to define the cost function, but objects in the scene can also be used. Xie et al. [27] assumed that pedestrians decide their goal (e.g., a food truck) according to their potential demands (e.g., hunger), and defined cost maps in which pedestrians are attracted to objects in the scene.

3.3 Deep Learning

Deep learning methods, such as those involving the CNN and long short-term memory (LSTM), have been used for path prediction since the emergence of deep learning frameworks. Methods of this type take as input the series of coordinates of the target over the last several frames, and produce a series of target coordinates for several successive future frames. Feature extraction, described in the previous section, is not performed as a separate step, because deep learning models do not explicitly separate feature extraction from prediction.

Several methods that use LSTM to handle paths, which are sequences of two-dimensional coordinates, have thus been proposed. Alahi et al. [29] proposed the social-pooling (S-pooling) layer for avoiding collisions between pedestrians. Each pedestrian is represented by an LSTM, and the hidden-layer outputs of the LSTMs of other people are connected to the S-pooling layer of the pedestrian. This layer allows the LSTM of the pedestrian to represent spatial relationships with nearby people (e.g., the distance to each other), and thus to predict a path that avoids collisions.
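A rough sketch of the pooling idea follows: the hidden states of neighbors are summed into the cells of a grid centered on the pedestrian, and the flattened tensor is fed to that pedestrian's LSTM. The grid size, cell size, and shapes here are illustrative assumptions and not the settings of [29].

```python
import numpy as np

def social_pooling(pos_i, neighbor_pos, neighbor_h, grid=4, cell=2.0):
    """Simplified S-pooling tensor for pedestrian i.

    Sums the LSTM hidden states of neighbors falling in each cell of
    a grid x grid neighborhood centered on pedestrian i.
    """
    D = neighbor_h.shape[1]
    pool = np.zeros((grid, grid, D))
    for p, h in zip(neighbor_pos, neighbor_h):
        gx = int((p[0] - pos_i[0]) / cell + grid / 2)
        gy = int((p[1] - pos_i[1]) / cell + grid / 2)
        if 0 <= gx < grid and 0 <= gy < grid:
            pool[gx, gy] += h  # neighbors in the same cell share a slot
    return pool.reshape(-1)    # fed to pedestrian i's LSTM input

# One nearby neighbor and one distant (ignored) neighbor.
h = social_pooling(np.array([0., 0.]),
                   np.array([[1.5, -0.5], [10., 10.]]),
                   np.random.randn(2, 32))
```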

LSTM has a limitation of long-term memory; i.e., paths in the distant future are difficult to predict. Fernando et al. [31] argued that more elaborate long-term memory is necessary, and proposed the tree memory network, which hierarchically selects useful information of the past stored in memory cells and performs better than other LSTM models.

Besides LSTM, the CNN is also used to directly make predictions. Yi et al. [28] proposed the behavior-CNN, which predicts the future path from the past path. This method first creates sparse three-dimensional data whose channels store the two-dimensional coordinates of pedestrians over the last several frames. The sparse 3D data are encoded using convolution and pooling layers and then decoded using deconvolution layers. The authors also added location bias maps to each channel of the encoded information to account for different behaviors at different locations in the scene, such as the locations of entrances and obstacles.
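A minimal sketch of this kind of sparse input encoding is given below; it places each pedestrian's normalized past coordinates in the channels at his or her current location. The exact encoding of [28] differs in detail, so treat this as an assumption-laden illustration.

```python
import numpy as np

def encode_walkers(paths, H, W):
    """Sparse 3D input in the spirit of the behavior-CNN [28].

    paths: (N, T, 2) integer pixel coordinates of N pedestrians over
    the last T frames. At each pedestrian's current location, the 2T
    channels store that pedestrian's normalized past coordinates.
    """
    N, T, _ = paths.shape
    vol = np.zeros((H, W, 2 * T))
    scale = np.array([H, W], dtype=float)
    for n in range(N):
        cy, cx = paths[n, -1]                        # current location
        vol[cy, cx, :] = (paths[n] / scale).reshape(-1)
    return vol

paths = np.array([[[10, 12], [11, 13], [12, 14]]])  # one pedestrian, T=3
x = encode_walkers(paths, H=64, W=64)               # shape (64, 64, 6)
```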

Fig. 5. Overview of RL, modified from [38].

3.4 Inverse Reinforcement Learning

The three approaches above are examples of supervised or unsupervised learning, while the approach presented here is an example of reinforcement learning (RL). RL learns a policy that decides the actions to be taken by an agent given the current state of an environment. RL is usually formulated as a Markov decision process, and the optimal policy allows the agent to take the actions maximizing the reward. As Fig. 5 shows, in path prediction the agent is the prediction target, the environment is the scene given as video, the state is the pedestrian's location, and an action is a movement of the pedestrian.

RL needs to define the reward of the action of moving from one state to another, which indicates how good the action taken by the agent is. However, it is difficult to explicitly define the reward function for practical problems such as path prediction. This problem is called the reward design problem, and inverse reinforcement learning (IRL) is one approach taken to solve it. IRL estimates rewards that reproduce optimal sequences of actions, and decides the actions of the agent in the test phase with the estimated reward so that the agent takes similar actions.

IRL has been used to learn and control the optimal motion of robots [5]. Kitani et al. [18] first introduced IRL to vision-based path prediction. Instead of estimating target locations, they estimated the actions that the agent may take at a certain time and location, and predicted possible paths by sequentially applying the estimated actions to the current target location. This task is therefore called activity forecasting, in contrast to path prediction, which directly estimates future locations of the target. Activity forecasting is a more complex and challenging task than path prediction, but it has great potential in terms of offering a variety of predictions adapted to each possible application.

Kitani et al. [18] assumed that the physical attributes of a scene strongly affect pedestrian paths, and used scene attributes estimated by semantic segmentation as feature maps. The reward of each scene attribute is defined as the inner product of the feature maps and weight vectors, and the optimal weights are estimated from training data. For prediction, a sequence of actions arriving at a predefined goal is generated given the goal and the current location of the target pedestrian. Lee et al. [32] used a similar approach to predict the paths of football players in game video. Wei et al. [16] introduced fictitious play, a concept from game theory, to predict the paths of multiple pedestrians who each arrive at a goal while avoiding collisions with one another.
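To illustrate the mechanics, the toy sketch below computes a reward map as the inner product of scene feature maps and a weight vector, propagates value toward a given goal with value iteration, and greedily rolls out a path from the current location. The learning of the weights (e.g., by maximum-entropy IRL as in [18]) is omitted and assumed done; all shapes and values are illustrative.

```python
import numpy as np

def predict_path(feature_maps, w, start, goal, iters=300):
    """Toy reward-based path prediction in the spirit of [18].

    feature_maps: (H, W, C) scene features; w: (C,) assumed learned
    weights. Rewards are assumed strictly negative (costs), so
    undiscounted value iteration amounts to a shortest-path
    computation and the greedy rollout terminates at the goal.
    """
    H, W, _ = feature_maps.shape
    reward = feature_maps @ w                  # (H, W) reward map
    V = np.full((H, W), -1e9)
    V[goal] = 0.0
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(iters):                     # value iteration
        for i in range(H):
            for j in range(W):
                if (i, j) == goal:
                    continue
                best = max(V[i + di, j + dj] for di, dj in moves
                           if 0 <= i + di < H and 0 <= j + dj < W)
                V[i, j] = reward[i, j] + best
    path, cur = [start], start
    while cur != goal:                         # greedy rollout
        i, j = cur
        cur = max(((i + di, j + dj) for di, dj in moves
                   if 0 <= i + di < H and 0 <= j + dj < W),
                  key=lambda v: V[v])
        path.append(cur)
    return path

fmap = np.random.rand(12, 12, 4)               # hypothetical feature maps
w = np.array([-1.0, -0.5, -2.0, -0.2])         # assumed learned weights
print(predict_path(fmap, w, start=(0, 0), goal=(11, 11)))
```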

Without any predefined goals, Rehder et al. [20] proposed the destination network to estimate the goal of the target using the last several frames. The estimated goal and the environmental attributes obtained using a fully convolutional network were used to predict pedestrian paths.

For first-person vision, Bokhari et al. [33] used objects held by a person and the object states to predict goals in the future. While this work considered a limited scene (e.g., a kitchen), Rhinehart et al. [34] dealt with wider areas, such as a home including a kitchen, bathroom, and living room.

3.5 Other Approaches

Most prediction methods can be categorized into one of the four approaches described above, but there are other approaches.

Fig. 6. Prediction from first-person videos; (left) [23], (right) [24].

The social force model [39] assumes energy called a "social force" that acts between pedestrians and objects in the scene, and generates pedestrian movement through interaction via the force. Yamaguchi et al. [37] proposed a model with additional states, such as pedestrian preferences, walking speeds, goals, and the existence of other people walking together. This work was motivated by a desire to improve the accuracy of pedestrian tracking, but performed path prediction to evaluate the proposed model. Robicquet et al. [38] proposed social forces of multiple classes for avoiding collisions. They estimated "social sensitivity features" using the distances to other people, and applied K-means clustering of the features to obtain several clusters of avoidance behaviors. The cluster of the target's avoidance behavior was estimated using the target's feature, and the paths of the cluster were then projected back to the scene for prediction.
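A minimal sketch of the basic social force update follows: a driving force pulls the pedestrian toward the goal at a preferred speed, while exponentially decaying forces repel him or her from nearby pedestrians. All parameter values are illustrative assumptions rather than those of [39], [37], or [38].

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.1,
                      v0=1.3, tau=0.5, A=2.0, B=0.3):
    """One integration step of a basic social force model.

    v0: preferred speed, tau: relaxation time, A/B: repulsion
    strength and range (all assumed values).
    """
    e = (goal - pos) / np.linalg.norm(goal - pos)
    force = (v0 * e - vel) / tau            # driving force toward goal
    for q in others:                        # repulsion from others
        d = pos - q
        r = np.linalg.norm(d)
        if r > 1e-6:
            force += A * np.exp(-r / B) * d / r
    vel = vel + dt * force
    pos = pos + dt * vel
    return pos, vel

p, v = np.array([0., 0.]), np.array([1., 0.])
p, v = social_force_step(p, v, goal=np.array([10., 0.]),
                         others=[np.array([1., 0.2])])
```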

Optical flow extracted from car-mounted cameras was used by Keller et al. [35] to predict pedestrian paths. They used optical flow over the last several frames and computed orientation histograms as motion features of pedestrians. The sequence of histograms was used to retrieve similar scenes in the training set, and the paths of the retrieved scenes were then mapped back to the scene for prediction.
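A simplified stand-in for such motion features is sketched below: dense Farneback optical flow between two frames, reduced to a magnitude-weighted histogram of flow orientations. This is an assumption-level approximation of the features of [35], not their exact pipeline.

```python
import cv2
import numpy as np

def flow_orientation_histogram(prev_gray, gray, bins=8):
    """Orientation histogram of dense optical flow between two frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    # Magnitude-weighted histogram over flow directions in [0, 2*pi).
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi),
                           weights=mag)
    return hist / (hist.sum() + 1e-8)
```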

The use of the Markov process framework was proposed by Rehder et al. [36]. They used normal and von Mises distributions to represent the state (location) and speed of the pedestrian, and sequentially estimated the state by taking products of these distributions at each time step for prediction. To improve accuracy, the goal of the pedestrian was estimated from environmental attributes to constrain the direction of motion.

The retrieval-based approach shown in Fig. 6 was proposed by Park et al. [23] to predict the future path in a video showing the first-person view. They first extracted scene features using AlexNet and then found similar scenes in the training set by comparing extracted features. Paths of retrieved training samples were mapped onto the video. They predicted paths even in scenes with occlusions by estimating regions behind occluding objects, such as walls and obstacles. Su et al. [24] extended this work to the prediction of multiple basketball players in a game scene. In one first-person video, they estimated the region of “joint attention” to which multiple players commonly paid attention. Multiple paths were predicted by selecting the optimal path of each player and by minimizing an objective function defined by the estimated joint attention region, locations of players, and paths projected back to the scene.

Fig. 7. Datasets and results of prediction, taken and modified from [16, 18, 21, 23, 25, 28, 29, 31, 34, 38].

4 Datasets

This section briefly introduces datasets used to evaluate path prediction methods. Various datasets have been used, as shown in Table 3 and Fig. 7. This diversity is due to the difficulty of using a single universal dataset under many different conditions, e.g., different numbers of scenes and paths needed for learning and different types of scenes. We therefore group the datasets into four categories according to the camera viewpoint.

Table 3. Comparison of datasets.

4.1 Videos of Entire Scenes

The most commonly used type of dataset is video that captures the entire scene taken by a wide-angle camera (for surveillance) at stations and market places. These datasets are usually used to evaluate pedestrian tracking methods; however, they are also used in evaluating path prediction because sequences of pedestrian locations are given as the ground truth.

Top view

The UCY Dataset [40] and ETH Dataset [41] contain videos of pedestrians walking along streets where no other moving objects exist, which is a relatively simple situation compared with those of other datasets. The Edinburgh Informatics Forum Pedestrian Database [42] consists of videos of pedestrians walking on the campus of the University of Edinburgh, taken by a fixed camera. This dataset is large, with more than 90,000 paths.

The above datasets are constructed for pedestrian tracking and crowd behavior analysis, while the Stanford Drone Dataset [38] focuses on path prediction. This dataset has videos taken by drones flying at eight sites of Stanford University, and provides annotations of moving objects, such as cyclists, skateboarders, and cars, as well as pedestrians.

Surveillance

Videos in the datasets described above are taken from a top view, while videos in the datasets shown in Fig. 7(e, f) are taken from a bird’s eye view; i.e., the videos are taken by surveillance cameras looking downward at an angle. The physical attributes of pedestrians are observable in these videos and can be used for prediction. The VIRAT Video Dataset [6] contains videos taken by surveillance cameras at parking lots, and provides the locations of pedestrians, cars, and objects in the scene and labels of activities, such as getting into a car and opening a trunk. It contains 11 scenes, which is the largest number of scenes among the datasets of surveillance cameras in Table 3. The Town Centre Dataset [43] contains videos of pedestrians and provides bounding boxes of each pedestrian as well as labels of head locations of pedestrians.

The Grand Central Station Dataset [44] contains videos taken by a fixed camera mounted at a station, as shown in Fig. 7(g). It has a single scene but is complex owing to the many people appearing in and disappearing from the scene because the motivation is to analyze the behaviors of many pedestrians.

4.2 Car-Mounted Cameras

Datasets of videos taken by cameras mounted on vehicles are used because path prediction is studied with the aim of developing automated driving. In this case, cameras are mounted at the front of the car to look forward, and the main objective is to predict the paths of pedestrians in front of the car.

The Daimler Pedestrian Path Prediction Benchmark Dataset [26] consists of videos taken by car-mounted cameras. There are four classes of cases, including cases in which the pedestrian walks across the roadway and cases in which the pedestrian stops walking to avoid an accident. In addition to the videos themselves, depth information is available because the videos are taken by stereo cameras. There are relatively few pedestrians; however, the dataset contains videos that are rare in other datasets, such as videos of pedestrians crossing in front of moving cars.

The KITTI Vision Benchmark Suite [45] was constructed for intelligent transport systems and is used for various evaluations, such as the detection of pedestrians, vehicles, and white lines on the road. It contains not only RGB images but also stereo images, LIDAR 3D data, GPS locations, and street maps, and it is therefore useful for path prediction methods that use rich information to understand the environment.

4.3 First-Person View

Unlike videos of entire scenes or videos taken by car-mounted cameras for predicting the paths of targets in the scene, videos taken from the first-person view are used to predict the path of the person taking the video. Park et al. [23] used first-person videos taken by wearable cameras moving through indoor and outdoor environments of 26 different scenes, such as on a street and inside a store. Rhinehart et al. [34] collected first-person videos taken by a person walking around office environments and assumed that an object held by the person (e.g., a mug or towel) indicates where the person is going (e.g., the kitchen or bathroom).

5 Conclusions

We reviewed vision-based path prediction methods and common datasets. We first categorized feature extraction methods according to whether the extracted features describe the environment or the appearance and dynamics of the target. We then grouped prediction methods according to the approach taken. Bayesian methods define probabilistic models of the path and sequentially estimate internal states. Energy minimization methods define a two-dimensional grid graph by computing the cost of a pedestrian moving through each local region, and then solve the shortest-path problem. Deep learning methods take a series of locations of the target over the past several seconds and output a series of future locations. IRL uses the policy and reward estimated from training samples and then selects actions iteratively to produce a future path. These approaches are of course not exclusive and are often used in combination [21]. Finally, we summarized the datasets used in evaluating prediction methods. Some datasets are used for pedestrian detection and tracking, while others are designed for path prediction.