1 Introduction

Geophysical, societal, ecological, organizational, agro-food and biological systems are situated systems, whose behavior affects and is affected by the surrounding environment. Considering urban systems, mobility dynamics are inherently dependent on calendar status and affected by large-scale events, roadblocks, and meteorological conditions, impacting demand and transport mode preferences (Cerqueira et al. 2021). Medical emergency needs are another paradigmatic case with well-established links to the calendrical context, weather factors, and public events (Channouf et al. 2007; McCarthy et al. 2008; Kam et al. 2010; Wong and Lai 2012).

As a consequence, research on context-aware predictive models for time series forecasting has received significant attention in recent years (Bi et al. 2022; Jozi et al. 2022; Guiguet et al. 2018; Schürholz et al. 2020). In the neural processing field, relationships between target and context variables have been explicitly modeled via graph neural networks (Xu et al. 2021; Fang et al. 2021), as well as multivariate gating units in recurrent neural networks (Guiguet et al. 2018; Zhu et al. 2021; Ruan et al. 2020) and multi-source embeddings (Kamarthi et al. 2021; Huang et al. 2019). Despite their relevance, most of these works focus solely on the role of historical and static context, neglecting the important role of prospective situational context.

Sources of prospective context, such as planned events, calendrical information or weather forecasts (Sardinha et al. 2021), can arguably be positioned as pivotal in predictive tasks given their potential effect on systems’ behavior along the horizon of prediction. Nevertheless, significant challenges should be noted. First, available sources of prospective context are often structurally heterogeneous (Tiam-Lee et al. 2022; Cerqueira et al. 2021), e.g., calendrical data are recorded as georeferenced events while weather forecasts are recorded as multivariate time series. Second, prospective sources of context may not be accessible from structured sources, and the geographical and temporal footprint of prospective events may not be fully known a priori (Leite et al. 2020). Third, some sources of prospective context are incomplete, as well as susceptible to significant levels of noise. Considering weather forecasts, long-term forecasts may be absent, while short-term forecasts are subject to arbitrary uncertainty levels (Cerqueira et al. 2021). Fourth, associations between prospective context and forecasts are hypothesized to be more relevant than associations between prospective context and historical data, raising additional challenges to the development of effective neural processing principles and architectural choices.

This work proposes effective neural processing principles to incorporate both historical and prospective sources of situational context in time series forecasting models. To this end, two major contributions are drawn. First, a multiple-input neural architecture consisting of a sequential composition of long short-term memory (LSTM) networks is proposed to incorporate heterogeneous sources of context. While historical context is inputted as auxiliary covariates to identify and correct externalities in the forecasting task, prospective context is incorporated at a later layering stage to act as a denoiser of the forecasted series. Second, masking principles are drawn for the normative processing and integration of available context.

The inherent simplicity of the aforementioned principles is purposeful as our primary aim is to show that minimalist and elegant principles - multi-input sequential layering and masking - are sufficient to yield statistically significant improvements in forecasting tasks augmented with prospective context. As the proposed multi-input framework is parameterizable, LSTMs can be easily replaced by stacked recurrent units (Li et al. 2021), multivariate graph convolutional networks (GCNs) (Rico et al. 2021) or deep temporal convolutional networks (TCNs) (Chen et al. 2020). Although statistically significant differences (\(\alpha = 10^{-3}\)) were not consistently observed with (regularized) deep layering, domain-specific adaptations are supported under the proposed architectural principles.

The proposed principles are integrated within a methodology that is experimentally validated using the demand for Lisbon’s emergency medical services (EMS) as the primary case study, and Lisbon’s bike sharing system (BSS) demand as a complementary application. Results show that the incorporation of external context variables, including calendrical and weather variables, can significantly reduce predictive errors, providing compelling empirical evidence in favor of the proposed forecasters against state-of-the-art alternatives. In particular, the incorporation of prospective context is the primary driver of efficacy gains, and essential to mitigate error increase along the horizon of prediction.

The manuscript is structured as follows. Section 2 introduces essentials of (multivariate) time series forecasting, while Sect. 3 surveys related contributions on context-aware predictive modeling. Section 4 introduces the proposed multi-input neural processing principles. Section 5 experimentally assesses the proposed methodology in urban data domains, discussing the gathered results. Finally, implications and major concluding remarks are drawn in Sect. 6.

2 Background

Problem formulation. The behavior of systems can be subject to a form of sensorization. A time series is a sequence of observations, usually measured at equally spaced points in time, \(\textbf{y}_{1..T} = (y_1,\ldots,y_T)\), where each observation, \(y_t\), recorded at a given time step t, is either univariate, \(y_t\in \mathcal {Y}\), or multivariate, \(\textbf{y}_t \in \mathcal {Y}_1 \times \cdots \times \mathcal {Y}_n\), depending on the number of monitored behaviors. Given a (multivariate) time series \(\textbf{y}_{1..T}\) and a target variable, \(\mathcal {Y}_k\), time series forecasting aims at estimating the measurements of the target variable along the next h steps, \((y_{T+1},\ldots,y_{T+h})\), where \(y_{T+t} \in \mathcal {Y}_k\) and h is the horizon of prediction. Multivariate forecasts, \(\textbf{y}_{T+t}\), can be complementarily pursued in the presence of multiple target variables.

Situational context can be further monitored by measuring auxiliary behavior of a given system or the properties of its environment. Such endogenous and exogenous features can be captured using dependent variables, well-established in the previous multivariate time series formulation, \(\textbf{y}_t \in \mathcal {Y}_1 \times \cdots \times \mathcal {Y}_n\). In addition to historical context, monitored along 1..T steps, prospective situational context, such as planned or forecasted events falling in the horizon of prediction, may be available. In this context, given a context-enriched multivariate time series, \(\textbf{y}_{1..T}\), and available prospective time series, \(\textbf{z}_{T+1..T+h}\), the forecasting task can be augmented to estimate the targets of a variable of interest along a given horizon, \((y_{T+1},\ldots,y_{T+h})\) where \(y_{T+t} \in \mathcal {Y}_k\), from both \(\textbf{y}_{1..T}\) and \(\textbf{z}_{T+1..T+h}\).

Context data consolidation. Context data can be acquired from unstructured, semi-structured or structured sources. Social media, public administration repositories, weather portals, online calendars, cultural agendas, theatre sites, and online news can be periodically explored with the aim of retrieving historical or prospective context. Despite the presence of principles to this end (Wibisono et al. 2012; Tempelmeier et al. 2019; Tang et al. 2019), context acquisition from unstructured sources is generally subject to uncertainties related to data quality and availability. As a result, municipalities and other entities have established efforts towards a more normative gathering and provision of (semi-)structured repositories with situational context (Lemonde et al. 2021). In the city of Lisbon, the reservation of public spaces, including stadiums, auditoriums, large halls, arenas, amphitheaters, amongst others, is thoroughly updated at the Lisboa Aberta portal, and can be periodically queried for context-aware urban data analytics (Leite et al. 2020).

Historical context may not always be effectively used as a proxy to infer prospective context data (Kuijpers et al. 2022), hence the relevance of sourcing available prospective context data. For instance, available weather forecasts by meteorology institutes may not be readily predicted from historical data acquired at meteorology stations as professional forecasts often rely on remote sensing inputs, atmospheric models, and background knowledge that may not be readily available. Complementarily, calendrical information is generally available in advance and not always predictable from historical calendars (e.g., moveable holidays). Similarly, prospective public events (e.g., large-scale concerts, symposiums, summits, sport matches) depend on extensive externalities and thus are hardly predictable from past event data.

Planned events in some context sources can be automatically annotated in accordance with their typology and duration (Cerqueira et al. 2021). The spatial extent, as well as the historical and prospective duration of some of these events (e.g., infrastructural interventions in the city) can be maintained in some of these repositories (Lemonde et al. 2021). In contrast, for public events without such information, rules can be dynamically inferred with expectations on the average event duration in accordance with its typology (Cerqueira et al. 2021). To illustrate, a concert can impact urban mobility from approximately 60 min before its start up to 40 min after its end. Context-specific deviations can be historically assessed in the presence of comparable events against expectations to determine the spatiotemporal footprint of an event in accordance with the principles placed by Cerqueira et al. (2021).
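The typology-based rule illustrated above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of Cerqueira et al. (2021): the function name `event_footprint_mask`, the rule table `FOOTPRINT_MIN`, the hourly grid, and the example dates are hypothetical, and only the 60 and 40 min concert margins are taken from the text.

```python
from datetime import datetime, timedelta

# Hypothetical rule table: minutes of impact (before start, after end) per typology.
# Only the concert margins come from the text; other typologies need their own rules.
FOOTPRINT_MIN = {"concert": (60, 40)}

def event_footprint_mask(start, end, typology, grid_start, n_hours):
    """Return a 0/1 list marking the hourly bins overlapped by the event footprint."""
    before, after = FOOTPRINT_MIN[typology]
    lo = start - timedelta(minutes=before)   # footprint opens before the event starts
    hi = end + timedelta(minutes=after)      # footprint closes after the event ends
    mask = []
    for i in range(n_hours):
        bin_start = grid_start + timedelta(hours=i)
        bin_end = bin_start + timedelta(hours=1)
        # A bin is marked when it overlaps the half-open interval [lo, hi).
        mask.append(1 if bin_start < hi and bin_end > lo else 0)
    return mask

# Hypothetical concert from 21:00 to 23:00, on a grid starting at 18:00.
mask = event_footprint_mask(
    start=datetime(2021, 6, 5, 21, 0),
    end=datetime(2021, 6, 5, 23, 0),
    typology="concert",
    grid_start=datetime(2021, 6, 5, 18, 0),
    n_hours=8,
)
```

Under these assumptions the footprint spans 20:00 to 23:40, so the four hourly bins from 20:00 to midnight are marked. A binary mask is the simplest choice; a graded impact measure could replace the 0/1 values.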

Context data, whether represented by events annotated with spatiotemporal footprints (e.g., gatherings) or by periodically collected/forecasted records (e.g., weather), can then be inputted to a learning system as-is or, alternatively, mapped onto a fused data structure more conducive to the subsequent learning needs. Principles for mapping georeferenced event sets as multivariate time series data structures have been explored in former works (Tiam-Lee et al. 2022; Neves et al. 2020).

Essentials on forecasting. Classical statistical methods, such as regression, estimate a given target, \(y_{T+t}\), from available pairs \((\textbf{x}^{(t)},y_t)\) where \(t\le T\) and \(\textbf{x}^{(t)}\) are features drawn from available data at time step t, i.e., \(\textbf{y}_{1..t}\), \(\textbf{z}_{t+1..T}\) and \(\textbf{z}_{T+1..T+h}\). Although regression methods are a natural context-aware candidate as \(\textbf{x}^{(t)}\) can capture available context at multiple periods, they generally neglect the rich temporal nature of the forecasting problem. Classic time series models are generally parametric descriptors of time series behavior. Series can be decomposed into major components, including trend and seasonality, then projected along the horizon of prediction to produce forecasts (Box et al. 2015). Complementarily, triple exponential smoothing, auto-regression, moving-average and differencing operations can be pursued to model and forecast non-stationary series (Holt 2004; Chatfield 2000). Although classic time series models have been extended to incorporate auxiliary variables, only historical covariates are taken into account to assist forecasting (Pfeffermann and Allon 1989; Szeto et al. 2009).

To deal with the complex non-linearities underlying the behavior of real-world systems, artificial neural networks (ANNs) are currently the paradigmatic option to forecast the behavior of such systems. Recurrent neural networks (RNNs) (Rumelhart et al. 1985) are a class of ANNs with feedback (loop) connections, where an output from the previous step is fed as input to the current step. Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber 1997) are specialized RNNs, developed to capture long-term serial dependencies (Fig. 1), thus natural candidates for time series forecasting. LSTMs, as well as gated recurrent units, are inherently prepared to process multivariate time series, being able to elegantly incorporate historical context variables to potentially aid forecasts.

Fig. 1 Composition of an LSTM memory cell

3 Related work

Context-aware forecasting. In the neural processing field, contributions for context-aware forecasting have been mainly propelled by the modeling of historical relationships between the target variable and auxiliary context variables. In this context, temporal convolution networks (TCN) (Ruan et al. 2020) and graph neural networks (Rico et al. 2021) are the paradigmatic options to capture spatial and temporal dependencies between the target series and the surrounding context. In social and recommendation domains, auxiliary context variables can encode correlated information from related users, objects, and evaluations. In urban domains, auxiliary context variables often correspond to measurements at different points in the city, and their spatial relationships are captured within a graph structure. Considering urban mobility as a guiding case, auxiliary context variables may correspond to traffic measured at different locations. STJLA (Fang et al. 2021) is a context-aware neural architecture that applies linear attention to the spatiotemporal joint graph to capture correlations between all nodes. A graph convolutional network component based on spatial adjacency and functional similarity with context variables has been recently proposed (Xu et al. 2021) to incorporate transportation supply, demographic profile, and historical weather data in scooter-sharing demand prediction. While graph convolutions capture forms of node dependence from spatial and functional similarity, the extracted associations at different time points are then inputted to gated recurrent units to capture the temporal dependencies that are at the basis of the targeted predictive task.

Leveraging on the inherent ability of recurrent neural networks to handle multivariate time series (i.e., cross-series training), Zhu et al. (2021) tackled product demand forecasting in pharmaceutical domains by seeing the demand of related products as auxiliary context variables and further incorporating available domain knowledge.

Motivated by tensor factorization for context-aware recommender systems, Bi et al. (2022) propose a latent factor approach to sales forecasting that leverages a single tensor factorization model across multiple products and stores. The results gathered from this recent work evidence the presence of synergistic principles between the targeted context-aware learning tasks and multi-source, multi-view, and multi-task learning (Zhang and Yang 2018; Ruder 2017). Probabilistic multi-view neural processing has been considered to learn intermediate representations from multiple data sources in forecasting tasks (Kamarthi et al. 2021). Classic series segmentation and decomposition principles have also proved useful to assist context-aware forecasts (Ruan et al. 2020).

Context-aware forecasting has been applied across multiple domains. In the energy domain, Jozi et al. (2022) extended consumption forecasting models with context variables such as energy generation, temperature, and occupancy from building sensors to aid energy management in buildings. Cuncu et al. (2022) further exploited the role of inhabitant activities and the use of household appliances to assist this forecasting task. In computer vision, the forecast of future activities from video data has been augmented using activity and scene context (Chakraborty and Roy-Chowdhury 2014). In procurement and logistics, qualitative expert opinion has been integrated into product forecasting tasks (Arvan et al. 2019). In trajectory prediction, road users and environments have been used to anticipate obstacles in complex driving scenarios (Schäfer et al. 2022), and human trajectories forecasted in crowded spaces considering the dynamics of other moving agents in the scene and static elements that might be perceived as points of attraction (Bartoli et al. 2018). Domain knowledge has been integrated into deep neural networks to aid traffic prediction in 5G networks (Garrido et al. 2021). Context-aware embedding modules were explored in recurrent neural networks to include discrete exogenous features when forecasting smart card validations in public transport (Guiguet et al. 2018), as well as in hierarchical networks for forecasting traffic accidents (Huang et al. 2019). Air pollution forecasts have been augmented with context information from both surrounding pollution sources (e.g., bushfire incidents, traffic volumes) and users’ health profiles (Schürholz et al. 2020).

Despite the inherent relevance of the aforementioned works to context-aware forecasting, they do not explore the role that sources of prospective context may play in the predictive tasks. Furthermore, the available contributions are not easily extensible towards this end since prospective context variables cannot be straightforwardly modeled as serial covariates or adjacent nodes in graph structures given their disjoint occurrence from historical variables.

Multi-input network layering. In multiple-input neural networks, the available measurements of a given system are partitioned in accordance with their inherent properties. Each partition is inputted into one or more components of the network for dedicated processing, and the partitions are later merged for joint processing.

Different architectural principles can be found in this regard, with parallel, sequential and hybrid processing being common options. Naglah et al. (2021) propose a parallel multi-input convolutional neural network (CNN) to perform fusion of two magnetic resonance imaging modalities (diffusion weighted image and apparent diffusion coefficient map) so as to enable independent convolution processes for each modality, which can increase the likelihood of detecting deep texture patterns. Oktay et al. (2016) further explored parallel multi-input CNNs by exploring dedicated processing paths for different viewing planes of three-dimensional cardiac imaging for morphology analysis. Related contributions can be found, including the exploration of implicit spectral-spatial information in hyperspectral images for feature extraction ends (Zhong et al. 2022), multi-input analysis of different medical exams for COVID-19 diagnosis (Zhang et al. 2021), or the exploitation of high degrees of correlation and complementary information among neighboring tomography images for denoising ends (Abbasi et al. 2019). Based on the U-Net model, Shi et al. (2021) propose a multi-input fusion network based on the extraction and fusion of imaging features at different input resolution scales.

In sequential multi-input networks, some of the available measurements are integrated at later stages in the neural processing pipeline. Sánchez-Cauce et al. (2021) propose a multi-input network for cancer diagnosis where former layers are used to process imaging data for the extraction of features while later layers further receive available complementary clinical and demographic data. Similar principles are explored by Apostolopoulos et al. (2021) for cardiovascular disease diagnosis using myocardial perfusion imaging and clinical data.

Hybrid architectural variants are also available. Wang et al. (2021) considered parallel processing of time-domain signals, frequency-domain signals, and time-frequency graph inputs for fault diagnosis. These isolated processing paths are connected using fully connected layers to process the previously processed features with additionally inputted bearing and damage features.

In the context of forecasting tasks, multiple-input networks have been used to combine temporal and static data, as well as to mitigate the issues related to generalization and meteorological effects (Madhiarasan et al. 2021). Xiong et al. (2021) considered ground motion sequences and building features as heterogeneous inputs to a multiple-input convolutional neural network for seismic damage assessment. Despite the relevance of available work, the absence of multi-input neural processing principles to standardly learn from heterogeneous sources of context data is notable.

4 Multiple-input context-aware neural networks

To aid forecasting tasks in the presence of historical and prospective sources of context, this section proposes a simple yet effective multiple-input neural network architecture (Sect. 4.1), and further establishes masking principles for the effective incorporation of available context (Sects. 4.2 and 4.3).

4.1 Context-aware neural networks

The proposed architecture, schematized in Fig. 2, is a sequential composition of two components, \(C_1\) and \(C_2\), each composed of a default LSTM cell with 16 units followed by a dense layer. Considering available historical data, the \(C_1\) component takes the context-enriched multivariate series as input, \(\textbf{y}_{1..T}\), and returns a forecasted series as the output.

Fig. 2 RNN architecture able to incorporate historical and prospective external context variables through multiple inputs

Considering available prospective context, the \(C_2\) component takes as input the forecasted series from \(C_1\), \(\hat{y}_{T+1..T+h\mid C_1}\), and prospective sources of context along the horizon of prediction, \(\textbf{z}_{T+1..T+h}\), returning an adjusted forecasted series, \(\hat{y}_{T+1..T+h\mid C_2}\).
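In compact form, and writing \(C_1\) and \(C_2\) as functions (a notational convenience assumed here for illustration), the sequential composition described above reads:

\[
\hat{y}_{T+1..T+h\mid C_1} = C_1\left(\textbf{y}_{1..T}\right), \qquad
\hat{y}_{T+1..T+h\mid C_2} = C_2\left(\hat{y}_{T+1..T+h\mid C_1},\, \textbf{z}_{T+1..T+h}\right).
\]

Historical data enters only through \(C_1\), whereas prospective context enters only through \(C_2\), which operates on inputs aligned with the horizon of prediction.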

Components \(C_1\) and \(C_2\) can be parameterized with alternative network layering able to process multivariate time series, including stacked recurrent units, (deep) temporal convolutional networks, and graph convolutional networks.

Considering the given architecture, three major training possibilities are selected for this study: separate, alternating and joint training. In the joint training setting, a single loss function is considered at the end of the network. In separate and alternating training settings, a loss function is applied at both the end of component \(C_1\) (history-aware forecasting) and the end of component \(C_2\) (prospective context regularization). In the separate training setting, the parameters of component \(C_1\) are first optimized, followed by the optimization of the parameters of component \(C_2\), corresponding to the fully independent adjustment of the forecasted series using prospective context. In the alternating training setting, the optimization alternates between components for every iteration of the learning process under a fixed batch size.

Mean absolute error (MAE), mean squared error (MSE) and cosine are tested as viable loss functions for each setting. Variants of the proposed architecture, including the replacement of LSTM units by gated recurrent units (GRUs), are also considered. The Adam optimizer with early stopping is selected to learn the target networks. Remaining relevant parameters, including the selected activation functions and applied forms of regularization, are subjected to hyperparameterization (Sect. 5).
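For reference, the three candidate losses can be written as below. This is a plain-Python sketch for a single forecasted series. The text does not specify the exact cosine formulation, so the form used here (one minus cosine similarity) and the zero-norm convention are assumptions; the function names and example values are illustrative.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over the horizon of prediction."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error over the horizon of prediction."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def cosine_loss(y_true, y_pred):
    """Assumed formulation: 1 - cosine similarity between the two series."""
    dot = sum(a * b for a, b in zip(y_true, y_pred))
    norm = math.sqrt(sum(a * a for a in y_true)) * math.sqrt(sum(b * b for b in y_pred))
    if norm == 0:
        return 1.0  # convention chosen here for all-zero series
    return 1.0 - dot / norm

# Illustrative values only: observed vs. forecasted hourly emergencies.
y_true = [11, 9, 8]
y_pred = [10, 9, 10]
errors = (mae(y_true, y_pred), mse(y_true, y_pred), cosine_loss(y_true, y_pred))
```

Under this formulation the cosine loss depends only on the shape of the forecasted series, not its magnitude, whereas MAE and MSE penalize magnitude errors, with MSE weighting large deviations more heavily.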

4.2 Incorporating historical context

Taking both the target and auxiliary historical data as input, \(C_1\) is able to capture significant cross-variable dependencies, as well as their relationship with future targets, via multivariate memory cells and subsequent dense layering. An arbitrarily-high number of historical context variables can be integrated to guide the learning task. To this end, masking principles are necessary to compose the input multivariate time series, \(\textbf{y}_{1..T}\). For the purpose of illustrating the principles introduced along this section, consider the hourly forecast of the number of medical emergencies along a given region, \(\textbf{y}_{example}\)=(..., 11, 9, 8, ...).

In addition to the target variable, calendrical, event and weather variables are paradigmatic context sources in urban domains that can be integrated to guide the prediction. Calendrical variables convey information related to the calendar, including:

  • time within the day to help capture daily seasonal patterns. Taking the illustrative series of hourly medical emergencies, we can enrich it by adding hour information, e.g., \(\textbf{y}_{example}\)=(..., (11, 10pm), (9, 11pm), (8, 12am), ...);

  • weekday information, in which the day when events occurred is incorporated in the series as, for instance, a nominal encoding of Monday to Sunday. Complementarily, weekend information may also be included for a coarser differentiation between weekdays and weekends. By doing so, we can extend the previous series, \(\textbf{y}_{example}\), to further incorporate weekday information, e.g., \(\textbf{y}_{example}\)=(..., (11, 10pm, weekday), (9, 11pm, weekday), (8, 12am, weekend), ...);

  • other types of calendrical information such as holidays, festivities, and academic calendars can be further included.
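The calendar masks listed above can be derived directly from the timestamp of each observation. The sketch below assumes an hourly series with a known first timestamp; the function name `add_calendar_context`, the integer encodings, and the starting date (a Friday evening, chosen to mirror the illustrative series) are hypothetical.

```python
from datetime import datetime, timedelta

def add_calendar_context(counts, first_timestamp):
    """Pair each hourly count with (hour of day, weekday index, weekend flag)."""
    enriched = []
    for i, count in enumerate(counts):
        ts = first_timestamp + timedelta(hours=i)
        weekday = ts.weekday()            # 0 = Monday, ..., 6 = Sunday
        is_weekend = int(weekday >= 5)    # coarser weekday/weekend differentiation
        enriched.append((count, ts.hour, weekday, is_weekend))
    return enriched

# Illustrative series (..., 11, 9, 8, ...) starting on a Friday at 10pm (assumed date).
y_example = add_calendar_context([11, 9, 8], datetime(2021, 6, 4, 22))
```

The third observation falls at midnight on Saturday, so its weekend flag switches to 1. Integer encodings are used here for brevity; a one-hot or cyclical encoding is an equally valid choice, and holiday or academic-calendar flags would be added as further columns.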

Similarly to the introduced calendar masks, event masks mark periods where events of interest occur, e.g., large-scale gatherings in urban domains such as festivals, concerts, and sport events. These sources of context are generally circumscribed to a specific geographical area. The corresponding series generally establish a measure of the impact of the event in space and time. In urban domains, the real-time acquisition of events from public repositories, as well as the dynamic inference of their spatiotemporal footprint, has been previously studied (Cerqueira et al. 2021).

Finally, weather conditions exert influence on diverse systems, including the illustrative urban ones. Rain, fog and snow increase the likelihood of traffic accidents (Yannis and Karlaftis 2010; Andreescu and Frost 1998), while extreme weather can negatively affect health, correlated with the demand for medical emergency services (Kjellstrom et al. 2010; Wong and Lai 2012). Meteorological variables collected from weather stations can thus be further incorporated to guide the learning process. These can include temperature, relative humidity, wind intensity, among others. Analogously, series can be augmented with this information, e.g., precipitation levels registered in the period of each observation, \(\textbf{y}_{\text{example}}\)=(..., (11, 1.6 mm), (9, 0.8 mm), (8, 0.7 mm), ...).

4.3 Incorporating prospective context

The inclusion of \(C_2\), which takes the forecasted series from \(C_1\) and the prospective external context along the forecasting horizon as input, allows the model to guide and adjust the forecasts along the forecasting horizon in the presence of prospective context information, producing the final forecasted series. In this context, \(C_2\) can be thought of as a context-aware denoiser or time-dependent regularizer of the forecasts. Hence, besides being able to learn relations between historical context and future behaviour of the target variable, the model can also learn relations between the target variable and prospective context variables under the same future periods. Although empirical analysis shows optimal performance of \(C_2\) under an LSTM unit, a gated recurrent unit (GRU) provides a competitive and less demanding alternative for this stage. On the opposite pole, stacking of recurrent units and convolutional layering can be further considered for complex data domains, where prospective context is characterized by a high multivariate order.

Considering the illustrative case of medical emergency forecasting, severe weather conditions, specific calendrical festivities (e.g., Christmas period), seasonality factors (e.g., day time), and planned events (e.g., large-scale gatherings), are known to be strongly correlated with the demand observed for multiple types of medical emergencies (Silva et al. 2021). Prospective calendrical variables can be obtained through calendar information similarly as in the historical setting; prospective weather variables can be obtained through public databases and web APIs; and planned events can be derived from web content and (semi-)structured public repositories, such as cultural agendas and usage planning of public spaces. Principles for the autonomous retrieval and preprocessing of these sources of external context have been discussed in previous works (Cerqueira et al. 2021).

Once collected, the masking principles introduced in Sect. 4.2 can similarly be applied to prospective context, to produce a novel input for the \(C_2\) component, so that the preliminary forecasts, \(\hat{y}_{T+1..T+h\mid C_1}\), are further subjected to context-dependent corrections to improve predictive accuracy.

As some context sources are subject to varying rates of historical and prospective missingness, the availability and completeness of each source should be assessed. For instance, when considering planned public events, some event categories are extensively complete in advance (e.g., festivals, sport events) and can be used in accordance with the proposed masking principles, while event categories with higher missing predisposition should be further inquired. Incomplete context data can be divided according to whether missingness is predominant in the future or in both historical and future time periods. In the former case, historical context data can standardly be used. For time periods with high missing predisposition, the default masking principles can be followed under the premise that the partial set of gathered events can assist the forecaster. Nevertheless, as periods without recorded events can be mistakenly interpreted as periods without occurring events, sources with high missing rates can be excluded or, alternatively, dedicated masking symbols included to signal the presence of periods with incomplete information.
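The use of a dedicated masking symbol can be sketched as follows. This is an assumption-laden illustration: the symbol value `MISSING = -1`, the function name `event_mask_with_missingness`, and the representation of source coverage as a set of hour indices are hypothetical, as the text does not prescribe a specific encoding.

```python
MISSING = -1  # hypothetical symbol for periods with incomplete event information

def event_mask_with_missingness(event_hours, covered_hours, n_hours):
    """Build an hourly event mask that separates 'no event' from 'unknown'.

    event_hours:   hour indices with a recorded event
    covered_hours: hour indices for which the source is considered complete
    """
    mask = []
    for h in range(n_hours):
        if h not in covered_hours:
            mask.append(MISSING)   # coverage unknown: do not claim absence of events
        elif h in event_hours:
            mask.append(1)         # recorded event
        else:
            mask.append(0)         # source is complete here and lists no event
    return mask

# The source is assumed complete for the first five hours of the horizon only.
mask = event_mask_with_missingness(
    event_hours={2, 3}, covered_hours=set(range(5)), n_hours=8
)
```

With this encoding, the last three periods are flagged as incomplete rather than event-free, so the forecaster is not misled into treating unrecorded periods as periods without events.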

5 Results and discussion

Using case studies in the urban domain, this section experimentally assesses the proposed contribution, answering three major research questions:

  Q1. To what extent does the incorporation of prospective context aid forecasts?

  Q2. What are the benefits arising from each accessible source of context in the targeted urban domain?

  Q3. How do context-aware multiple-input networks compare in terms of predictive accuracy against state-of-the-art alternatives?

The proposed context-aware forecasters are implemented in Python and made available via GitHub at:

https://github.com/joaopalet/multiple-input-context-aware-forecaster.

5.1 Validation methodology

For a robust evaluation of the forecasters, time-aware cross-validation is performed with the necessary care intrinsic to time series data partitioning (Fig. 3). Each dataset produced per iteration is further divided into train, validation and test sets. The training and validation sets comprise 80% of each dataset, from which 80% is training data and 20% is validation data. The remaining 20% of each dataset is used for testing. The partitions are preserved for the training, validation and testing of each forecaster.
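The partitioning of a single dataset can be sketched as below. The percentages follow the text, whereas the contiguous chronological ordering (training, then validation, then testing) is an assumption made for this illustration, as is the function name `chronological_split`; the construction of the datasets per cross-validation iteration is not covered.

```python
def chronological_split(series, train_val_frac=0.8, train_frac=0.8):
    """Split an ordered series into contiguous train, validation and test blocks."""
    n = len(series)
    n_train_val = int(n * train_val_frac)      # first 80%: training + validation
    n_train = int(n_train_val * train_frac)    # 80% of that block: training
    train = series[:n_train]
    val = series[n_train:n_train_val]
    test = series[n_train_val:]                # remaining 20%: testing
    return train, val, test

# Placeholder dataset of 1000 ordered observations.
train, val, test = chronological_split(list(range(1000)))
```

Under these proportions, training, validation and testing data account for 64%, 16% and 20% of each dataset, respectively.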

For the learning of the proposed models, each partition is further segmented into a set of data instances, each in the form of an input–output pair. For the given urban scenarios, the input series is a full week of data (168 hour periods), and the output is the subsequent series with the length of the forecasting horizon, i.e., 24 h. Figure 4 illustrates the creation of these instances, ensuring that no testing instances precede training instances.
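The segmentation into input–output pairs can be sketched as a sliding window. The input length (168) and output length (24) follow the text; the stride of one period between consecutive instances and the function name `make_instances` are assumptions for illustration.

```python
def make_instances(series, input_len=168, horizon=24, stride=1):
    """Slice an ordered series into (input, output) pairs via a sliding window."""
    instances = []
    last_start = len(series) - input_len - horizon
    for s in range(0, last_start + 1, stride):
        x = series[s:s + input_len]                           # one full week of hourly data
        y = series[s + input_len:s + input_len + horizon]     # the subsequent 24 h
        instances.append((x, y))
    return instances

# Ten days of placeholder hourly values.
hourly = list(range(24 * 10))
pairs = make_instances(hourly)
```

Applying the function separately to each partition keeps every instance within a single partition. A series shorter than 192 periods yields no instances.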

Table 1 lists the sources of context considered along the target case studies.

Fig. 3 Creation of datasets for cross-validation of the time-series data

Fig. 4 Input–output pair creation for the learning of the forecasting models

Table 1 Context data sources for the target urban case studies

5.2 Baseline models

The performance of the proposed multi-input neural network is assessed against a univariate model and an equivalent model that only incorporates historical context, depicted in Figs. 5 and 6, respectively. The proposed models are further compared against Holt-Winters’ exponential smoothing and a flexibly optimized feed-forward neural network (FFNN). In addition, spatiotemporal graph convolutional networks (Footnote 1) are selected as reference state-of-the-art forecasting baselines (Yu et al. 2017) given their inherent ability to account for complex interactions between the target and context variables expressed within a graph structure via graph convolutions, as well as to capture temporal dependencies through the incorporation of LSTM layers to perform forecasting on the graph. In the context of our work, the graph captures the pairwise cross-correlation between input variable series and is then postprocessed to either consider all associations (dense graph) or discard uncorrelated variables (sparse graph) when performing graph convolutions. A minimum graph density of 1/3 is fixed to avoid overly sparse graph representations so that dependencies with the available context data sources can be more comprehensively explored. Deep temporal convolutional networks (deep TCNs) (Footnote 2) (Chen et al. 2020; Ruan et al. 2020) are further assessed in the presence and absence of historical and prospective sources of context. As TCNs are not inherently prepared to handle prospective context, two settings are considered to this end: i) the proposed multi-input architecture with its components parameterized with the reference TCNs (Chen et al. 2020), and ii) a single deep TCN with prospective context data given as complementary input series, convex with historical series whenever applicable.
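One possible reading of the sparse graph construction is sketched below. Several choices are assumptions made for illustration only: zero-lag Pearson correlation stands in for the cross-correlation measure, the correlation `threshold` is arbitrary, and the minimum density is enforced by keeping the strongest edges when too few pass the threshold. The helper names are hypothetical.

```python
import math
from fractions import Fraction
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Zero-lag Pearson correlation; returns 0.0 for constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def sparse_edges(
    series: Dict[str, Sequence[float]],
    threshold: float = 0.5,                      # assumed cut-off
    min_density: Fraction = Fraction(1, 3),      # density floor from the text
) -> List[Tuple[str, str]]:
    """Keep correlated variable pairs while enforcing a minimum graph density."""
    names = sorted(series)
    scored = sorted(
        ((abs(pearson(series[a], series[b])), a, b)
         for a, b in combinations(names, 2)),
        reverse=True,
    )
    min_edges = math.ceil(min_density * len(scored))
    kept = [(a, b) for score, a, b in scored if score >= threshold]
    if len(kept) < min_edges:
        # Too sparse: fall back to the strongest associations.
        kept = [(a, b) for _, a, b in scored[:min_edges]]
    return kept


# Toy series: temperature tracks demand, noise is uncorrelated with both.
toy = {
    "demand": [1, 2, 3, 4],
    "temperature": [2, 4, 6, 8],
    "noise": [1, -1, -1, 1],
}
print(sparse_edges(toy))  # [('demand', 'temperature')]
```

The dense graph corresponds to keeping every pair returned by `combinations`, so only the sparse variant needs the density floor.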

Fig. 5 Univariate RNN architecture with parameterizable \(C_1\) component

Fig. 6 Multivariate RNN architecture, able to incorporate historical context variables, with parameterizable \(C_1\) component

Table 2 Baseline forecasting models

The assessed neural network models were parameterized with the mean squared error (MSE) loss and the Adam optimizer (Kingma and Ba 2014). Sensitivity analysis was performed through manual search with different initial learning rates (\(1\textrm{e}{-3}\) to \(1\textrm{e}{-6}\)), values for the batch size (1, 2, 4, 8, and 16), and types of regularization (L1 and L2). All models were trained for a total of 400 epochs, using an early stopping criterion based on training versus validation loss. Table 2 describes all baseline models in greater detail.

When considering the proposed learning settings for the target sequential multi-input network (joint, separate and alternating training), we observed that the joint training setting (single loss function) consistently yields the best results. This observation is hypothesized to be driven by the inherent simplicity of the proposed architecture, together with the fact that joint training offers greater sensitivity to the dependencies between the two components. Hence, unless stated otherwise, the displayed results for the target multi-input sequential network refer to the joint learning setting.

5.3 Case study 1: medical emergencies in Lisbon

The analysis of medical emergencies in Lisbon is introduced as the primary case study to validate the contributions. The pre-hospital medical emergency records for the city of Lisbon were provided by Instituto Nacional de Emergência Médica (INEM), the emergency medical service (EMS) provider in mainland Portugal. The data consist of all registered emergency cases in Lisbon from 2016 to 2017, comprising a total of 180,234 medical emergencies. Following the principles illustrated in Fig. 3, nine datasets comprising a full year of data were created, with a step size of six weeks between them.
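The number of datasets follows from the window length and step size. As a rough check, the sketch below assumes that the records span 1 January 2016 to 31 December 2017 and that a full year means 365 days; neither boundary is stated explicitly in the text, and `rolling_datasets` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import List, Tuple


def rolling_datasets(
    first_day: date,
    last_day: date,
    length: timedelta,
    step: timedelta,
) -> List[Tuple[date, date]]:
    """Enumerate (start, end) days of every window that fits in the period."""
    one_day = timedelta(days=1)
    windows = []
    start = first_day
    while start + length - one_day <= last_day:
        windows.append((start, start + length - one_day))
        start += step
    return windows


windows = rolling_datasets(
    first_day=date(2016, 1, 1),    # assumed first recorded day
    last_day=date(2017, 12, 31),   # assumed last recorded day
    length=timedelta(days=365),    # "a full year of data"
    step=timedelta(weeks=6),       # step size between datasets
)
print(len(windows))  # 9
```

Under these assumptions a tenth window would end after the last recorded day, which is consistent with the nine datasets reported.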

Calendrical context data were gathered from Lisboa Aberta (Footnote 3), while Lisbon’s weather records were acquired from meteo\(\vert \)Técnico (Footnote 4). The collected variables include temperature, recorded in degrees Celsius (\(^{\circ }\hbox {C}\)), precipitation, recorded in mm, and relative humidity, i.e., the concentration of water vapour present in the air (%). In addition to weather variables, we further compared the performance of models incorporating information regarding the hour of the day (hour), day of the week (weekday), weekend status (weekend), as well as a model that incorporates all three of these masks (all calendar).
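The three calendrical masks are fully known in advance, so they can be derived for both the historical window and the forecasting horizon. The sketch below is illustrative only: the integer encoding (Monday as 0, weekend as a binary flag) and the helper name `calendar_masks` are assumptions and may differ from the encoding used in the released code.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def calendar_masks(start: datetime, periods: int) -> List[Dict[str, int]]:
    """Derive hour, weekday and weekend masks for consecutive hourly periods."""
    rows = []
    for step in range(periods):
        timestamp = start + timedelta(hours=step)
        weekday = timestamp.weekday()  # Monday = 0, ..., Sunday = 6
        rows.append({
            "hour": timestamp.hour,
            "weekday": weekday,
            "weekend": int(weekday >= 5),
        })
    return rows


# One input week starting on Monday, 4 January 2016, plus the next 24 hours.
history_start = datetime(2016, 1, 4, 0)
historical = calendar_masks(history_start, periods=168)
prospective = calendar_masks(history_start + timedelta(hours=168), periods=24)
print(historical[-1])   # {'hour': 23, 'weekday': 6, 'weekend': 1}
print(prospective[0])   # {'hour': 0, 'weekday': 0, 'weekend': 0}
```

The historical masks align with the 168 input periods, while the prospective masks cover the 24 periods being forecast.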

Figure 7 compares the results obtained when incorporating historical context (historical architecture) with those obtained when considering both historical and prospective context (historical+prospective architecture), as well as no context (univariate architecture). Compared to the univariate architecture, all models produce significantly better results, with the exception of the models that only incorporate historical weekday and weekend information, which produce comparable results. These results suggest that the hour of the day is the calendrical variable that most affects the volume of EMS demand. Incorporating prospective calendrical context along the forecasting horizon is shown to further improve the performance in comparison to the models that only incorporate context in the historical data. Incorporating all three calendrical variables on both historical and prospective data yields the best results.

Fig. 7 Impact of incorporating calendrical context in the EMS demand forecasting models

Fig. 8 Impact of incorporating weather context in the EMS demand forecasting models

Figure 8 presents the results obtained when incorporating weather variables in the models, while Fig. 9 presents the results obtained when incorporating both weather and calendrical variables. Temperature, precipitation, relative humidity (humidity), and all previous weather variables (all weather) are considered. We observe that, when incorporated alone, historical weather variables do not produce a significant impact on the results. However, improvements are achieved when the models consider prospective weather variables. When added on top of all calendrical variables, temperature and precipitation variables yield further moderate improvements in the performance of the models. Once again, results show that models that also incorporate prospective context perform considerably better than the reference ones.

Fig. 9 Impact of incorporating calendrical context together with weather context in the EMS demand forecasting models

Fig. 10 Comparison of predictive errors (MAE) per time step along the 24-h forecasting horizon when comparing the univariate architecture, the architecture that only incorporates historical context, and the one that also incorporates prospective context

Table 3 Summary of results obtained (MAE and RMSE), for all EMS demand forecasters implemented, as mean ± standard deviation

Given the forecasting horizon of 24 hours, one of the main reasons to incorporate prospective context is to mitigate the increase in uncertainty as predictions move forward along the horizon. Taking the best performing model of each presented architecture, we analyzed how predictive errors evolve with the forecasting horizon. According to the collected results, the best forecasters sensitive to historical context are the ones incorporating hour, weekday, weekend, and precipitation context, while the best forecasters sensitive to both historical and prospective context incorporate hour, weekday, weekend, and temperature context. The assessment of prediction errors along the forecasting horizon is shown in Fig. 10. In the first few hours of the horizon, the errors of the univariate architecture are already much higher, since the model has no sense of the situational context, while both context-aware models are able to keep the errors at a fairly constant level. As predictions move forward and the current context may start to differ from the historical context, the errors of the model that only considers historical context start to increase around the 8th hour of the forecasting horizon, while those of the model incorporating prospective context remain fairly constant.
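The per-step error analysis amounts to averaging absolute errors over test instances separately for each position of the horizon. A minimal sketch, with a hypothetical helper name and toy values in place of the study data:

```python
from typing import List, Sequence


def mae_per_step(
    actual: Sequence[Sequence[float]],
    predicted: Sequence[Sequence[float]],
) -> List[float]:
    """Mean absolute error at each horizon step, averaged over instances."""
    horizon = len(actual[0])
    n_instances = len(actual)
    return [
        sum(abs(a[step] - p[step]) for a, p in zip(actual, predicted)) / n_instances
        for step in range(horizon)
    ]


# Two toy instances with a 3-step horizon (the study uses 24 steps).
actual = [[10, 12, 14], [20, 22, 24]]
predicted = [[11, 12, 17], [19, 22, 21]]
print(mae_per_step(actual, predicted))  # [1.0, 0.0, 3.0]
```

Plotting the returned list against the step index gives the kind of curve used to compare architectures along the horizon.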

Table 3 presents the forecasting errors (MAE and RMSE) of the proposed models and all suggested baseline methods for each combination of context variables. Results suggest that incorporating calendrical and weather information, particularly both historical and prospective, improves the performance of the forecasting models.

For the target case study, graph convolutional networks show an inherent difficulty in exploiting relevant associations with historical context, given their moderately lower performance against the context-unaware version. In fact, denser graph representations, where a high number of pairwise associations between input variables are considered to guide the learning, are shown to deteriorate performance in the given emergency domain. As the original architecture was proposed for the spatiotemporal forecasting of traffic data, where graph associations capture traffic-wise correlated roads, its adaptation to domains where the input variables are less correlated is shown to be less stable, potentially requiring further architectural tuning or the replacement of graph convolutions by alternative transformations. Complementarily, feed-forward neural networks show higher error variability along the prediction horizon (within-instance variability) than the simplistic recurrent layering. Although unable to adequately explore the temporal dependencies of the available context data under the proposed masking principles, their exhaustive optimization and limited series length ensured a more competitive performance regarding MAE.

Considering the proposed architecture, replacing the LSTM unit in the \(C_1\) component by a deep temporal convolutional network (TCN) produced slight improvements for the combined incorporation of historical and prospective context (statistically significant for MAE, non-significant for RMSE under \(\alpha \)=1E-3). Similarly, the stacking of an additional LSTM in \(C_1\) produced statistically significant improvements against the single context-aware LSTM counterpart (p-value<1E-3). In this context, the exploration of stacking or alternative layering within both components of the proposed architecture is suggested as a future direction.

To assess the statistical significance of the previous results, we used the t-test when error estimate samples pass the Shapiro–Wilk normality test (\(\alpha \)=0.01) and, otherwise, the Mann–Whitney U test, its non-parametric counterpart. Table 4 summarizes the p-values of the statistical tests performed for each combination of context variables tested. For 21 out of the 24 pairs tested, the p-values were lower than or equal to the threshold (0.05), indicating that the differences between the great majority of the results obtained are statistically significant.
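The test selection procedure can be sketched with SciPy. This is an illustrative reading rather than the authors' script: the text does not say whether the t-test is paired or independent, so an independent two-sample test is assumed here, and `compare_errors` is a hypothetical helper name.

```python
from typing import Sequence, Tuple

from scipy import stats


def compare_errors(
    errors_a: Sequence[float],
    errors_b: Sequence[float],
    normality_alpha: float = 0.01,
) -> Tuple[str, float]:
    """Pick the t-test or Mann-Whitney U test based on Shapiro-Wilk normality."""
    # Both samples must pass the normality test to use the parametric test.
    both_normal = all(
        stats.shapiro(sample)[1] > normality_alpha
        for sample in (errors_a, errors_b)
    )
    if both_normal:
        # Assumption: independent two-sample t-test.
        _, p_value = stats.ttest_ind(errors_a, errors_b)
        return "t-test", float(p_value)
    _, p_value = stats.mannwhitneyu(errors_a, errors_b, alternative="two-sided")
    return "mann-whitney", float(p_value)


# Toy error samples from two clearly separated forecasters.
errors_univariate = [float(v) for v in range(101, 121)]
errors_prospective = [float(v) for v in range(1, 21)]
test_name, p_value = compare_errors(errors_univariate, errors_prospective)
print(test_name, p_value <= 0.05)
```

The returned p-value is then compared against the 0.05 threshold to decide whether the difference between the two forecasters is statistically significant.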

Table 4 P-values between the EMS demand forecasting errors of the models incorporating both historical and prospective context against two baseline methods: the univariate architecture and the model with only historical context, for the same context variables

Complementarily, Fig. 11 measures the impact of replacing the LSTM units by Gated Recurrent Units (GRU) in the compared models, showing that GRU-based network designs only become competitive in the multi-input stacked model, due to the inherent simplicity of this unit.

Fig. 11 Comparing the predictive efficacy of GRU versus LSTM units in the introduced context-aware neural architectures

Finally, to measure the upper limits of introducing prospective weather data, Fig. 12 assesses the differences between the gains from incorporating actual weather observations in the future (no noise) versus the observed gains from incorporating professional weather forecasts (inherently susceptible to noise). Most of the observed differences are small to moderate, showing the relevance of weather forecasts as proxies for future weather, while bounded in their potential predictive value for the targeted forecasting problem.

Fig. 12 Potential of prospective weather data for multi-input networks: ground truth prospective weather versus professional weather forecasts

5.4 Case study 2: Lisbon’s bike sharing system

As a complementary case study, the demand for Lisbon’s public bike sharing system (BSS), called GIRA, is introduced. Descriptive statistics of the capacity-demand dynamics of the GIRA network are publicly available (Footnote 5). Two major data sources are considered. First, a curated sample with all bike trip records from December 2018 until February 2019, along with timestamps for every change in station state and corresponding load value. Second, publicly available station state records (Footnote 6) with the load state of each station at specific time points of a day from 2020 to 2022. Accordingly, we target two major forecasting tasks: the prediction of the hourly number of check-in events for the GIRA network along a single day using the first data source, and the hourly station-specific load along a single day using the second data source. Following the principles illustrated in Fig. 3, six (twelve) datasets were derived for the first (second) data source, each comprising six weeks (18 months) of data with a step size of one week (one month).

Fig. 13 Impact of incorporating calendrical context together with weather context in the BSS demand forecasting models

Considering the forecasting task of hourly check-in events, calendrical information and relative humidity were selected as the sources of context. During the accessible three-month period, temperature showed stable daily bounds and oscillation, and precipitation was only observed on four days; both variables were hence excluded as insufficient to learn relationships with bike demand. Figure 13 provides a view on the forecasting error (MAE) in the absence and presence of historical and prospective context. In the majority of scenarios, particularly under calendrical variables, we observe that the incorporation of prospective context benefits the results, although not as significantly as in the previous medical emergency domain. Statistically significant improvements against context-unaware forecasters (\(\alpha \)=1E-3) are only observed when introducing sources of prospective context. The low volume of available data, limiting the generalization ability of the underlying forecasters, is hypothesized as the major cause for the moderate improvements.

Fig. 14 Comparison of predictive errors (MAE) along the 24-hour forecasting horizon when incorporating historical and prospective context

Fig. 15 Statistical improvements from historical and prospective context incorporation for the load state forecasting of stations with distinct profiles using multi-input networks parameterized with default deep TCNs

Complementarily, Fig. 14 depicts the distribution of the errors along the prediction horizon, highlighting that the gains from incorporating prospective context data are generally statistically significant between 7AM and 8PM. The higher impact of calendrical information on bike demand throughout the daylight period, along with the greater weather influence on transport modal choices due to the adequate offer of public transport alternatives during this period, are arguably linked to this observation.

Moving from check-in forecasts to the prediction of station-specific hourly load states, we observe that the incorporation of prospective context provides statistically significant improvements for over 2/3 of the approximately 100 stations active during the target period (2020–2022). Figure 15a illustrates the intrinsic difficulty of this task by disclosing the irregularity of the hourly load state for different stations along two consecutive weeks. For illustrative purposes, 4 stations with distinct profiles are selected: 2 stations at business centers (307 at Marquês de Pombal and 406 at Praça Saldanha); 1 station at a central residential area (443 at Av. Roma); and 1 station at the city periphery (115 at Passeio dos Heróis do Mar). Figure 15b assesses the role of incorporating historical and prospective context sources using the proposed multi-input network (parameterized with deep TCNs) for these stations, showing generalized statistically significant improvements.

6 Conclusions

This work proposed neural processing principles to improve the performance of prediction tasks in the presence of heterogeneous sources of historical and prospective context data. To this end, we introduced a multiple-input recurrent neural architecture that is a serial composition of two LSTM-based components, where the first component captures historical cross-variable relationships, while the second component uses prospective context for the time-dependent correction of the forecasts returned by the first. Masking principles to derive auxiliary endogenous and exogenous series were also proposed.

The demand for medical emergencies and for public bike sharing in the city of Lisbon were considered as case studies to validate the proposed methodology. Results showed that network models incorporating both historical and prospective context provide significantly more accurate forecasts than the counterparts that do not incorporate context or only consider historical context to guide the forecasts. Statistically significant improvements were further confirmed against state-of-the-art forecasters. The role of calendrical context variables, although often disregarded, is shown to be critical in the targeted urban domains. Finally, the proposed models are able to keep the errors low as we move forward in the forecasting horizon.

The proposed neural processing units are simplistic, yet effective, providing a basis to capture relationships with available sources of historical and prospective context data that can potentially be embedded within more complex neural-based predictive models. The introduced principles also form a sound and state-of-the-art reference for the assessment of upcoming contributions in the context-aware forecasting field.

The development of superior context-aware forecasters is highlighted as a future line of work. Similarly, we expect to assess the role of complementary masking principles. Finally, as some sources of context data are inherently sparse and their incorporation increases the multivariate order of the available data, we further aim at more closely exploring how much context data are necessary to escape generalization difficulties.