
1 Introduction

The problem of time series forecasting, in its simplest form, deals with the prediction of a given quantity of interest in the future, given its historical values. One may be interested in forecasting only the immediate next value (one-step-ahead forecasting) or in estimating a sequence of future values (multi-step-ahead forecasting). Similarly, the problem might involve a single quantity (univariate forecasting) or several quantities at once (multivariate forecasting), in order to exploit potential interrelationships among them. In the context of finance, typical quantities of interest are, among others, the stock price of a given company over time, its returns, or the intensity of the fluctuations affecting the price (i.e. its volatility). In the specific case of stock markets, the underlying trend of the market influences all the stocks that are currently traded. As shown in [18], stock prices of firms acting on the same market often show similar patterns in the wake of news that are important for the entire market. Moreover, analyzing global volatility transmission, Engle et al. [12] found evidence supporting volatility interdependence among the world’s major trading areas. For these reasons, when modeling these time-dependent quantities of interest, a multivariate model appears to be a natural choice to incorporate interdependencies into the forecasting process.

Among all the quantities of interest, in the following we will focus on the problem of multivariate volatility forecasting. In this specific case, the quantity of interest is a latent variable, which cannot be directly observed in the time series but only estimated, according to the granularity and the type of the available data, through different measures, named volatility proxies [27]. Depending on the choice of the proxy, several approaches have been proposed to tackle this multivariate problem. The largest body of the volatility forecasting literature focuses on multivariate extensions of the well-known GARCH model [2] applied to traditional stock market data; see, among recently published work, [13] and [3]. For a thorough review of the different univariate and multivariate methods, we refer the interested reader to the latter. Due to the steady growth of the cryptocurrencies’ market capitalization [11], coupled with the currencies’ volatility, GARCH-like models [7, 32] have also been applied to non-traditional markets. The main problem of these approaches is that traditional multivariate models often suffer from the “curse of dimensionality”: the number of parameters grows superlinearly with the number of dimensions, making model estimation computationally intensive, especially in the case of multiple-step-ahead forecasts.

In order to profit from the richness of a multivariate model while maintaining a reasonable computational complexity, we propose to employ the DFML [4], a multivariate, multi-step-ahead machine learning forecasting framework involving a dimensionality compression process based on the dynamic factor model (DFM) principle [14]. The choice of this generic time series forecasting framework requires the use of model-independent volatility proxies, which will be discussed in Sect. 3. This leads us to dismiss GARCH as a proxy of volatility, due to the tight coupling between the proxy and the corresponding forecasting model, as discussed in [8].

At the time of writing, we were able to find either multivariate techniques dealing with the forecasting of cryptocurrency prices [1, 6] or univariate techniques dealing with the forecasting of volatility, either one-step-ahead [7, 32] or multi-step-ahead [10]. However, we are not aware of any other work tackling both multivariate and multi-step-ahead cryptocurrency volatility forecasting, specifically in the case of large dimensionality and a reduced number of data points. Our technique will be tested on two different benchmarks: one concerning cryptocurrencies and a second one concerning a traditional regulated stock market (CAC40), making this work a de facto multivariate extension of [25].

The rest of the paper is structured as follows: Sect. 2 provides an overview of the Dynamic Factor Machine Learner approach. Section 3 introduces the different tested multivariate models as well as the considered datasets and the formulation of the relevant forecast quantities. Section 4 presents and discusses the experimental results, and Sect. 5 concludes the paper with future research directions.

2 Dynamic Factor Machine Learner

A Dynamic Factor Model (DFM) is a technique for multivariate forecasting originating in the economic forecasting community [14]. The basic idea of DFM is that a small number of unobserved series (or factors) can account for the temporal behavior of a much larger number of variables. If we are able to obtain accurate estimates of these factors, the forecasting task can be made simpler by forecasting the estimated dynamic factors instead of all the original series. In equations:

$$\begin{aligned} \mathbf {Y}_{t+1}&= \mathbf {W} \mathbf {Z}_{t+1} + \epsilon _{t+1} \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {Z}_{t+1}&= \mathbf {A}_{t} \mathbf {Z}_{t}+\dots + \mathbf {A}_{t-m+1} \mathbf {Z}_{t-m+1} +\eta _{t+1} \end{aligned}$$
(2)

where \(\mathbf {Y}_t\) is a multivariate time series vector at time t, \(\mathbf {Z}_t\) is the vector of unobserved factors of size q (\(q\,\ll \,n\)), \(\mathbf {A}_i\) are \(q \times q\) coefficient matrices, \(\mathbf {W}\) is the \((n \times q)\) matrix of dynamic factor loadings and the vectors of disturbance terms \(\eta \) are assumed to be uncorrelated. As shown in Eq. 2, in the original DFM, the latent factors follow a VAR time series process. For a detailed review of DFM models, the interested reader can refer to [28].
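To make Eqs. 1 and 2 concrete, the following minimal R sketch estimates the factors by PCA and forecasts them with a VAR process before projecting the forecasts back into the original space. The vars package, the variable names and the values of q, m and H are illustrative choices of ours, not the implementation used in the experiments of Sect. 4 (which relies on the dse package).

```r
# Minimal DFM sketch (Eqs. 1-2). Assumption: Y is an N x n numeric matrix of
# observed series; q, m and H are illustrative values.
library(vars)

q <- 3; m <- 2; H <- 10
pca <- prcomp(Y, center = TRUE, scale. = FALSE)
W   <- pca$rotation[, 1:q]                   # n x q loading matrix (Eq. 1)
Z   <- pca$x[, 1:q]                          # N x q estimated factors

var_fit <- VAR(as.data.frame(Z), p = m)      # VAR(m) on the factors (Eq. 2)
Z_hat <- sapply(predict(var_fit, n.ahead = H)$fcst,
                function(f) f[, "fcst"])     # H x q matrix of factor forecasts

Y_hat <- sweep(Z_hat %*% t(W), 2, pca$center, "+")   # back to the original space (Eq. 1)
```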

Here, we propose to employ a machine learning extension of the DFM (called DFML - Dynamic Factor Machine Learner). The DFML, first proposed by Bontempi et al. [4] and further discussed in [9], relies on dimensionality reduction techniques to extract the factors. Then, the factors are forecast using a nonlinear model. Finally, the forecasts of the factors are transformed back to the original values by inverting the dimensionality reduction process. The basic architecture of the DFML is depicted in Fig. 1, along with the description of the different variants. Concerning dimensionality reduction, both linear (PCA) and nonlinear (autoencoder) techniques are employed in the DFML. Linear dimensionality reduction by PCA transforms the n original time series \(\mathbf {Y}[1]\), \(\dots \), \(\mathbf {Y}[n]\) into q new variables \(\mathbf {Z}[1]\), \(\dots \), \(\mathbf {Z}[q]\) (called principal components or factors) such that the new variables are uncorrelated with each other and account for decreasing portions of the variance of the original variables. The q principal components are then expressed as weighted sums of the elements of \(\mathbf {Y}\) with maximal variance, where the weights are normalized and constrained to ensure orthogonality:

$$\begin{aligned} \mathbf {Z}[p]=\sum _{j=1}^n w_{jp} \mathbf {Y}[j], \qquad p=1,\dots ,q \end{aligned}$$
(3)

Given the multivariate time series matrix \(\mathbf {Y}\), \(\mathbf {Z}= \mathbf {Y}\mathbf {W}\) represents the projection of the series onto the q principal components and \(\hat{\mathbf {Y}}= \mathbf {Z}\mathbf {W}^T\) represents the reconstruction \(\hat{\mathbf {Y}}\) of the values of \(\mathbf {Y}\) based on the factors \(\mathbf {Z}\). Nonlinear dimensionality reduction, on the other hand, is performed through the use of autoencoders. Autoencoders are neural networks trained to learn the identity mapping from inputs to outputs [31], with an architecture constrained to enforce dimensionality reduction. As such, their input and output layers have the same number of neurons n as the number of input time series, but their hidden layers contain a reduced number of neurons q. Autoencoders are composed of two stacked multi-layer networks: an encoder:

$$\begin{aligned} \mathbf {Z}_t= & {} f_\theta (\mathbf {Y}_t) \end{aligned}$$
(4)

that transforms inputs \(\mathbf {Y}_t\) into some latent (encoded) representation \(\mathbf {Z}_t\), and a decoder:

$$\begin{aligned} \hat{\mathbf {Y}}_t= & {} g_{\theta '}(\mathbf {Z}_t) \end{aligned}$$
(5)

that reconstructs an approximation \(\hat{\mathbf {Y}}_t\) of the input \(\mathbf {Y}_t\) on the basis of the latent features \(\mathbf {Z}_t\), where the mappings \(f_\theta \) and \(g_{\theta '}\) are non-linear. The network is usually trained using gradient descent techniques such as backpropagation, with the objective of minimizing the mean-squared error between the input \(\mathbf {Y}_t\) and the output (its reconstruction \(\hat{\mathbf {Y}}_t\)) [31]. Concerning the forecasting part, the original DFML paper [4] proposes to forecast each factor independently (given their orthogonality) using a nonlinear model (lazy learning [5]) and a univariate multi-step-ahead forecasting strategy. In addition to the basic forecaster, the paper also proposes an optimized version (DFML\('\)), performing a joint selection of the hyperparameters (number of factors for the dimensionality reduction, predictor, and multi-step-ahead strategy for the forecaster) using out-of-sample assessment. Although we consider lazy methods for the forecaster, the modular architecture of this framework easily allows the replacement of the aforementioned technique with alternative supervised machine learning approaches (e.g. SVM, RNN).
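As an illustration of the full DFML pipeline (compression, per-factor forecasting, reconstruction), the sketch below compresses the series with PCA and forecasts each factor with a Direct multi-step-ahead strategy. A k-nearest-neighbour regressor (FNN::knn.reg) stands in here for the lazy-learning predictor of [5]; q, m, H and k are illustrative values rather than the settings selected by DFML\('\).

```r
# DFML sketch: PCA compression + independent Direct forecasting of each factor
# + reconstruction. Y is an N x n matrix; the kNN learner is a stand-in for [5].
library(FNN)

q <- 3; m <- 2; H <- 10; k <- 5
pca <- prcomp(Y, center = TRUE)
W   <- pca$rotation[, 1:q]
Z   <- pca$x[, 1:q]
N   <- nrow(Z)

Z_hat <- sapply(1:q, function(p) {
  E      <- embed(Z[, p], m)                 # rows: (z_t, z_{t-1}, ..., z_{t-m+1})
  x_last <- E[nrow(E), , drop = FALSE]       # most recent lag vector
  sapply(1:H, function(h) {                  # Direct strategy: one model per horizon
    y_h <- Z[(m + h):N, p]                   # factor value h steps ahead
    X_h <- E[1:length(y_h), , drop = FALSE]
    knn.reg(train = X_h, test = x_last, y = y_h, k = k)$pred
  })
})                                           # H x q matrix of factor forecasts

Y_hat <- sweep(Z_hat %*% t(W), 2, pca$center, "+")   # decode back to the n series
```

Replacing the PCA step with an autoencoder (as in DFML\(_{A}\)) only changes the encoding and decoding operations; the per-factor forecasting remains identical.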

Fig. 1. Schema of the DFML architecture with a summary of the different components as implemented in the different proposed methods.

3 Methodology

3.1 Multivariate Forecasting Methods

Multiple Univariate Techniques - {Naive, UNI}: In the case of a multivariate time series \(\mathbf {Y}\), univariate approaches are still of interest, since the multivariate forecasting task can be decomposed into a number of single-output, multi-input tasks (or, equivalently, into a set of NARX tasks with exogenous variables)

$$\begin{aligned} {\left\{ \begin{array}{ll} Y_{t+1}[1]&{}=f_1 (Y_{t}[1],\dots ,Y_{t-m+1}[1], \dots , \\ &{} Y_{t}[n],\dots ,Y_{t-m+1}[n] )+w_t[1]\\ \vdots \\ Y_{t+1}[n]&{}=f_n (Y_{t}[1],\dots ,Y_{t-m+1}[1], \dots , \\ &{} Y_{t}[n],\dots ,Y_{t-m+1}[n] )+w_t[n] \end{array}\right. } \end{aligned}$$
(6)

In this case the training set is used to learn the n mapping functions \(f_i\), \(i=1,\dots ,n\), with \(w_t[i]\) being uncorrelated disturbances. For large n, the problem of large input dimensionality can be addressed by adopting a feature selection technique, selecting a reduced number q of the most correlated features. For these univariate techniques, we will also consider a naive method in which \(\forall i \in \{1,\dots ,n\}, f_i(t)=Y_{t-1}[i]\), i.e. for every series, the forecast for the following H steps is given by the last available value. These are the baseline methods against which we will compare the performance of our forecaster.
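As a reference point, a minimal R sketch of the naive baseline and of the correlation-based feature selection used by the UNI models is given below; Y is assumed to be the N x n proxy matrix and the values of H and q are illustrative.

```r
# Naive baseline: repeat the last observed value of each series over the next H steps.
naive_forecast <- function(Y, H) {
  last <- Y[nrow(Y), ]
  matrix(rep(last, each = H), nrow = H, dimnames = list(NULL, colnames(Y)))
}

# Correlation-based input selection for the UNI models (Eq. 6): keep, for each
# target series, the indices of the q series most correlated with it.
select_inputs <- function(Y, target, q = 3) {
  cors <- abs(cor(Y))[, target]
  order(cors, decreasing = TRUE)[seq_len(q)]
}

Y_hat_naive <- naive_forecast(Y, H = 10)
```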

Partial Least Squares - PLS: Partial Least Squares [15] allows the joint forecasting of the H steps ahead of the multivariate time series on the basis of the lagged vectors \(\mathbf {Y}_{t},\dots ,\mathbf {Y}_{t-m}\). This is a multi-input, multi-output regression task where the number of inputs amounts to nm and the number of outputs to Hn, with n being the number of variables, m the embedding order of the model and H the forecasting horizon. The benefit of PLS is that it allows, at the same time, a dimensionality reduction of the inputs and a joint prediction of the outputs, thereby taking into consideration the dependency between the future steps. An example of application of PLS to financial time series forecasting can be found in [22].
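The following sketch shows how such a multi-input, multi-output PLS regression can be set up with the pls package used in Sect. 4; the lag construction, the number of latent components and the variable names are illustrative choices of ours.

```r
# PLS sketch: inputs are the stacked lag vectors (n*m values), outputs the
# stacked next H steps (n*H values). Y is an N x n matrix.
library(pls)

m <- 2; H <- 10; n <- ncol(Y); N <- nrow(Y)
times <- m:(N - H)
X_in  <- t(sapply(times, function(t) as.vector(Y[(t - m + 1):t, , drop = FALSE])))
Y_out <- t(sapply(times, function(t) as.vector(Y[(t + 1):(t + H), , drop = FALSE])))

dat <- data.frame(X = I(X_in), Yf = I(Y_out))
fit <- plsr(Yf ~ X, data = dat, ncomp = 3)     # joint reduction of inputs and outputs

x_new <- matrix(as.vector(Y[(N - m + 1):N, , drop = FALSE]), nrow = 1)
y_new <- predict(fit, newdata = data.frame(X = I(x_new)), ncomp = 3)  # stacked n*H forecast
```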

Recurrent Neural Networks - {RNN, LSTM}: Recurrent Neural Networks (RNN) form a class of predictive models based on neural networks, in which recurrent connections allow the modeling of dynamic temporal dependencies. In their simplest form (also known as simple RNN) [17, 23], the recurrent connections come from a hidden state \(H_t\), which is also used for predicting future values \(Y_t\):

$$\begin{aligned} \mathbf {H}_t= & {} \sigma (\mathbf {W}_{HY}\mathbf {Y}_{t-1}+\mathbf {W}_{HH}\mathbf {H}_{t-1}+\mathbf {B}_H), \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {Y}_t= & {} \mathbf {W}_{YH}\mathbf {H}_t+\mathbf {B}_Y \end{aligned}$$
(8)

The matrices \(\mathbf {W}_{HY}\), \(\mathbf {W}_{HH}\), \(\mathbf {W}_{YH}\), \(\mathbf {B}_H\) and \(\mathbf {B}_Y\) are the parameters (weights and biases) of the network, typically learned by gradient descent algorithms such as backpropagation through time [17]. A sigmoid activation function \(\sigma \) allows the modeling of nonlinear dependencies, while the recurrent connections allow the modeling of long-term temporal dependencies. Research on RNNs has recently been boosted by the advent of general-purpose graphics processing units (GPGPU) and by improved designs of the memory cell (Long Short-Term Memory cells [20]). These have allowed much more efficient RNN implementations and effective training over multiple layers (deep RNNs). RNN architectures have reached state-of-the-art performance for volatility forecasting, either as part of LSTM-GARCH hybrid models [21, 33] or as standalone models [26].
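A compact example of such a recurrent forecaster, written with the keras R package (the experiments in Sect. 4 use the kerasR interface instead), is sketched below; the 10 hidden units follow the setting of Sect. 4, while the optimizer, number of epochs and batch size are illustrative.

```r
# LSTM forecaster sketch in the spirit of Eqs. 7-8. Assumptions: X_train has
# shape (samples, m, n) with lagged windows of the n series, Y_next has shape
# (samples, n) with the values one step ahead, and X_last is the latest (1, m, n) window.
library(keras)

model <- keras_model_sequential() %>%
  layer_lstm(units = 10, input_shape = c(m, n)) %>%   # recurrent hidden state H_t
  layer_dense(units = n)                              # linear read-out (Eq. 8)

model %>% compile(loss = "mse", optimizer = "adam")
model %>% fit(X_train, Y_next, epochs = 50, batch_size = 16, verbose = 0)

Y_hat <- model %>% predict(X_last)                    # one-step-ahead forecast of the n series
```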

3.2 Datasets Description

CAC40: The available data consists of 1645 data points for each of the 40 time series composing the French stock market index CAC40, from 02/01/2012 to 08/06/2018 (approximately 6 years and 5 months), in OHLC (Opening, High, Low, Closing) format.

Cryptocurrencies: The available data comes from the Kaggle dataset “Every Cryptocurrency Daily Market Price”Footnote 1, consisting of 785,024 observations of 1644 different cryptotokens from 28/04/2013 to 06/06/2018. However, the number of tokens with available data decreases as we move back in time, since most tokens have only been traded recently: the further we go into the past, the fewer values we have for our analysis, as depicted in Fig. 2. For these reasons, we restricted our analysis to the period from 28/01/2017 to 06/06/2018, for which we have complete OHLC data for 291 tokens.

Fig. 2. Number of available data points for the cryptocurrencies dataset as a function of time.

3.3 Volatility Proxies

The available OHLC data is composed of several quantities of interest, each on a daily time scale: \(P^{(o)}_t,P^{(c)}_t,P^{(h)}_t,P^{(l)}_t\), respectively the stock price at the opening and closing of the trading day, and the maximum and minimum price within the trading day. In the absence of detailed information about the price movements within a given trading day, stock volatility is not directly observable [30]. To cope with this problem, several different measures (also called proxies) have been proposed in the econometrics literature [16, 19, 24, 27] to capture this information. However, there is no consensus in the scientific literature on which volatility proxy should be employed for a given purpose. For an empirical analysis of the use of volatility proxies in the case of univariate forecasting, the interested reader can find more details in [8].

Volatility as Variance. The first family of proxies corresponds to the natural definition of volatility [27], that is, a rolling standard deviation of a given stock’s continuously compounded returns over a past time window of size w:

$$\begin{aligned} \sigma ^{SD,w}_t = \sqrt{\frac{1}{w-1} \sum _{i=0}^{w-1} (r_{t-i} - \bar{r}_w)^2} \end{aligned}$$
(9)

where

$$\begin{aligned} r_t = \ln \left( \frac{P^{(c)}_t}{P^{(c)}_{t-1}} \right) \end{aligned}$$
(10)

represents the daily continuously compounded return for day t, computed from the closing prices \(P^{(c)}_t\), and \(\bar{r}_w\) represents the average of the returns over the window \(\{t-w+1,\dots ,t\}\). In this formulation, w represents the degree of smoothing that is applied to the original time series. We will consider here \(w \in \{5,10,21\}\), representing respectively one week, two weeks and one month of trading.
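A direct R implementation of this proxy is sketched below; the function operates on a single vector of closing prices, and sd already applies the 1/(w-1) normalization of Eq. 9.

```r
# Rolling-window volatility proxy of Eq. 9 for one series of closing prices P_c.
sigma_sd <- function(P_c, w) {
  r <- diff(log(P_c))                        # continuously compounded returns (Eq. 10)
  sapply(seq_along(r), function(t) {
    if (t < w) return(NA)                    # not enough history for a full window
    sd(r[(t - w + 1):t])                     # rolling standard deviation
  })
}

vol_weekly <- sigma_sd(P_c, w = 5)           # w = 5: one trading week
```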

Volatility as a Proxy of the Coarse-Grained Intraday Information. The second family of proxies that we will consider, denoted \(\sigma ^{i}_t\), was analytically derived in [16] by incorporating supplementary information (i.e. the opening, maximum and minimum price of a given trading day) in order to improve the quality of the estimation. Among all the proxies defined there, we will focus on:

$$\begin{aligned} \sigma ^{0}_t = \left[ \ln \left( \frac{P^{(c)}_{t+1}}{P^{(c)}_{t}} \right) \right] ^2 = r_t^2 \end{aligned}$$
(11)
$$\begin{aligned} u = \ln \left( \frac{P^{(h)}_t}{P^{(o)}_t} \right)&d = \ln \left( \frac{P^{(l)}_t}{P^{(o)}_t} \right)&c = \ln \left( \frac{P^{(c)}_t}{P^{(o)}_t} \right) \end{aligned}$$
(12)

where u is the normalized high price, d is the normalized low price and c is the normalized closing price.

$$\begin{aligned} \sigma ^{4}_t = 0.511 (u-d)^2 - 0.019[c(u+d) - 2ud] - 0.383c^2 \end{aligned}$$
(13)
$$\begin{aligned} \sigma ^{6}_t = \underbrace{\frac{a}{f} \cdot \left[ \ln \left( \frac{P^{(o)}_{t+1}}{P^{(c)}_{t}} \right) \right] ^2}_{\text {Nightly volatility}} + \underbrace{\frac{1-a}{1-f} \cdot \sigma ^{4}_t}_{\text {Intraday volatility}} \end{aligned}$$
(14)

The value of \(f \in [0,1]\) represents the fraction of the trading day during which the market is closed. In the case of the CAC40, we have \(f > 1-f\), since trading only takes place for roughly one third of the day. Here, a is a weighting parameter whose optimal value, according to [16], is 0.17, regardless of the value of f.
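For illustration, the three proxies can be computed from daily OHLC vectors as in the following sketch; the vectors Po, Ph, Pl, Pc are assumed to be aligned series of equal length, and f and a follow the values discussed above.

```r
# Proxies of Eqs. 11, 13 and 14 for one asset. The forward-looking terms of
# Eqs. 11 and 14 leave the last element undefined (NA).
sigma_0 <- c(diff(log(Pc))^2, NA)                          # squared close-to-close return

u <- log(Ph / Po); d <- log(Pl / Po); cc <- log(Pc / Po)   # normalized high, low, close
sigma_4 <- 0.511 * (u - d)^2 - 0.019 * (cc * (u + d) - 2 * u * d) - 0.383 * cc^2

f <- 2/3; a <- 0.17                                        # CAC40: market closed ~2/3 of the day
night   <- c(log(Po[-1] / Pc[-length(Pc)])^2, NA)          # overnight close-to-open term
sigma_6 <- (a / f) * night + ((1 - a) / (1 - f)) * sigma_4
```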

After a preprocessing phase of the datasets, involving the removal of missing values and the computation of the proxies for each time series, the data is restructured into a multivariate time series matrix \(\mathbf {Y}\) having N (number of observations) rows and n (number of variables/time series) columns. For each proxy, this matrix is such that each row \(\mathbf {Y}_t\) represents an n-dimensional vector containing the value of the given proxy for each of the n variables at time t, and the scalar value \(Y_t[j]\) represents the value of the jth (\(j=1,\dots ,n\)) variable at time t.
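For instance, assuming a named list ohlc_list of per-asset OHLC data frames (with columns Po, Ph, Pl, Pc, aligned on the same dates), the \(\sigma ^4\) proxy matrix \(\mathbf {Y}\) can be assembled column by column as follows.

```r
# Build the N x n proxy matrix Y: one column per asset, one row per trading day.
Y <- sapply(ohlc_list, function(x) {
  u <- log(x$Ph / x$Po); d <- log(x$Pl / x$Po); cc <- log(x$Pc / x$Po)
  0.511 * (u - d)^2 - 0.019 * (cc * (u + d) - 2 * u * d) - 0.383 * cc^2
})
```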

4 Experimental Results

The experimental study assessed and compared the methods previously discussed in the article. The methods are listed below together with the software used for the experiments. Note that, for the sake of assessment, we set the lag \(m=2\) and the maximum number of latent factors to \(q=3\) for all methods, unless specified otherwise.

1. NAIVE: univariate baseline method using the last observed value for each time series as prediction for the following H steps.

2. UNI: univariate multi-step-ahead Direct forecasting of each individual series (Eq. 6) with a feature selection process based on correlation.

3. PLS: partial-least-squares forecasting (Sect. 3.1) implemented by the function mvr of the R package pls. The optimal values for the size of the input space and the number of principal components q are determined through an out-of-sample criterion.

4. RNN: recurrent neural network implemented by the keras_predict function of kerasRFootnote 2, an R interface to the keras Deep Learning libraryFootnote 3 for Theano. The network is a fully-connected RNN with 10 hidden units. Since an automated setting of the number of units would not have been feasible due to excessive computational time, this number has been set on the basis of trial and error over a small number of synthetic series.

5. LSTM: as RNN, the model is a fully-connected recurrent network with 10 hidden units, implemented using kerasR. It differs from RNN in that it employs LSTM cells [20] in the hidden layer instead of regular neurons.

6. DFM: linear Dynamic Factor Model where PCA is used for factor estimation, the number of factors is set to q and the forecasting of the factors is carried out with a VAR method implemented by the estBlackBox function of the R package dse. The batch PCA is computed using the base R eigen function.

7. DFML\(_{PCA}\): Dynamic Factor Machine Learner where PCA is used for factor estimation, the number of factors is set to q and the forecasting of each factor is carried out in a univariate manner using a local learning predictor (lazy learning [5]) and a multi-step-ahead Direct strategy.

8. DFML\(_{A}\): it differs from DFML\(_{PCA}\) by the use of an autoencoder instead of PCA in the process of factor estimation.

9. DFML’\(_{PCA}\): it differs from DFML\(_{PCA}\) by the automatic selection strategy (described in [4]): the number of factors (in the range [1, q]), the multi-step-ahead strategy (among Direct, Iterated and MIMO) and the lag m are selected by an out-of-sample strategy carried out on the training set.

4.1 Results Discussion

For each multivariate dataset we performed time series cross-validation following a rolling origin strategy [29]. The size of the training set is 2N/3 and a sequence of 50 different test sets of length H is considered.

For each test set, all methods are assessed in terms of the average Normalized Mean Squared Error:

$$ {\text {NMSE}}=\frac{\sum _{j=1}^n \text{ NMSE }[j] }{n}$$

where

$$ {\text {NMSE}}[j]=\frac{ \sum _{h=1}^H (Y_{T+h}[j]-\hat{Y}_{T+h}[j])^2}{ V[j] H} $$

V[j] is the variance of the series Y[j] and \(T+1\) is the starting index of the continuation set.
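A sketch of how these scores can be computed is given below; the nmse function follows the definition above, while the rolling-origin loop is only indicative, since the exact windowing of the 50 test sets follows the setup of [29] described in the text. The forecaster function is a hypothetical placeholder for any of the methods listed above.

```r
# Average NMSE for one continuation set. Y_true and Y_hat are H x n matrices,
# V a vector of per-series variances estimated on the training data.
nmse <- function(Y_true, Y_hat, V) {
  per_series <- colSums((Y_true - Y_hat)^2) / (V * nrow(Y_true))
  mean(per_series)
}

# Indicative rolling-origin evaluation: 50 origins starting after 2N/3 observations.
origins <- floor(2 * nrow(Y) / 3) + seq_len(50) - 1
scores  <- sapply(origins, function(T0) {
  train <- Y[1:T0, , drop = FALSE]
  test  <- Y[(T0 + 1):(T0 + H), , drop = FALSE]
  nmse(test, forecaster(train, H), apply(train, 2, var))   # forecaster(): placeholder
})
mean(scores)
```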

When dealing with high dimensionality (\(n=291\)) coupled with a relatively low number of observations (\(N=495\)), as in the case of the cryptocurrencies dataset (Table 1), and using the \(\sigma ^i_t\) family of proxies, the DFML clearly outperforms all competing methods, even without hyperparameter optimisation. It should also be noted that some methods tested in the original DFML paper [4] (i.e. VAR, DSE, SSA) could not be tested due to numerical problems related to the limited number of available observations. The performance of DFML is mitigated when using proxies from the \(\sigma ^{SD,w}_t\) family: as the smoothing provided by the window size parameter w increases, the performance of the Naive method improves, even for forecasting horizons up to 20 steps ahead. In both cases, a linear dimensionality reduction technique without optimization (DFM, DFML\(_{PCA}\)) is shown to improve the performance of the forecaster, compared to nonlinear (DFML\(_{A}\)) and optimized (DFML’\(_{PCA}\)) ones.

A similar ranking among the methods is observed for the CAC40 dataset (Table 2), characterized by a lower dimensionality (\(n=40\)) but a higher number of points (\(N=1641\)). Here we observe a generally higher average NMSE, indicating a more difficult forecasting problem. For the \(\sigma ^i_t\) family, PLS and DFM appear as competitive alternatives to the DFML, especially for longer horizons (\(h>15\)). As in the previous case, for the \(\sigma ^{SD,w}_t\) family of proxies, the performance of the DFML family is affected by the value of the smoothing factor w: the higher the smoothing factor, the less effective the DFML becomes for shorter horizons, with the Naive method becoming the best one, although the DFML still maintains good forecasting accuracy for longer horizons.

Fig. 3. Total computational time (model training + forecast) of the tested methods on the CAC40 - \(\sigma ^4\) (a) (\(n=40\)) and cryptocurrencies - \(\sigma ^4\) (b) (\(n=291\)) dataset-proxy combinations.

Table 1. Cryptocurrencies - volatility time series: NMSE (averaged over all the continuation sets) of the different forecasting methods. The bold notation is used to highlight all techniques which are not significantly worse (pv = 0.05) than the one with the lowest NMSE score.
Table 2. CAC40 - volatility time series: NMSE (averaged over all the continuation sets) of the different forecasting methods. The bold notation is used to highlight all techniques which are not significantly worse (pv = 0.05) than the one with the lowest NMSE score.

In addition to forecasting accuracy, we also analyzed the total computational time required to produce a forecast, obtained by summing the time required to train the considered model and the time needed to generate a forecast. Figure 3a shows that, for low dimensionalities (\(n=40\)), the total computational time of the different techniques is comparable and independent of the forecasting horizon, except for the optimized DFML’\(_{PCA}\), where the comparison of different forecasting strategies requires a computational time proportional to the length of the forecasting horizon. On the other hand, for higher dimensionalities (\(n=291\)), the computational time required to train multiple univariate models (UNI), neural models (RNN and LSTM) and PLS increases considerably with both dimensionality and forecasting horizon, while DFML models, thanks to the dimensionality reduction component, maintain a reduced computational time regardless of the forecasting horizon.

5 Conclusion and Future Work

The empirical analysis shows that DFML is able to produce accurate volatility forecasts, especially in the case of high-dimensional noisy series (i.e. the cryptocurrencies dataset) with non-smoothed volatility proxies \(\sigma ^{i}\), by summarizing the intrinsic market correlations well in a reduced number of factors. However, the presence of a smoothing factor (as in the \(\sigma ^{SD,w}\) family of proxies) is shown to worsen the performance of the DFML methods. Moreover, we have shown that, thanks to the dimensionality reduction component, DFML methods can produce multi-step-ahead forecasts with the same accuracy as competing methods at a greatly reduced computational cost. In order to further improve this framework, we foresee different possible extensions. On the one hand, we believe that the use of additional volatility proxies, together with an automated variable selection process, could further improve the forecasting performance. On the other hand, the use of incremental dimensionality reduction techniques could further improve the computational efficiency of the method at the expense of a small reduction in forecasting accuracy.