1 Introduction

Modeling and forecasting financial time series remains a very difficult task [6, 16, 17, 18, 21]. Techniques to address this problem fall into two categories: analytical models and Machine Learning. On the one hand, analytical techniques include statistical and stochastic models, such as Linear Regression (LR), Multiple Linear Regression (MLR), Autoregressive Integrated Moving Average (ARIMA), Generalized Autoregressive Conditional Heteroskedasticity (GARCH/N-GARCH), Brownian Motion (BM), Diffusion, Poisson and Lévy Processes. On the other hand, Machine Learning Techniques (MLT) include Decision Trees, Artificial Neural Networks (ANN), Fuzzy Systems, Kernel Methods, Support Vector Machines (SVM) and, more recently, Deep Neural Networks (DNN), an extension of Artificial Neural Networks [6, 16, 20].

We selected Deep Learning because MLT cope well with non-linearities and complexities such as those present in financial data, and because they can handle large amounts of data, such as High-Frequency Data (HFD). Moreover, we decided to use wavelets as a feature generator because of the sequential behavior of HFD: many transactions occur at the same price, and under normal market conditions price changes (price jumps) should not exhibit high variance. Finally, Deep Learning has been successfully applied in many different fields, including price forecasting; we therefore expect this kind of representation to improve on the results previously achieved in [1].

The paper is organized as follows: Sect. 2 presents a theoretical overview of Time Series, ARIMA, Artificial Neural Networks, and Wavelets. Section 3 presents the proposed model to forecast one-minute pseudo-log-returns. Section 4 presents some baseline models which are used to compare performance against the proposed method. Section 5 presents final results and their analysis. Finally, Sect. 6 presents conclusions and recommendations for future research.

2 Theory Overview

2.1 Financial Time Series

A time series is a sequence of successive data points with a time order. A Financial Time Series (FTS) is a sequence of financial data points, such as price, volume (quantity of a financial asset), or any transformation of these. An FTS is a non-stationary process. Formally, an FTS X is denoted: \(X = \{X_t : t \in T\}\).

Where T is an index set, where each element is labeled by a date time stamp, and it is associated only with one data point for a specific financial asset.

Forecasting models seek to predict the next value of the series X without knowledge of the future. F, the predicted time series, is the sequence of predicted values. To assess the performance of a model, it is necessary to measure the similarity between X and F. Popular similarity measures for time series include [14]:

  • Mean Squared Error (MSE): A scale dependent measure.

    $$\begin{aligned} MSE = \frac{1}{n} \sum _{t=1}^{n}{(X_t - F_t)^2} \end{aligned}$$
    (1)
  • Directional Accuracy (DA): A scale independent measure. Percentage of predicted directions that match those of the original time series. DA is widely used in finance [22].

    $$\begin{aligned} DA = \frac{100\%}{n-1} \sum _{t=2}^{n}{1_{sign(X_t - X_{t-1}) = sign(F_t - F_{t-1})}} \end{aligned}$$
    (2)

    Where 1 is an indicator function: \( 1_A = f(x) = {\left\{ \begin{array}{ll} 1, &{} A=\text {True}\\ 0, &{} A=\text {False} \end{array}\right. } \)
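
As an illustration, both measures can be computed in a few lines. The following is a minimal sketch using NumPy; the function names and the toy series are assumptions made for this example and are not part of the original work.

```python
import numpy as np

def mse(x, f):
    """Mean squared error between observed series x and forecast f (Eq. 1)."""
    x, f = np.asarray(x, dtype=float), np.asarray(f, dtype=float)
    return float(np.mean((x - f) ** 2))

def directional_accuracy(x, f):
    """Percentage of steps whose predicted direction matches the observed one (Eq. 2)."""
    x, f = np.asarray(x, dtype=float), np.asarray(f, dtype=float)
    # np.diff yields X_t - X_{t-1} for t = 2..n, i.e. n - 1 directions.
    same_direction = np.sign(np.diff(x)) == np.sign(np.diff(f))
    return float(100.0 * np.mean(same_direction))

# Toy series, made up for illustration only.
observed = [1.0, 2.0, 1.5, 1.8, 1.7]
forecast = [1.0, 1.5, 1.6, 1.9, 1.5]

print(mse(observed, forecast))                   # scale dependent
print(directional_accuracy(observed, forecast))  # scale independent, in percent
```

In this toy example three of the four predicted directions agree with the observed ones, so DA is 75%, while the MSE depends on the units of the series.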

A common transformation to make an FTS stationary consists of computing the log-return or pseudo-log-return series.

Log-Return. Let \(p_{t}\) be the current closing price and \(p_{t-1}\) the previous closing price.

$$\begin{aligned} R = \ln {\frac{p_{t}}{p_{t-1}}} \cdot 100\% = (\ln {p_{t}} - \ln {p_{t-1}}) \cdot 100\% \end{aligned}$$
(3)

From a log-return R, the price \(p_{t}\) can be reconstructed as follows: \(p_{t} = p_{t-1} \cdot e^{\frac{R}{100\%}}\)

Pseudo-Log-Return. It is a logarithmic difference (log of quotient) between average prices on consecutive minutes. Let \(\overline{p_{t}}\) be the current one-minute average price and \(\overline{p_{t-1}}\) the previous one-minute average price.

$$\begin{aligned} \hat{R} = \ln {\frac{\overline{p_{t}}}{\overline{p_{t-1}}}} \cdot 100\% = (\ln {\overline{p_{t}}} - \ln {\overline{p_{t-1}}}) \cdot 100\% \end{aligned}$$
(4)

Pseudo-returns can be reconstructed just as log-returns: \(\overline{p_{t}} = \overline{p_{t-1}} \cdot e^{\frac{\hat{R}}{100\%}}\)
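
The transformation and its inverse can be sketched as follows. This is a hedged example using NumPy; the helper names and the price values are hypothetical and only serve to show that the reconstruction recovers the original average prices.

```python
import numpy as np

def pseudo_log_returns(avg_prices):
    """Pseudo-log-returns in percent from consecutive one-minute average prices (Eq. 4)."""
    p = np.asarray(avg_prices, dtype=float)
    return 100.0 * np.diff(np.log(p))

def reconstruct_prices(first_price, returns_pct):
    """Rebuild the average-price series from its first value and the percent returns."""
    factors = np.exp(np.asarray(returns_pct, dtype=float) / 100.0)
    # Each price is the previous one times exp(R / 100), so prices are cumulative products.
    return first_price * np.concatenate(([1.0], np.cumprod(factors)))

avg_prices = np.array([100.00, 100.05, 100.02, 100.10])  # made-up one-minute averages
r_hat = pseudo_log_returns(avg_prices)
rebuilt = reconstruct_prices(avg_prices[0], r_hat)
```

The same code applies to log-returns if closing prices are passed instead of average prices.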

2.2 ARIMA

Traditionally, econometric models have dominated the forecasting arena, where linear statistical methods such as ARIMA(p, d, q) (Auto-Regressive Integrated Moving Average with orders p, d and q) are the most frequently used. In general, these models are a set of discrete-time linear equations with noise, of the form:

$$\begin{aligned} \bigg (1-\sum _{k=1}^{p}\alpha _kL^k\bigg ) (1-L)^ dX_t=\bigg (1-\sum _{k=1}^{q}\beta _kL^k\bigg ) \epsilon _t. \end{aligned}$$
(5)

Here L is the lag operator (\(LX_t = X_{t-1}\)) and \(\epsilon _t\) is the noise term. In particular, the ARIMA forecasting equation is a linear regression-type equation in which the predictors consist of lags of the dependent variable and lags of the forecast errors, where p is the number of autoregressive terms, d is the number of nonseasonal differences needed for stationarity, and q is the number of lagged forecast errors in the prediction equation.
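
To make the regression-type structure concrete, the sketch below evaluates the one-step-ahead forecast implied by Eq. (5) for the special case d = 1, assuming the coefficients are already known. The function name, the coefficient values and the toy series are assumptions for illustration; estimating the coefficients is not covered here.

```python
import numpy as np

def arima_one_step(x, alpha, beta, residuals):
    """One-step-ahead forecast of an ARIMA(p, 1, q) model with known coefficients.

    x         : observed series, oldest value first
    alpha     : AR coefficients alpha_1..alpha_p
    beta      : MA coefficients beta_1..beta_q (sign convention of Eq. 5)
    residuals : past forecast errors of the differenced series, oldest first
    """
    x = np.asarray(x, dtype=float)
    y = np.diff(x)                                   # (1 - L) X_t with d = 1
    ar = sum(a * y[-k] for k, a in enumerate(alpha, start=1))
    ma = sum(b * residuals[-k] for k, b in enumerate(beta, start=1))
    y_next = ar - ma                                 # the future noise term has zero mean
    return float(x[-1] + y_next)                     # undo the differencing

# Toy values, made up for illustration only.
forecast = arima_one_step([10.0, 11.0, 13.0, 14.0], alpha=[0.5], beta=[0.2], residuals=[0.1])
```

With these values the last difference is 1.0, so the forecast is 14 + 0.5 · 1.0 − 0.2 · 0.1 = 14.48.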

Despite the relative success of these models, they have a low capacity to capture market movements, given the complexities and non-linear relationships exhibited in financial markets [6]. For these reasons, machine learning methods have emerged as an important alternative for this kind of problem, since they are able to recognize complex patterns and to process large amounts of data.

2.3 Artificial Neural Networks

The first class of ANN was the Feed-forward Neural Network (FNN), which has multiple neurons connected to each other, but with no cycles or loops in the network. Therefore, the information always moves forward from input to output nodes. A Multilayer Perceptron (MLP), which is a FNN subtype, has an input layer, one or more hidden layers, and one output layer. Each layer has a finite number of neurons that are fully connected to all neurons in the next layer [11].

Since the late 1980s, techniques using ANNs have been widely used to forecast financial time series, due to their ability to extract essential features and to learn complex information patterns in high dimensional spaces [8, 16, 21, 23].

Deep learning (DL) models have demonstrated greater effectiveness in both classification and prediction tasks, in domains as different as video analysis, audio recognition, text analysis and image processing. Models based on DL attract the interest of the general public because they are able to learn useful representations from raw data, mitigating the local minimum issue of ANNs by learning in a layered way, using a combination of supervised and unsupervised learning to adjust the weights W. Despite these advantages, DL applications in computational finance are limited [3, 4, 7, 24, 25, 28].

Within DL there is a wide variety of architectures; the simplest one uses an MLP. However, for the purpose of this paper, we use more complex ones such as Recurrent Neural Networks (RNN), Gated Recurrent Units (GRU) and Long Short-Term Memory (LSTM). A RNN is an ANN that has connections from output units to input ones, such that a directed cycle is formed. Under this architecture, the output data are fed back into the network; this allows modeling temporal dynamics by storing previous inputs or outputs in an internal memory [11, 13, 19]. The first known application of a RNN in finance, “Stock price pattern recognition-a recurrent neural network approach”, was published in [15].

Historically, RNNs have been better at learning time series, because they are designed to identify patterns through time. However, they are more complex than the MLP, and therefore more difficult to train. In 1997, the Long Short-Term Memory (LSTM), a kind of RNN, was proposed in [13]. It solved some issues of recurrent networks related to learning long-term dependencies; LSTMs are capable of learning both long and short-term dependencies in a balanced way [10]. More recently, [5] proposed the Gated Recurrent Unit (GRU), a variation of LSTM. It combines several internal components of the LSTM, which makes it simpler than LSTM because it has fewer parameters to be fitted during the training phase.
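
The difference in size can be quantified for the textbook formulations of both cells, in which an LSTM layer has four weight blocks (three gates plus the candidate state) and a GRU layer has three (two gates plus the candidate state). The sketch below counts parameters under that assumption; the layer sizes are illustrative only, and specific library implementations may add extra bias terms.

```python
def recurrent_params(n_inputs, n_units, n_blocks):
    """Each block has input weights, recurrent weights and one bias vector."""
    return n_blocks * n_units * (n_inputs + n_units + 1)

def lstm_params(n_inputs, n_units):
    # Input, forget and output gates plus the candidate cell state.
    return recurrent_params(n_inputs, n_units, n_blocks=4)

def gru_params(n_inputs, n_units):
    # Update and reset gates plus the candidate hidden state.
    return recurrent_params(n_inputs, n_units, n_blocks=3)

# Illustrative sizes (assumed): 27 inputs feeding 22 recurrent units.
lstm_count = lstm_params(27, 22)
gru_count = gru_params(27, 22)
```

For equal layer sizes, a GRU layer therefore has three quarters of the parameters of an LSTM layer.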

2.4 Wavelets

A wavelet is a wave function with an average value of zero. One key difference from sinusoids is duration: wavelets, unlike sinusoids, have a finite duration, that is, they have a beginning and an end [9]. Wavelets are a useful way to extract key features from signals in order to reproduce them without having to save or keep the complete signal. Moreover, wavelets possess additional advantages that help to overcome the non-stationarity associated with financial time series. To provide a theoretical background on wavelets, it is important to start with Multi-resolution Analysis (MRA).

Definition 1:

A MRA on \(\mathbf{R}\) is a sequence of subspaces \(V_j\), \(j \in \mathbf{Z}\), of functions in \(L^2\) on \(\mathbf{R}\) that satisfies the following properties [12, 26]:

  • \(\forall j \in \mathbf{Z},\ V_j \subset V_{j+1}\)

  • If f(x) is \(C^0_c\) on \(\mathbf{R}\), then \(f(x) \in \overline{\mathrm{span}}\{V_j\}_{j \in \mathbf{Z}}\); that is, given \(\epsilon > 0\), \(\exists \, j \in \mathbf{Z}\) and a function \(g(x) \in V_j\) such that \(\Vert f-g\Vert _2 < \epsilon \)

  • \(\bigcap _{j \in \mathbf{Z}} V_j = \{0\}\)

  • A function \(f(x) \in V_j\) if and only if \(f(2x) \in V_{j+1}\), \(\forall j \in \mathbf{Z}\)

  • \(\exists \) a function \(\varphi (x)\) in \(L^2\) on \(\mathbf{R}\), called the scaling function, such that \(\{\tau _n\varphi (x)\}\) is an orthonormal system of translates and \(V_0 = \overline{\mathrm{span}}\{\tau _n\varphi (x)\}\)

MRA allows an exact calculation of the wavelet coefficients for an \(L^2\) function. Let \(\{V_j\}\) be an MRA with scaling function \(\varphi (x)\); then [26]:

  • \(\varvec{\varphi (x)} = \sum _{\varvec{n}}{\varvec{h(n)\varphi _{1,n}(x)}}\), is the scaling function.

  • \(\varvec{\psi (x)}=\sum _{\varvec{n}}{\varvec{g(n)\varphi _{1,n}(x)}}\), is the corresponding wavelet.

Where \(\varvec{g(n)}=\varvec{(-1)^n}\overline{\varvec{h(1-n)}}\), is the wavelet filter.

Wavelet transforms can be either continuous or discrete. Since financial time series are discrete, the Discrete Wavelet Transform (DWT) is more suitable to filter the data. The DWT is defined as follows [26]:

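Given a signal \(c_0(n)\), its DWT is the collection of sequences \(\{d_j(k): 1 \le j \le J;\ k \in \mathbf{Z}\} \cup \{c_J(k): k \in \mathbf{Z}\}\), where

$$\begin{aligned} c_{j+1}(k) = \sum _{n}{c_j(n)\,\overline{h(n-2k)}}, \qquad d_{j+1}(k) = \sum _{n}{c_j(n)\,\overline{g(n-2k)}} \end{aligned}$$
(6)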
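
As an illustration, one level of the DWT can be sketched with the Haar filter, for which h = (1/√2, 1/√2) and, by the wavelet filter relation above, g = (1/√2, −1/√2). The function names and the toy signal below are assumptions for this example; note that some wavelet libraries use the opposite sign for the wavelet coefficients.

```python
import numpy as np

def haar_step(c):
    """One DWT level with the Haar filter h = (1/sqrt(2), 1/sqrt(2))."""
    c = np.asarray(c, dtype=float)
    if len(c) % 2:
        raise ValueError("this sketch expects an even-length signal")
    even, odd = c[0::2], c[1::2]
    scaling = (even + odd) / np.sqrt(2.0)   # next-level scaling coefficients
    wavelet = (even - odd) / np.sqrt(2.0)   # next-level wavelet coefficients
    return scaling, wavelet

def haar_inverse_step(scaling, wavelet):
    """Invert haar_step, recovering the signal of the previous level."""
    out = np.empty(2 * len(scaling))
    out[0::2] = (scaling + wavelet) / np.sqrt(2.0)
    out[1::2] = (scaling - wavelet) / np.sqrt(2.0)
    return out

# Stepped toy signal, made up for illustration only.
signal = np.array([4.0, 4.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0])
c1, d1 = haar_step(signal)
restored = haar_inverse_step(c1, d1)
```

For a stepped signal such as this one, the wavelet coefficients are zero wherever two consecutive samples are equal and non-zero only at a jump, and the transform is exactly invertible.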

Wavelet analysis offers the following advantages [12]:

  • It does not require a strong assumption about the data generation process.

  • It provides information in both time and frequency domains.

  • It has the ability to locate discontinuities in the data.

However, it has some disadvantages [12]: It requires that the length of the data be \(\varvec{2^j}\). It is not shift invariant. Finally, it may shift data peaks, causing wrong approximations when compared to the original data.

Because a simple DNN achieved good predictability on high-frequency data for two different NYSE equities in a previous work [1], we use the DWT as a feature selector, in order to test whether the DNN’s forecasting capabilities improve when its inputs include features extracted from a wavelet analysis of price sequences.

3 Proposed Method

3.1 Dataset Description

Data from the Dow Jones Industrial Average (Dow30) are used. Dow30 is the second-oldest U.S. market index and consists of 30 major US companies. Tick-by-tick data from 19 randomly chosen companies are used. The data cover the period from January \(1^\mathrm{st}\), 2015 to July \(31^\mathrm{st}\), 2017. Table 1 shows some dataset descriptors.

Table 1. Dataset description

3.2 Preprocessing

HFD implies a high transaction frequency (microseconds to seconds); our approach therefore includes a representation of a typical minute of this kind of data. Prices can vary dramatically in financial markets; however, given the liquidity of the Dow30 instruments, under normal market conditions they should exhibit the following characteristics:

  • There are many transactions, occurring within short intervals (usually milliseconds/microseconds).

  • Prices should not have much volatility between ticks.

  • There should not be many different prices.

  • Price jumps should not be very wide.

Fig. 1. Stock sequences

This means that, over a typical one-minute period, any instrument should exhibit similar behavior. The larger a price jump (measured in cents), the lower its average number of occurrences, and the less liquid stocks usually exhibit bigger jumps between ticks. There is a strong correlation between the average number of ticks per minute and the number of streaks, that is, sequences of ticks occurring at the same price (\(P_t = P_{t-1}\)). Moreover, nearly 50% of the ticks have the same price in most of the stocks, except for the most liquid ones (AAPL, CSCO, INTC), where this measure is near 72% on average. We infer that price discovery is performed more efficiently in these stocks; as a result, there is a larger proportion of ticks at the same price. In contrast, less liquid stocks exhibit a lower percentage of ticks at the same price and wider price jumps between ticks. Graphically, this behavior is seen in Fig. 1a and b: the charts have a step style that makes them very similar. This kind of behavior is suitable for wavelet decomposition, since the signal exhibits stepped changes between points. This is the reason for using the DWT and choosing wavelet coefficients as inputs: on HFD, traded prices can be interpreted as price intervals. Next, we describe in detail the steps to select inputs for the DNN.

All data were summarized at a one-minute level. For each minute, the one-minute pseudo-log-return is calculated. Average prices are calculated from all of the ticks for the particular minute. Moreover, the tick-by-tick series is compressed to 8 values using the DWT.

A log transformation is applied to all tick-by-tick prices within a minute, and then iterated differences are computed. The resulting time series is decomposed down to the last possible level using the DWT with a Haar filter. Finally, the wavelet and scaling coefficients from the last two levels are selected. The final result is a vector of 8 values: 4 wavelet coefficients and 4 scaling coefficients.

The last level is composed of only 2 coefficients (1 wavelet coefficient and 1 scaling coefficient). The level before it can be composed of either 4 or 6 coefficients, depending on the number of transactions made in a particular minute. When it has 6 coefficients, the last two levels together contain 8 coefficients. When it has only 4 coefficients, 2 zeros are appended to the vector in order to obtain 8 values. Table 2a shows the decomposition of the numbers ranging from 1 to 12. Table 2b shows the decomposition of the numbers ranging from 1 to 8.
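
The compression step can be sketched as follows. This is a hedged illustration, not the original implementation: it assumes that a trailing unpaired sample is dropped when a level has odd length, that the vector is ordered as wavelet and scaling coefficients of the second-to-last level followed by those of the last level, and that the padding zeros go at the end. The function names are hypothetical, and the input is the series obtained after the log transformation and differencing described above (plain integers are used here to mirror the two cases of Table 2).

```python
import numpy as np

def haar_level(c):
    """One Haar DWT level; a trailing unpaired sample is dropped (assumption)."""
    c = np.asarray(c, dtype=float)
    c = c[: len(c) - len(c) % 2]
    even, odd = c[0::2], c[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def compress_to_8(series):
    """Wavelet and scaling coefficients of the last two levels, zero-padded to 8 values."""
    levels = []                                   # (wavelet, scaling) pairs, one per level
    c = np.asarray(series, dtype=float)
    while len(c) >= 2:                            # decompose until the last possible level
        c, d = haar_level(c)
        levels.append((d, c))
    parts = [coef for pair in levels[-2:] for coef in pair]
    features = np.concatenate(parts) if parts else np.zeros(0)
    return np.pad(features, (0, 8 - len(features)))  # append zeros up to 8 values

f12 = compress_to_8(np.arange(1, 13))   # 12 -> 6 -> 3 -> 1: 6 + 2 coefficients, no padding
f8 = compress_to_8(np.arange(1, 9))     # 8 -> 4 -> 2 -> 1: 4 + 2 coefficients, 2 zeros
```

Under these assumptions, the second-to-last level always contributes 4 or 6 coefficients, so the result never exceeds 8 values.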

Table 2. Compressed tick-by-tick series using DWT: case I

Fig. 2. Example of 60 one-minute prices.

3.3 Modeling

Figure 2 shows the average price, a good descriptor of market behavior. The highest and lowest prices are usually found within a confidence range around the average price; therefore, the next highest and lowest prices can be estimated from a predicted average price. The closing price is the last price within a minute, and it may even coincide with the highest or lowest price. Unlike the average price, the highest, lowest and closing prices are largely exposed to market dynamics.

Since the objective of this work is to build the best possible forecaster, average prices could be more suitable for this purpose than closing prices. With a good average price forecast, traded prices can be expected to match the predicted average at some point during the next minute.

The selected inputs consist of 27 values: the last 3 one-minute pseudo-log-returns and the last 3 compressed tick-by-tick time series (8\(\,\times \,\)3). The selected output is the next one-minute pseudo-log-return.
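
The assembly of the 27 inputs can be sketched as a sliding window over the one-minute series. The function name, the array layout and the alignment (row t of each array refers to minute t) are assumptions made for this example.

```python
import numpy as np

def build_samples(returns, compressed, window=3):
    """Build (inputs, targets) pairs from one-minute series.

    returns    : shape (T,), one pseudo-log-return per minute
    compressed : shape (T, 8), one compressed tick-by-tick series per minute
    Each input holds `window` returns followed by `window` x 8 coefficients.
    """
    returns = np.asarray(returns, dtype=float)
    compressed = np.asarray(compressed, dtype=float)
    inputs, targets = [], []
    for t in range(window, len(returns)):
        past_returns = returns[t - window:t]
        past_coeffs = compressed[t - window:t].ravel()
        inputs.append(np.concatenate([past_returns, past_coeffs]))
        targets.append(returns[t])                # next one-minute pseudo-log-return
    return np.array(inputs), np.array(targets)

# Random placeholder data, for shape checking only.
rng = np.random.default_rng(0)
X, y = build_samples(rng.normal(size=10), rng.normal(size=(10, 8)))
```

With 10 minutes of data and a window of 3, this yields 7 samples of 27 values each.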

The network architecture consists of one input layer, 5 hidden layers, and one output layer. The number of neurons in each successive hidden layer decreases by a constant step of \(\frac{1}{5}\) of the input size (27, 22, 17, 11 and 6 neurons, respectively). The output layer has one neuron. All neurons in the hidden layers use a Tanh activation function, whereas the output neuron uses a Linear activation function. Figure 3 shows the architecture overview.
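
A minimal sketch of this topology is given below, using plain NumPy for the forward pass. The rounding rule (rounding up) is an assumption that happens to reproduce the stated layer sizes; the weight initialization and function names are also assumptions, and training is not shown.

```python
import math
import numpy as np

def hidden_layer_sizes(n_inputs=27, n_hidden=5):
    """Layer k keeps (n_hidden - k) / n_hidden of the input size, rounded up (assumption)."""
    return [math.ceil(n_inputs * (n_hidden - k) / n_hidden) for k in range(n_hidden)]

def init_params(sizes, rng):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Tanh activations in the hidden layers, linear activation in the output layer."""
    a = x
    for W, b in params[:-1]:
        a = np.tanh(a @ W + b)
    W, b = params[-1]
    return a @ W + b

hidden = hidden_layer_sizes()
sizes = [27] + hidden + [1]                       # inputs, 5 hidden layers, 1 output neuron
params = init_params(sizes, np.random.default_rng(0))
prediction = forward(params, np.zeros((4, 27)))   # batch of 4 placeholder inputs
```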

Fig. 3. DNN’s architecture

4 Experiments

The proposed method will be compared against the following methods or models, in order to verify its performance (Table 3).

Table 3. Experiment summary

ARIMA: Although it performs well on in-sample datasets, it performs poorly on out-of-sample datasets. The time series were rescaled to a logarithmic scale to help stabilize strong growth trends. Then, many ARIMA models were fitted, guided by the Augmented Dickey-Fuller test, the autocorrelation function (ACF) and the partial autocorrelation function (PACF). Finally, for each dataset, the ARIMA model with the lowest Akaike Information Criterion (AIC) was selected.

DNN: [1, 2] proposed a DNN that also forecasts the next one-minute pseudo-log-return. The DNN’s inputs are composed of four groups: the current time (hour and minute), the last \(n\) one-minute pseudo-log-returns, the last \(n\) one-minute standard deviations of prices and the last \(n\) one-minute trend indicators, where \(n\) is the window size. The trend indicator is a statistical measure computed as the slope of a linear model (\(price = at + b\)) fitted on the transaction prices of a particular minute. The LSTM and GRU architectures have only two layers, whereas our base work uses a five-layer network. The main reason for changing the number of layers was training time: given the memory effect of GRU and LSTM, these topologies are more complex, so reducing the number of layers decreases training time.

5 Results and Analysis

The dataset was split into two parts: an in-sample dataset (first 85%) and an out-of-sample dataset (last 15%). For each symbol and machine learning model, ten networks were trained, and the average error was calculated and reported. Overall, all networks had homogeneous and stable results.
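
Because the split follows time order, no shuffling is involved. A minimal sketch, with a hypothetical helper name and placeholder data, is:

```python
def chronological_split(n_samples, in_sample_pct=85):
    """Index ranges for the first in_sample_pct percent and the remaining samples."""
    cut = n_samples * in_sample_pct // 100
    return slice(0, cut), slice(cut, n_samples)

in_sample, out_sample = chronological_split(1000)
minutes = list(range(1000))                # placeholder for time-ordered samples
train, test = minutes[in_sample], minutes[out_sample]
```

Every out-of-sample observation is therefore later in time than every in-sample observation.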

Figure 4 shows the Model Performances during the training and testing phases. Figure 5a shows the Model Performances per Symbol during the training and testing phases, whereas Fig. 5b shows the DA performance on a wider scale.

Fig. 4. Model’s performance during the training and testing phases

Fig. 5. Model’s performance per symbol during the training and testing phases

Fig. 6. Average number of ticks per minute vs. DA performance

The analysis of model performance on the out-of-sample datasets shows that machine learning techniques performed much better than ARIMA. Moreover, networks using the DWT are slightly better than the one without it: on average, DNN without DWT, DNN with DWT, GRU with DWT and LSTM with DWT achieved an MSE of 0.002026, 0.001963, 0.001941 and 0.001939, and a DA of 65.27%, 67.38%, 67.74% and 67.72%, respectively, during the testing phase. The best model in terms of DA was GRU, though its performance is almost equal to that of LSTM, which has a marginally lower MSE.

In practice, GRU is preferable to LSTM: although both achieve practically the same performance, GRU is less complex and has fewer parameters, so training time is reduced and the method is more suitable for use in a real environment.

Overall, DNNs can learn market dynamics with reasonable precision and accuracy on out-of-sample datasets. However, DA performance may be correlated with liquidity. For instance, the symbols AAPL, CSCO and INTC, which exhibit the most liquidity, achieved a much better DA performance, close to 72%, 68%, and 68%, respectively. In contrast, less liquid symbols such as MMM (10 ticks/minute) had a lower DA performance, close to 64%. It is important to mention that more liquidity means more data, the raw material for Machine Learning Techniques.

A Pearson Correlation Test and a Spearman Correlation Test were performed in order to verify the dependency between the average number of ticks per minute and the DA performance achieved by each machine learning model on all out-of-sample datasets. The Pearson correlations achieved by DNN without DWT, DNN with DWT, GRU with DWT and LSTM with DWT over all symbols were 0.4407459, 0.4524886, 0.4695877 and 0.4648353, respectively, whereas the corresponding Spearman correlations were 0.7684211, 0.7368421, 0.7649123 and 0.7701754, respectively.

The Spearman correlation, being considerably higher than the Pearson correlation, suggests that there is a monotonic but non-linear relationship between liquidity and DA performance. Hence, the higher the liquidity, the more effective the proposed model. It is important to clarify that model effectiveness will plateau at some unknown point; in other words, the proposed models would never reach 100% precision, regardless of the liquidity of a particular instrument.
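
The reasoning behind this interpretation can be illustrated with synthetic data: when one variable grows monotonically but non-linearly with another, the rank-based Spearman coefficient stays at 1 while the Pearson coefficient falls below 1. The sketch below uses NumPy only; the helper names and the data are made up for illustration and are not the values reported above.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def ranks(v):
    """Ranks 1..n of the values in v; ties are not averaged in this sketch."""
    order = np.argsort(v)
    r = np.empty(len(v))
    r[order] = np.arange(1, len(v) + 1)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(np.asarray(x)), ranks(np.asarray(y)))

# Synthetic example: DA grows with the logarithm of ticks per minute.
ticks = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
da = 60.0 + 2.0 * np.log2(ticks / 10.0)

rho_pearson = pearson(ticks, da)
rho_spearman = spearman(ticks, da)
```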

Figure 6 shows the relationship between the average number of ticks per minute and DA performance for each machine learning model, together with a fitted Generalized Additive Model (GAM). The stocks with the highest liquidity (AAPL, CSCO, and INTC) were excluded for better visualization. As can be seen, the more heavily traded the stock, the better the model predicts.

6 Conclusions

Traders collectively repeat the behavior of the traders that preceded them [27]. Those patterns can be learned by a DNN. The proposed strategy replicates the concept of predicting prices for short-time periods.

Feature selection is a very important step when building a machine learning model. Moreover, the frequency-domain information with temporal resolution produced by the DWT allows the network to identify more complex patterns in the time series.

In liquid stocks there are many ticks in a minute; therefore, the mean is more stable than in less liquid stocks. As a result, the final average price time series has less noise, and a machine learning technique can perform better for these kinds of stocks.

Within the deep learning arena, there are more models than the ones described here. A possible research opportunity would therefore be to evaluate the performance of the proposed model against other DL models such as Deep Belief Networks, Convolutional Networks and Deep Coding Networks, among others. Overall, Deep Learning techniques can learn market dynamics with reasonable precision and accuracy; for instance, using recurrent networks increased the model’s performance.

Another opportunity would be to explore higher resolution levels and other wavelet filter families such as Daubechies, Coiflets, Symlets, Discrete Meyer and Biorthogonal, among others, as well as other discrete transformations.