1 Introduction

There has been a growing interest in time series data mining due to the explosive increase in the size and volume of time series data in recent years (Ratanamahatana et al. 2010). Multivariate time series data containing a large number of features are becoming more and more common in applications such as biology, multimedia, medicine, and finance. For example, an Electroencephalogram (EEG) recorded from multiple electrodes placed on the scalp is used to examine the correlation of genetic predisposition to alcoholism. If the EEG device utilizes 64 electrodes and the sampling rate is 256 Hz (Yoon et al. 2005), the EEG data arrive at a rate of 983,040 values per minute across 64 features. Multivariate time series pose many challenges in storage, querying, prediction and efficiency. Feature selection, which removes irrelevant and/or redundant features/variables, provides a solution to many of these challenges.

Prediction is a primary task of time series data mining, which uses known historical values to forecast future values or trends. Feature selection is essential and crucial for accurate predictions (Tsai and Hsiao 2010). However, feature selection in time series data differs from feature selection in static data. The target value in the latter relates only to the current values of the features, while the target value in the former relates to the values of the features at previous time stamps as well as at the current time stamp.

For example, given 7 feature time series \(X_{1}\), \(X_{2}\), ..., \(X_{7}\) from time \(t_{1}\), \(t_{2}\), ..., \(t_{n}\), and a target time series \(Y\) from time \(t_{1}\), \(t_{2}\), ..., \(t_{i}\), our objective is to identify relevant features to predict the remaining values of \(Y\), as shown in Fig. 1. We need to select the features that are related to the target \(Y\) as well as the window sizes of these features, which indicate the effect of historical observations on the values of \(Y\). Therefore, feature selection in multivariate time series is a two-dimensional problem and contains two tasks: identifying the features and finding the window sizes of the features. Existing approaches to feature selection in time series either select relevant features while keeping the time window sizes invariant (or the same for all features), or select suitable window sizes while maintaining the same set of features (see Sect. 2 for detailed discussions).

Fig. 1

Time series \(X\) in blue contains 7 variables and \(Y\) in red is the target time series. A subset of \(X\) as well as the sliding window size \(w\) are both needed to predict the unknown values of \(Y\) (This is a schematic figure; in practice, \(w\) varies for each feature, as discussed in Sect. 3.1.)

The discovery of causal relationships is beneficial for good prediction and interpretation. In many prediction tasks, we need not only accurate predictive results but also a good interpretation of the prediction, i.e. which variables cause the change of the target. When past values of one time series have a significant influence on the future values of another time series, the relationship between the two is called “causality”. For accurate and interpretable predictions, it is necessary to identify causal relationships (Shibuya et al. 2009).

In this paper, we adopt causal discovery as a means for feature selection to improve the prediction accuracy of multivariate numerical time series data. We define feature selection in time series as a two-dimensional task, and show that two-dimensional selection produces better predictions than one-dimensional selection extended from normal feature selection methods. We make use of Granger causality (Granger 1969) to uncover the causal feature subset because it considers the influence of lagged observations. Granger causality is based on the idea that a cause should help predict future effects beyond what can be predicted solely from their own past values. Therefore, the selected feature subset is not only useful for accurate predictions but also beneficial for interpretation. We have conducted experiments on three data sets and a real-world case study, in comparison with existing feature selection methods and with features selected by domain experts. The experimental results show that our proposed feature selection method is effective.

This work makes the following contributions.

  • We make use of the Granger causality model for feature selection in multivariate time series data. In our proposed method, we not only consider causal influence in the feature selection but also consider the window of effective lagged values for each feature.

  • We use a real world case study to show that the predictions based on our selected features are better than or the same as predictions based on features selected by domain experts.

The remainder of this paper is organized as follows: Sect. 2 provides background work in feature selection, causal discovery for time series, and a description of Granger causality. Section 3 describes our method using causal discovery for feature selection. Section 4 presents experimental comparisons on several data sets and a real-world case study. Section 5 concludes the paper and outlines future work.

2 Background

In this section, we provide a brief review of feature selection methods and causal discovery approaches in time series. We also discuss the main concept of Granger causality.

2.1 Brief review of feature selection and causal discovery methods in time series

The dimensionality of time series is usually very large. For example, approximately one million EEG data points are collected from a single electrode in one hour. Thus dimensionality reduction is an important preprocessing step before a given time series data set is fed to a data mining algorithm (Chizi and Maimon 2010). Some approaches reduce the dimension by representing the high-dimensional data in a low-dimensional space, such as transformation (Ravi Kanth et al. 1998; Chan and Fu 1999) and aggregation (Keogh et al. 2001; Zhao and Zhang 2006). In contrast, feature selection approaches reduce the dimension by selecting a subset of the input time series variables based on some measure of feature relevance. In this paper, we are mainly interested in feature selection, and the following review focuses on feature selection.

Feature selection methods in time series commonly fall into filter and wrapper approaches (Kohavi and John 1997). The filter approach uses the data alone to decide which features should be kept. The wrapper approach, on the other hand, wraps the feature search around a learning algorithm and uses its performance to select the features.

Filter approaches for feature selection are typified by principal component analysis (PCA) based methods (Rocchi et al. 2004; Lu et al. 2007; Zhang et al. 2009). PCA-based feature selection normally involves two phases: (1) mapping the variables to a lower-dimensional feature space by transformation, and (2) selecting features by computation on the principal components. The main problem is that the interpretation of features is difficult after the transformation. Besides that, some other filter approaches have been proposed. A feature selection method based on Pearson correlation coefficients was introduced in Biesiada and Duch (2007), but it is used only for nominal features. In Zoubek et al. (2007), data-driven methods are used to select relevant features to perform accurate sleep/wake stage classification. In Han and Liu (2013), a filter method uses the mutual information matrix between variables as the features and ranks the variables according to class separability. The Fisher Criterion (FC) has also been used for multivariate time series feature selection in classification and regression (Yoon et al. 2005). Filter approaches are flexible and can be used with any classification or regression model. However, experience is needed to match a feature selection method to a model for the best classification/regression results.

Recursive feature elimination (RFE) is a typical wrapper method for feature selection and was introduced with Support Vector Machine (SVM) classifiers (SVM-RFE) (Guyon et al. 2002). It has been used for multivariate time series (Yang et al. 2005; Yoon and Shahabi 2006; Maldonado et al. 2011). The procedure of an RFE-based method can be briefly described as follows: (1) train a classifier using SVM, (2) calculate the ranks of all features by RFE, and (3) remove the feature with the lowest rank. This procedure is performed iteratively until the required number of features remains. The main disadvantages of RFE-based methods are that the number of remaining features needs to be specified and the algorithm is time consuming. Besides that, some other wrapper approaches have been proposed. The method in Huang and Wu (2008) utilizes a genetic algorithm to prune irrelevant and noisy features for forecasting. Neural network based feature selection methods were presented in Crone and Kourentzes (2010) and Wong and Versace (2012). These methods identify the feature subset according to a network's predictive performance, but modeling or training an appropriate network is challenging.

The above feature selection methods only consider selecting relevant variables with the same time window size for all variables. In Hido and Morimura (2012), the proposed method chooses the optimal time windows and time lags of each variable using least-angle regression for time series prediction. It considers the temporal effects but maintains the same set of features. Overall, the existing methods do not consider selecting relevant variables and choosing the suitable time windows for variables together. They are one-dimensional feature selection methods. Our proposed method is a two-dimensional feature selection method.

In many prediction tasks, both accuracy of predictions and the knowledge of variables causing the changes in the target variable are important. Causality (also referred to as causation) is commonly defined based on a widely accepted assumption that an effect is preceded by its cause (Haufe et al. 2010). A correlation does not imply causation. Causal discovery is a process to infer the causes of an effect from observational data. A number of approaches have been proposed for causal discovery in time series. Granger causality (Granger 1969) is a well-known method and some of its extensions (Arnold et al. 2007; Lozano et al. 2009; Chen et al. 2004; Eichler 2012) have gained successes across many domains (especially in economics) due to their simplicity, robustness and extendibility (Brovelli et al. 2004; Hiemstra and Jones 1994; Kim 2012; Qiu et al. 2012).

A causal feature selection (short for feature selection based on causal discovery) algorithm solves the feature selection problem by directly or indirectly inducing causal structure and by exploiting formal connections between causation and prediction. Some prior research has been done on the causal feature selection for normal data recently (Aliferis et al. 2010; Guyon et al. 2007; Cawley 2008), but little work has considered causal feature selection for time series.

2.2 Granger causality

Granger causality (Granger 1969), named after Nobel Prize laureate Clive Granger, has been widely used to identify causal interactions between continuous-valued time series. Granger causality is based on a statistical hypothesis test, and the notion is that a cause helps predict its effects in the future beyond what is possible with auto-regression alone. More specifically, a variable \(x\) is said to Granger cause \(y\) if the auto-regressive model for \(y\) in terms of past values of both \(x\) and \(y\) is significantly more accurate than the model based only on the past values of \(y\). For illustration, let \(x_{t}\) and \(y_{t}\) be two stationary time series, and let \(x_{t-k}\) and \(y_{t-k}\) denote their values \(k\) time steps in the past. The Granger causality test is performed by first conducting the following two regressions:

$$\begin{aligned} \widehat{y_{t}}_{1}&= \sum _{k=1}^{l}a_{k}y_{t-k}+\varepsilon _{t},\end{aligned}$$
(1)
$$\begin{aligned} \widehat{y_{t}}_{2}&= \sum _{k=1}^{l}a_{k}y_{t-k}+\sum _{k=1}^{w}b_{k}x_{t-k}+\eta _{t}, \end{aligned}$$
(2)

where \(\widehat{y_{t}}_{1}\) and \(\widehat{y_{t}}_{2}\) represent the fitted values of the first and second regression respectively, \(l\) and \(w\) are the maximum numbers of lagged observations of \(y_{t}\) and \(x_{t}\) respectively, \(a_{k}, b_{k} \in \mathbf {R}\) are regression coefficients determined by least squares, and \(\varepsilon _{t}, \eta _{t}\) are white noise terms (prediction errors). Note that \(w\) can in principle be infinite, but in practice, due to the finite length of the available data, \(w\) is assumed to be finite and much shorter than the length of the given time series; it is usually determined by a model selection method such as the Akaike information criterion (AIC) (Akaike 1974) or the Bayesian information criterion (BIC) (Schwarz 1978). An F test (or a similar test, such as one based on conditional variance) is then applied to obtain a \(p\) value indicating whether Eq. (2) yields a statistically significantly better regression model than Eq. (1). If so, we say \(x\) “Granger causes” \(y\). We write \(x \rightarrow y\) if and only if \(x\) “Granger causes” \(y\) but \(y\) does not “Granger cause” \(x\).
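For concreteness, the two regressions and the F test can be sketched as follows. This is a minimal illustration in Python/numpy assuming fixed lag orders \(l\) and \(w\) (in practice they are chosen by a criterion such as BIC, see Sect. 3.2); the function name granger_f_test and the toy data are our own and not part of the original paper.

```python
import numpy as np
from scipy import stats

def granger_f_test(x, y, l=2, w=2):
    """F test of H0: x does not Granger cause y (cf. Eqs. 1-2)."""
    p_lag = max(l, w)
    n = len(y) - p_lag                              # usable observations
    Y = y[p_lag:]
    # restricted design: intercept + lagged y (Eq. 1)
    Xr = np.column_stack([np.ones(n)] +
                         [y[p_lag - k: len(y) - k] for k in range(1, l + 1)])
    # unrestricted design: additionally lagged x (Eq. 2)
    Xu = np.column_stack([Xr] +
                         [x[p_lag - k: len(x) - k] for k in range(1, w + 1)])

    def rss(A):
        beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
        resid = Y - A @ beta
        return resid @ resid

    rss_r, rss_u = rss(Xr), rss(Xu)
    df_num, df_den = w, n - Xu.shape[1]
    F = ((rss_r - rss_u) / df_num) / (rss_u / df_den)
    return F, stats.f.sf(F, df_num, df_den)         # F statistic and p value

# toy example: y_t depends on x_{t-1}, so x should Granger cause y
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * np.roll(x, 1) + rng.normal(scale=0.1, size=500)
print(granger_f_test(x, y))
```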

3 Using causal discovery as a means for feature selection

In this section we first define a two-dimensional feature selection problem for multivariate time series prediction, and then propose a new feature selection method using Granger causality, and finally describe our algorithm.

3.1 Two-dimensional feature selection

As motivated in Fig. 1, we define feature selection in multivariate time series data as a two-dimensional problem, which includes the following two tasks.

  1. 1.

    Identify a variable subset that improves the prediction accuracy of the target time series over prediction using the whole set;

  2. 2.

    Find the effective window sizes of these variables, i.e. the past observations that influence the target value.

Some prior research has demonstrated that causality discovery plays a crucial role in feature selection (Aliferis et al. 2010; Guyon et al. 2007; Cawley 2008). Inspired by this, we explore how to use causality discovery to achieve the tasks.

In our setup, we deal with multidimensional time series sequences \({\mathbf X} = \{X^i_t: t\in T\}_{i=1, \ldots, d }\) and a numerical target time series \(Y = \{Y_t: t\in T\}\), where \(T\) is the set of all time stamps. The dimension \(d\) of the input features is assumed to be large. The main goal is stated as follows: we attempt to find a low-dimensional subsequence \({\mathbf X'} = \{X^j_t: t\in T'\}_{j=1, \ldots, m }\) with \(m\) features, where \(m\ll d\) and \(|T'| \ll |T|\), such that the prediction accuracy of \(Y\) using \(\mathbf X'\) is no worse than that using \(\mathbf X\).

Specifically, we assume a numerical target time series \(Y\) and multivariate time series \(\mathbf X =\{X_{t}^{i}\}_{t=1,2, \ldots ,n;\; i=1,2, \ldots ,d}\), where \(n\) is the length of the time series and \(d\) is the number of time series; e.g. \(x_{1}^{1}\) denotes the value of variable \(X^{1}\) at time point \(1\). \(\mathbf X '=\{X_{t-k}^{j}\}_{k=0,1, \ldots ,w_{j};\; j\in [1,2, \ldots ,m]}\) is the feature subset identified from \(\mathbf X \), where \(w_{j}\) denotes the sliding window size of past observations of \(X^{j}\) and \([1, 2, \ldots, m]\) is the set of indexes of the selected features. The result produced by our method is a two-dimensional matrix as follows:

$$\begin{aligned} \begin{pmatrix} x_{t}^{1} &{}\quad x_{t}^{2} &{}\quad \cdots &{}\quad x_{t}^{j} &{}\quad \cdots &{}\quad x_{t}^{m} \\ x_{t-1}^{1} &{}\quad x_{t-1}^{2} &{}\quad \cdots &{}\quad x_{t-1}^{j} &{}\quad \cdots &{}\quad x_{t-1}^{m} \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ x_{t-w_{j}}^{1} &{}\quad x_{t-w_{j}}^{2} &{}\quad \cdots &{}\quad x_{t-w_{j}}^{j} &{}\quad \cdots &{}\quad x_{t-w_{j}}^{m}\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ x_{t-w}^{1} &{}\quad x_{t-w}^{2} &{}\quad \cdots &{}\quad x_{t-w}^{j} &{}\quad \cdots &{}\quad x_{t-w}^{m} \end{pmatrix} \end{aligned}$$
(3)

where each column denotes one selected variable and its lagged values. Note that we assume all columns have the same maximum length \(w\) in Matrix (3). In practical computation, however, the window size \(w_{j}\) of each \(x^{j}\) varies. To handle the varying \(w_{j}\), we set the matrix elements to 0 for lags beyond the real window size \(w_{j}\) of a feature and up to \(w\).
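Below is a minimal sketch of how Matrix (3) could be assembled for one time point, assuming the selected feature indices and per-feature window sizes \(w_{j}\) are already known; the helper name build_lag_matrix and the toy data are our own illustration, not the paper's implementation.

```python
import numpy as np

def build_lag_matrix(X, selected, windows, t):
    """Assemble Matrix (3) for time point t.

    X        : array of shape (n, d); rows are time points, columns are variables
    selected : list of m selected column indices
    windows  : list of m window sizes w_j, one per selected variable
    t        : current (0-based) time index, must satisfy t >= max(windows)

    Each column j holds x_t^j, x_{t-1}^j, ..., x_{t-w_j}^j and is padded
    with zeros up to the maximum window size w, as described above.
    """
    w = max(windows)
    M = np.zeros((w + 1, len(selected)))
    for col, (j, wj) in enumerate(zip(selected, windows)):
        M[: wj + 1, col] = X[t - wj: t + 1, j][::-1]   # x_t^j down to x_{t-w_j}^j
    return M

# toy usage: 3 selected features with different window sizes
X = np.arange(50, dtype=float).reshape(10, 5)          # 10 time points, 5 variables
print(build_lag_matrix(X, selected=[0, 2, 4], windows=[2, 1, 3], t=5))
```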

In many real-world multidimensional time series applications, we do not have the historical values of the target time series except in the training data. For example, in our experimental data set, ultraviolet data are collected in real time using sensors, but the associated chlorine values have to be obtained from a laboratory. In the training data, both ultraviolet and chlorine data are collected, but in practice we wish to use ultraviolet data to derive real-time chlorine values. That is why our aim is to predict the unknown values of the target time series using the known values of the variable time series.

Now, for the training data, we know the historical values \(y_{t-w}\), \(y_{t-w+1}\), \(\ldots \), \(y_{t-1}\) of \(y_{t}\), and we know \(x_{t-w}^{i}\), \(x_{t-w+1}^{i}\), \(\ldots \), \(x_{t-1}^{i}\), \(x_{t}^{i}\) for all features. We will show that the prediction model using Matrix (3) is better than the model using all of \(X_{t}\) for predicting \(y_{t}\).

Firstly, for all features, we can obtain the fitted value \(\widehat{y_{t}}_{1}\) of \(y_{t}\) using all features \(X^{i}\) at time \(t\) with the following equation:

$$\begin{aligned} \widehat{y_{t}}_{1}=\sum _{i=1}^{d}a_{i}x_{t}^{i}+\varepsilon _{t}. \end{aligned}$$
(4)

Alternatively, we can also predict \(y_{t}\) using its historical values as follows:

$$\begin{aligned} \widehat{y_{t}}_{1}=\sum _{k=1}^{l}b_{k}y_{t-k}+\eta _{t}. \end{aligned}$$
(5)

The above gives two estimates of \(y_{t}\) from different perspectives. In most cases we may not have \(y_{t-k}\); in our previous example, we have sensor data but not chlorine data, so we need to use Eq. (4) to estimate the level of chlorine. As a starting point, we assume that the two estimates are equally good:

$$\begin{aligned} \sum _{i=1}^{d}a_{i}x_{t}^{i}+\varepsilon _{t}=\sum _{k=1}^{l}b_{k}y_{t-k}+\eta _{t}. \end{aligned}$$
(6)

We now use this equation to identify which features can yield a better regression. The following equation is obtained after both sides of Eq. (6) are multiplied by \(m\):

$$\begin{aligned} m\sum _{i=1}^{d}a_{i}x_{t}^{i}+m\varepsilon _{t}=m\left( \sum _{k=1}^{l}b_{k}y_{t-k}+\eta _{t}\right) . \end{aligned}$$
(7)

Adding the same expression \(\sum _{i\in [1,\ldots ,m]}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}\) on both sides, we obtain:

$$\begin{aligned}&m\sum _{i=1}^{d}a_{i}x_{t}^{i}+\sum _{i\in [1,\ldots ,m]}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+m\varepsilon _{t}\nonumber \\&\quad =m\sum _{k=1}^{l}b_{k}y_{t-k}+\sum _{i\in [1,\ldots ,m]}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+m\eta _{t}. \end{aligned}$$
(8)

Because the outer sum \(\sum _{i\in [1,\ldots ,m]}\) has exactly \(m\) terms and \(\sum _{k=1}^{l}b_{k}y_{t-k}+\eta _{t}\) does not depend on \(i\), Eq. (8) can be rewritten as:

$$\begin{aligned}&m\sum _{i=1}^{d}a_{i}x_{t}^{i}+\sum _{i\in [1,\ldots ,m]}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+m\varepsilon _{t}\nonumber \\&\quad =\sum _{i\in [1,\ldots ,m]}\left( \sum _{k=1}^{l}b_{k}y_{t-k}+\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+\eta _{t}\right) . \end{aligned}$$
(9)

Then, replacing the right-hand side of Eq. (9) on the basis of Eq. (2), we obtain the following equation:

$$\begin{aligned}&m\sum _{i=1}^{d}a_{i}x_{t}^{i}+\sum _{i\in [1,\ldots ,m]}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+m\varepsilon _{t}=\sum _{i\in [1,\ldots ,m]}\widehat{y_{t}}_{2}=m\widehat{y_{t}}_{2}. \end{aligned}$$
(10)

According to the definition of Granger causality, the prediction error of \(\sum _{i\in [1,\ldots ,m]}\left( \sum _{k=1}^{l}b_{k}y_{t-k}+\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}+\eta _{t}\right) \) in Eq. (9) is smaller than that of \(m\left( \sum _{k=1}^{l}b_{k}y_{t-k}+\eta _{t}\right) \) in Eq. (7). In other words, \(\widehat{y_{t}}_{2}\) yields a better regression model than \(\widehat{y_{t}}_{1}\). Note that there are \(d\) features in the leftmost expression \(m\sum _{i=1}^{d}a_{i}x_{t}^{i}\) of Eq. (10), but after pruning the non-causal features, the number of remaining causal features is reduced to \(m\). As a result, the final prediction formula extracted from the leftmost and rightmost sides of Eq. (10) can be expressed as follows:

$$\begin{aligned} \widehat{y_{t}}=\sum _{i\in [1,\ldots ,m]}\left( a_{i}x_{t}^{i}+\frac{1}{m}\sum _{j=1}^{w_{i}}c_{j}x_{t-j}^{i}\right) +\varepsilon _{t}, \end{aligned}$$
(11)

where the \(x^{i}\) \((i\in [1,\ldots ,m])\) are the causal features, \(x_{t-j}^{i}\) are their \(j\)-lagged values, \(w_{i}\) denotes the sliding window size of each \(x^{i}\), \(m\) is the size of the feature subset, \(a_{i}\) and \(c_{j}\) are regression coefficients, and \(\varepsilon _{t}\) is the prediction error. Therefore, the two tasks defined at the beginning of this section are integrated in Eq. (11).
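A least-squares sketch of fitting the prediction model of Eq. (11) is shown below, assuming the causal features and their window sizes have already been identified; the function name fit_eq11, the flattened design matrix with an added intercept, and the toy data are our own assumptions rather than the paper's implementation.

```python
import numpy as np

def fit_eq11(X, y, selected, windows):
    """Regress y_t on the current and lagged values of the selected
    causal features, each with its own window size (cf. Eq. 11)."""
    w = max(windows)
    rows, targets = [], []
    for t in range(w, len(y)):
        feats = []
        for j, wj in zip(selected, windows):
            feats.extend(X[t - wj: t + 1, j][::-1])    # x_t^j, ..., x_{t-w_j}^j
        rows.append(feats)
        targets.append(y[t])
    A = np.column_stack([np.ones(len(rows)), np.array(rows)])  # intercept + lagged features
    coef, *_ = np.linalg.lstsq(A, np.array(targets), rcond=None)
    return coef

# toy usage: two selected features with window sizes 2 and 1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 0.6 * X[:, 1] + 0.3 * np.roll(X[:, 3], 1) + rng.normal(scale=0.05, size=200)
print(fit_eq11(X, y, selected=[1, 3], windows=[2, 1]).round(2))
```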

3.2 Proposed process and algorithm

We use Granger causality for feature selection, and we make the same assumptions as Granger causality: (a) no confounding, i.e. the variable and the target are not simultaneously driven by hidden or common factors; (b) the error terms of the variables and the target are normally distributed.

The reasons for using Granger causality discovery are as follows: (a) Granger causality considers the influence of past lagged observations; (b) a time series is normally autocorrelated with its past values, which conforms to the concept of Granger causality; (c) the features selected by Granger causality are useful for prediction and interpretation. In general, Granger causality meets the requirements of the two-dimensional feature selection task by uncovering both the feature subset and the window sizes.

The process of our method consists of four steps: (1) transformation for stationarity; (2) building an auto-regression model and an augmented auto-regression model for the target; (3) testing the cause-effect relationship between a variable time series and the target time series; (4) repeating the above steps for the remaining variable time series to obtain the feature subset with the window sizes. Next, we describe the process in detail.

Step 1: Transformation for stationarity. We test whether a variable time series and the target time series are covariance stationary. Common methods include the Augmented Dickey–Fuller (ADF) test (Said and Dickey 1984) and the Phillips–Perron (PP) test (Phillips and Perron 1988). If a time series is non-stationary, a first (or higher) order difference transformation is applied to make it stationary. An error correction model (ECM) (Engle and Granger 1987) is used to avoid the spurious regression caused by the transformation.
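A sketch of Step 1 using the ADF test from statsmodels is shown below; the differencing helper, the 0.05 threshold and the maximum differencing order are our own assumptions, and the ECM step is omitted.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

def make_stationary(series, alpha=0.05, max_diff=2):
    """Difference a series until the ADF test rejects the unit-root
    null hypothesis (i.e. the series is considered stationary)."""
    s = np.asarray(series, dtype=float)
    d = 0
    while adfuller(s)[1] >= alpha and d < max_diff:   # second element is the p value
        s = np.diff(s)                                # difference once more
        d += 1
    return s, d

# toy usage: a random walk typically becomes stationary after one difference
rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=500))
stationary, order = make_stationary(walk)
print("differencing order:", order)
```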

Step 2: Building an auto-regression model and an augmented auto-regression model for the target series. A univariate auto-regression of the target series is given as follows.

$$\begin{aligned} y_{t}=a_{0}+a_{1}y_{t-1}+a_{2}y_{t-2}+\cdots +a_{l}y_{t-l}+\varepsilon _{t}. \end{aligned}$$

An appropriate number of lags (\(l\)) to be included is determined by an information criterion, such as BIC (Schwarz 1978), in our algorithm. Next, we obtain an augmented auto-regression by adding the lagged values of variable \(X^i\):

$$\begin{aligned} y_{t}=a_{0}+a_{1}y_{t-1}+a_{2}y_{t-2}+\cdots +a_{l}y_{t-l}+b_{1}x_{t-1}+\cdots +b_{w}x_{t-w}+\varepsilon _{t}. \end{aligned}$$

A lagged value of \(x\) is retained in the regression if it significantly improves the prediction of \(y_t\) by a \(t\) test. In the notation of the above regression, \(w\) is the lag length of \(x\) determined by BIC.

Step 3: Testing the cause-effect relationship between variable \(X^i\) and the target \(Y\). The null hypothesis that \(X^i\) does not Granger cause \(Y\) is accepted if and only if no lagged values of \(X^i\) are retained in the augmented regression. Next, we swap the roles and set \(Y\) as the variable and \(X^i\) as the target to test whether \(Y\) Granger causes \(X^i\). \(X^i\) is added to the feature subset only when \(X^i\) Granger causes \(Y\) but \(Y\) does not Granger cause \(X^i\) (by the definition of Granger causality, variables that are themselves caused by the target are not helpful for predicting it).

Step 4: Repeat the above steps for all remaining variable time series to obtain the feature subset and the window sizes.

Algorithm 1 The feature selection algorithm

The feature selection algorithm is described in Algorithm 1. The function Grangercause in Lines 5 and 7 is described in Steps 2–3 above. \(F_{X^{i} \rightarrow y}\) flags whether \(X^{i}\) Granger causes \(Y\), and \(F_{y \rightarrow X^{i}}\) indicates whether \(Y\) Granger causes \(X^{i}\). Note that in our feature selection task, \(X\) is the set of variable time series and \(Y\) is the target time series; we select the variables that cause the target. We set a maximum window size \(w\) to reduce the computation time of the regressions. To speed up the algorithm, we adopt an early-abandon strategy: we continue to the reverse test only if the first flag is yes, and skip to the next variable when it is no. Hence we do not need to run both Grangercause tests for every variable time series.
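A compact sketch of this selection loop is given below, with the Grangercause test approximated by statsmodels' grangercausalitytests and the window chosen by the smallest \(p\) value rather than BIC; the function names and toy data are our own, and Algorithm 1's exact pseudocode is not reproduced here.

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

def granger_causes(x, y, max_w=5, p=0.05):
    """Test whether x Granger causes y for lags 1..max_w.
    Returns (flag, window): flag is True if any lag is significant;
    the window is taken as the most significant lag here (the paper
    chooses it by BIC), which is a simplification."""
    data = np.column_stack([y, x])       # column 2 is tested as a cause of column 1
    res = grangercausalitytests(data, maxlag=max_w, verbose=False)
    pvals = {lag: res[lag][0]['ssr_ftest'][1] for lag in res}
    best = min(pvals, key=pvals.get)
    return pvals[best] < p, best

def select_features(X, y, max_w=5, p=0.05):
    """Early-abandon loop: keep X^i only if X^i Granger causes Y
    and Y does not Granger cause X^i."""
    subset = {}
    for i in range(X.shape[1]):
        causes_y, w_i = granger_causes(X[:, i], y, max_w, p)
        if not causes_y:
            continue                      # early abandon: skip the reverse test
        caused_by_y, _ = granger_causes(y, X[:, i], max_w, p)
        if not caused_by_y:
            subset[i] = w_i               # feature index -> window size
    return subset

# toy usage: only variable 0 drives y (with a lag of 2)
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = 0.7 * np.roll(X[:, 0], 2) + rng.normal(scale=0.1, size=400)
print(select_features(X, y))
```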

In our algorithm, we adopt the ADF test (Said and Dickey 1984) to test the covariance stationarity of a time series. If a series is covariance stationary, its mean and variance are constant over time.

We adopt BIC to determine the lag lengths \(l\) and \(w\) using the following formula (taking \(w\) as an example):

$$\begin{aligned} BIC(w)=n\cdot log(\sigma _{\epsilon }(w))+w\cdot log(n), \end{aligned}$$
(12)

where \(n\) is the number of data points in \(X\), and \(\sigma _{\epsilon }(w)\) is the regression error variance, calculated by:

$$\begin{aligned} \sigma _{\epsilon }(w)=\frac{1}{n}\sum _{t=1}^{n}(y_{t}-\widehat{y_{t}(w)})^2, \end{aligned}$$

where \(\widehat{y_{t}(w)}\) is the predicted value calculated by the regression function with lag length \(w\). We compute the BIC scores using Eq. (12) and select the \(w\) that yields the smallest BIC score as the lag length.
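A sketch of the BIC-based window search of Eq. (12) follows, simplified to a regression of \(y_t\) on lagged \(x\) only (the paper's augmented regression also includes lagged \(y\) terms); the function names and toy data are ours.

```python
import numpy as np

def bic_for_window(x, y, w):
    """BIC(w) = n*log(sigma_eps(w)) + w*log(n) for a regression of
    y_t on x_{t-1}, ..., x_{t-w} (cf. Eq. 12)."""
    n = len(y) - w
    Y = y[w:]
    A = np.column_stack([np.ones(n)] + [x[w - k: len(x) - k] for k in range(1, w + 1)])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    sigma = np.mean((Y - A @ beta) ** 2)          # regression error variance
    return n * np.log(sigma) + w * np.log(n)

def best_window(x, y, max_w=10):
    """Select the lag length w with the smallest BIC score."""
    scores = {w: bic_for_window(x, y, w) for w in range(1, max_w + 1)}
    return min(scores, key=scores.get)

# toy usage: y depends on x lagged by 3 steps
rng = np.random.default_rng(4)
x = rng.normal(size=600)
y = 0.9 * np.roll(x, 3) + rng.normal(scale=0.1, size=600)
print(best_window(x, y))
```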

The time complexity of Algorithm 1 is \(O(dw)\), where \(d\) is the number of variable time series before feature selection and \(w\) is the maximum search length of the window of lagged observations. The significance level \(p\) is used in the significance tests; the size of the selected feature subset varies with \(p\).

4 Experiments and results

In this section, we first introduce the data sets used in our experiments. Second, we present the evaluation methods and indicators. Then we examine and discuss the performance of the proposed approach and of some existing non-causal approaches. Finally, we present a real-world case study.

4.1 Data sets and experimental setup

In order to evaluate the effectiveness of our method, we conducted experiments on several data sets and studied a real-world case. The data sets are described in Table 1 and vary in size and dimension. We removed the rows with a majority of missing values; the remaining missing values are replaced by the column means. The Parkinsons Telemonitoring, Toms Hardware (a data set in Buzz in Social Media) and Communities and Crime data sets were downloaded from the UCI Machine Learning repository (Bache and Lichman 2013). Water Quality Monitoring is a real-world data set. For the first three data sets, we employ 10-fold cross validation for the evaluation. For the real-world data set, the split into training and test sets is described below.

Table 1 A description of data sets

For the real-world time series data set, an on-line Ultraviolet (UV)–Visible (Vis) spectrolyser (S::CAN\(^{TM}\)) and a chlorine residual analyser were set up at a drinking water treatment plant to collect data for investigating the relationship between chlorine residuals in water and UV–Vis absorption spectra. Absorption spectra were collected every 2 min, each with one chlorine residual reading (in mg/L). Each spectrum consists of 217 absorbance readings (in m\(^{-1}\)) across the range 200–740 nm at 2.5 nm intervals; the absorbance readings against wavelengths are denoted as \(A200\), \(A202.5\), \(A205\), ..., \(A737.5\), \(A740\). We divided the water quality monitoring data set into 4 training sets and 4 test sets, as shown in Table 2. In order to reduce the errors caused by temporal fluctuation, the training and test data sets are matched by the same quarters of 2 years.

Table 2 A split of water quality monitoring data

All experiments were performed on a machine running 32-bit Windows Operating System with 2.53 GHz processor and 4 GB RAM.

4.2 Evaluation methods and indicators

To compare, for prediction, the quality of the time series subsets selected by our approach and by the comparison approaches, as well as of all variable time series, we employ several common predictive models: linear regression, K-Nearest Neighbor (KNN) (Cleary and Trigg 1995) and Support Vector Machine (SVM) (Chang and Lin 2011). These models are exemplars of practical and scalable methods and are frequently used in many domains. Using multiple models in the evaluation removes possible biases of particular models towards particular data sets. The subsets of time series selected by different methods are not the same, so comparisons with the same parameters of a prediction model are not appropriate. For example, for KNN, the best \(k\) varies across subsets; \(k\) was searched in [1, 20] in the experiments. For SVM, different kernel types, such as the linear and radial basis function kernels with different parameters, suit different subsets; the cost \(c\) and gamma \(g\) were searched over the combinations of [1, 10, 20, 30] and [0, 0.05, 0.1]. We use the training data sets to search for the best parameters, and report the prediction performance on the test data sets using these parameters.
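An illustrative scikit-learn sketch of this parameter search is given below (not the paper's Matlab setup); the grids follow the text, except that 0 is replaced by 'scale' because scikit-learn's SVR requires a positive gamma, and the placeholder data stand in for the selected feature subsets.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# placeholder training data; in the paper these would be the selected
# feature subsets of each time series data set
rng = np.random.default_rng(5)
X_train = rng.normal(size=(200, 6))
y_train = X_train @ rng.normal(size=6) + rng.normal(scale=0.1, size=200)

# k searched in [1, 20] for KNN
knn_search = GridSearchCV(KNeighborsRegressor(),
                          {'n_neighbors': list(range(1, 21))}, cv=10)

# cost c in {1, 10, 20, 30}, gamma in {0, 0.05, 0.1} for SVM;
# 0 is replaced by 'scale' since scikit-learn requires gamma > 0
svm_search = GridSearchCV(SVR(),
                          {'kernel': ['linear', 'rbf'],
                           'C': [1, 10, 20, 30],
                           'gamma': ['scale', 0.05, 0.1]}, cv=10)

for search in (knn_search, svm_search):
    search.fit(X_train, y_train)
    print(search.best_params_)
```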

The performances of different selection methods are assessed by the following four criteria:

  1. 1.

    The number of features selected;

  2. 2.

    The proportion of features selected relative to the original number of features;

  3. 3.

    Time for feature selection in seconds;

  4. 4.

    Performances of different predictive models;

For the fourth criterion, we measure the performance of the predictive models using the following indices (the lower the values, the better the performance):

  • MAE: mean absolute error;

    $$\begin{aligned} MAE=\frac{1}{n}\sum _{t=1}^{n}|y_{t}-\widehat{y_{t}}|. \end{aligned}$$
  • RMSE: root mean square error;

    $$\begin{aligned} RMSE=\sqrt{\frac{1}{n}\sum _{t=1}^{n}(y_{t}-\widehat{y_{t}})^2}. \end{aligned}$$
  • RAE: relative absolute error;

    $$\begin{aligned} RAE=\frac{\sum _{t=1}^{n}|y_{t}-\widehat{y_{t}}|}{\sum _{t=1}^{n}|y_{t}-\overline{y}|}. \end{aligned}$$
  • RRSE: root relative square error;

    $$\begin{aligned} RRSE=\sqrt{\frac{\sum _{t=1}^{n}(y_{t}-\widehat{y_{t}})^2}{\sum _{t=1}^{n}(y_{t}-\overline{y})^2}}. \end{aligned}$$

where \(n\) represents the length of the time series, \(y_{t}\) is the actual value, \(\widehat{y_{t}}\) is the predicted value of \(y_{t}\), and \(\overline{y}\) is the average value of \(y\).
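These four indices can be computed directly; a small numpy sketch (the function name is ours):

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """MAE, RMSE, RAE and RRSE as defined above."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    abs_err = np.abs(y_true - y_pred)
    sq_err = (y_true - y_pred) ** 2
    abs_dev = np.abs(y_true - y_true.mean())
    sq_dev = (y_true - y_true.mean()) ** 2
    return {'MAE': abs_err.mean(),
            'RMSE': np.sqrt(sq_err.mean()),
            'RAE': abs_err.sum() / abs_dev.sum(),
            'RRSE': np.sqrt(sq_err.sum() / sq_dev.sum())}

print(prediction_errors([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```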

4.3 Comparison with existing methods and original features

We implemented the RFE-based (Yoon and Shahabi 2006), PCA-based (Lu et al. 2007) and FC (Bishop 1995) methods and our method in Matlab. The RFE and FC functions in the machine learning library The Spider (Weston et al. 2005) were employed in our implementation. We then compare their performance under the criteria listed above.

We use \(p\) value = 0.05 as the significance level in our method. The RFE-based and FC methods require the number of features to be selected as a parameter. For a fair comparison, we use the number of features selected by our method as the input number of selected features for the RFE-based and FC methods. For the PCA-based method, the confidence coefficient is set to 95 %.

First, the number and proportion of selected features and the computation time are shown in Table 3. The PCA-based method selects fewer features than our method except on the Parkinsons Telemonitoring data set, which has low dimensionality. The RFE-based method takes the longest time, especially when the number of instances is large. The other three methods take similar time.

Table 3 The number and proportion of selected features and the computation time on different data sets by different methods

We then compare the prediction performance of the selected feature subsets using linear regression, KNN and SVM. The average results of the three methods on the three data sets are summarized in Table 4, where the entry with the lowest value in every row is highlighted. “Causality without lag” means that we make predictions using the current values of the variables selected by our method, “Causality with lag” means that we make predictions using the current values and the lagged values of the variables selected by our method, and “Original” means that we make predictions using the current values of all variables. Our method achieves lower prediction errors than the other feature selection methods and the method without feature selection. Figure 2 visualizes the results in Table 4.

Fig. 2

Prediction performance comparisons of the features selected by different methods and the original features, using MAE, RMSE, RAE and RRSE

Table 4 Prediction performances using linear regression, KNN and SVM with features selected by different methods

Through the experiments and analyses, we have the following observations and conclusions.

  1. 1.

    Our proposed method consistently achieves the best performance among all methods compared, especially with linear regression and SVM. We note that using the features selected by our method without their lagged values does not improve the prediction performance. This confirms the validity of our two-dimensional feature selection problem, which identifies both the relevant variables and the effective lagged time windows of those variables.

  2. 2.

    SVM is the best predictive method among the three, both before and after feature selection. The radial basis function kernel performs best when all features are used, whereas the linear kernel performs best after feature selection by our method.

4.4 A real-world case study

In this section, we apply our method to the real-world case of Water Quality Monitoring. Since the aim of our work is to find a set of possibly causal time series for the target time series, one way to evaluate the validity of the selected variables is to compare them with the domain experts' choice and to compare the prediction performance of our method with that of the experts' choice.

In this data set, 217 UV absorbance wavelengths are collected by different scan sensors. According to the empirical knowledge of domain experts, four transformations of wavelengths, \(\frac{A255}{A202.5}\), \(\frac{dA290}{d\lambda }\), \(\frac{d^{2}A310}{d\lambda ^{2}}\) and \(\ln A350\), have been used to predict the chlorine residual (Byrne et al. 2011). In our experiment, they were labeled as four feature subsets Num 1–Num 4. We combined all wavelengths used in the above features, A202.5, A255, A290, A310 and A350, as a feature subset labeled Num 5.

The number and proportion of features selected by the existing methods and our method, and their computation time, are shown in Table 5. The results are consistent with the observations in the previous subsection.

Table 5 The number and proportion of selected features and the computation time on the real-world data set by different methods

The prediction performance of the feature subsets using linear regression, KNN and SVM is shown in Table 6 and visualized in Fig. 3, where the entry with the lowest error in every row is highlighted. The features are selected by the existing methods, by our method, without selection, and by experts, respectively. Our method obtains the lowest prediction errors in most cases.

Fig. 3

Prediction performance comparisons of features selected by existing methods, our method, without selection and by experts, using MAE, RMSE, RAE and RRSE

Table 6 Prediction performances using linear regression, KNN and SVM with features selected by different methods

The feature subset selected by our method with \(p\) value = 0.05 is \(\{A202.5-A257.5\} \cup \{A300-A357.5\}\), which contains 4 out of the 5 wavelengths used by domain experts. The features selected by our method lie mainly at the lower end of the wavelength range, which is also consistent with the experts' conclusion (Byrne et al. 2011). Our method is beneficial for finding potentially causal features for domain experts who have no prior knowledge of the data, because it significantly narrows down the search range.

In conclusion, the experiments show that our method achieves the goals of feature selection in multivariate time series: (a) improving the prediction accuracy of the model; (b) reducing the cost of obtaining and storing time series for prediction; (c) narrowing down the candidates for cause-effect relationships between the variables and the target.

5 Conclusions and future work

In this paper, we have presented a feature selection method for multivariate numerical time series using Granger causality discovery. We define feature selection in multivariate time series as a two-dimensional problem containing two tasks: selecting features and determining the window sizes of the effective lagged values of the features. Our proposed method solves this two-dimensional feature selection problem, and the selected features also suggest potential causal relationships between the variables and the target time series. Experimental results on three time series data sets and a real-world case study show that the prediction performance of three modeling methods on the features selected by the proposed method is better than on the features selected by three other existing feature selection methods. In the real-world case study, the features selected by our method contain four of the five features used by domain experts, and the prediction performance on our selected features is better than that on the feature sets used by the domain experts.

Feature selection based on non-linear extensions of Granger causality, and causal-discovery-based feature selection in multivariate nominal time series, will be two directions of our future work.