1 Introduction

Data mining is the process of extracting knowledge from massive data sets. In practice, much of this data takes the form of time series, so mining potentially useful knowledge from time series data has important theoretical and practical significance [1]. Time series data mining mainly includes classification, clustering, sequence pattern matching, similarity search and prediction. In many cases, similarity search is the important foundation of the others; it was proposed by Agrawal in 1993 [2] to find patterns similar to a given pattern in a time series. Similarity search can support useful decisions: for example, we can find similar sales patterns in the sales records of various commodities to devise a sales strategy [3], and we can forecast natural disasters by searching for precursors similar to those of past disasters [4].

Traditional time series similarity search first extracts features from the data to reduce its dimensionality [5,6,7], and then builds an index on the features [8, 9]. Finally, based on a similarity measure [10,11,12,13], the sequences similar to the query sequence are retrieved from the index structure and presented to the user. However, at the beginning of a search, users usually cannot describe the query sequence clearly, so a single search is often unable to find suitable similar sequences. A feedback-based strategy allows users to express their satisfaction or dissatisfaction with the query results and to run multiple queries, improving query accuracy and user satisfaction.

Feedback techniques were first applied in information retrieval, and Keogh introduced them into time series data mining [14]: the user assigns different weights to the result sequences, indicating their degree of similarity or dissimilarity, and returns them to the search system; a new query sequence for the next query is then generated from the feedback sequences by some strategy. Time series similarity search based on relevance feedback diversification was proposed in [15], where MMR [16] was applied to the feedback sequences to ensure the diversity of the query results, and a new query sequence was then generated from the feedback sequences.

The above time series similarity search methods combine the initial, positive and negative relevant sequences directly to create the new query sequence for the next query, which can easily change the query sequence too much and degrade the query results. To address the lack of query topics, Wang [17] proposed a negative relevance feedback method for text retrieval in which only the negative relevance feedback vectors and the query vector are used in the next query. Peltonen [18] proposed a negative-feedback information retrieval system based on machine learning, allowing users to give positive and negative relevance feedback directly through an interactive visual interface. Studies have shown that making full use of negative feedback sequences can improve retrieval accuracy.

In this paper, we propose a time series similarity search method that uses the positive and negative relevance feedback sequences separately. The contributions of this paper are summarized as follows:

  • A novel relevance-feedback-based similarity search method for time series is proposed. By combining a positive query and a negative query, the characteristics hidden in the positive and negative relevant sequences are more easily captured. Since the negative relevant sequences may belong to multiple categories, two strategies, the single negative relevance feedback model and the multi-negative relevance feedback model, can be used for the negative query. The proposed method improves the accuracy of similarity search.

  • The proposed method is validated by a set of dedicated experiments. The experimental data are taken from the UCR archive [22], and the experiments show that, compared with non-feedback and traditional feedback methods, the proposed method improves query accuracy on some data sets of the UCR archive.

The rest of the paper is organized as follows. In Sect. 2, we review the relevance feedback strategy based on vector model. In Sect. 3, we describe our proposed time series similarity search based on positive and negative query. Experiments are analyzed and discussed in Sect. 4. We conclude this paper and discuss our future work in Sect. 5.

2 Relevance Feedback Strategy Based on Vector Model

Relevance feedback is a query expansion technique that has become one of the key technologies for improving recall and precision in information retrieval. Relevance feedback based on the vector model is generally implemented by query modification, as proposed by Van Rijsbergen [19]. Among the query results, the user labels similar vectors (called positive relevance feedback vectors) or dissimilar vectors (called negative relevance feedback vectors); the query vector is then modified according to the original, positive and negative relevance feedback vectors and used to search for similar vectors again, until the user is satisfied with the results or abandons the search.

2.1 Rocchio Algorithm

The classical vector-model feedback algorithm was proposed by Rocchio in the SMART system [20]; the new query vector is generated according to Eq. 1:

$$ \overrightarrow{q_{new}} = \alpha \times \vec{q} + \beta \times \frac{1}{\left| D_{r} \right|}\sum\nolimits_{\overrightarrow{d_{j}} \in D_{r}} \overrightarrow{d_{j}} - \gamma \times \frac{1}{\left| D_{nr} \right|}\sum\nolimits_{\overrightarrow{d_{j}} \in D_{nr}} \overrightarrow{d_{j}} $$
(1)

where \( \overrightarrow{q_{new}} \) is the new query vector, \( \vec{q} \) is the original query vector, \( D_{r} \) is the collection of positive relevant document vectors, \( D_{nr} \) is the collection of negative relevant document vectors, and α, β and γ are weights controlling their impact on the new query vector. The optimal weight values can be determined from knowledge of the data, or experimentally.
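As an illustration, a minimal NumPy sketch of the update in Eq. 1 follows; the function name and default weight values are ours, chosen only for demonstration, and all vectors are assumed to share one fixed dimensionality.

```python
# Minimal sketch of the Rocchio update (Eq. 1); weights are illustrative defaults.
import numpy as np

def rocchio(q, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the modified query vector from positive (D_r) and negative (D_nr) feedback vectors."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(D_r) > 0:
        q_new = q_new + beta * np.mean(np.asarray(D_r, dtype=float), axis=0)
    if len(D_nr) > 0:
        q_new = q_new - gamma * np.mean(np.asarray(D_nr, dtype=float), axis=0)
    return q_new
```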

2.2 Ide dec-hi Algorithm

Ide [21] modified the Rocchio algorithm by replacing the average of the negative relevant vectors with the single highest-ranked negative relevant vector; the new query vector is generated according to Eq. 2.

$$ \overrightarrow{q_{new}} = \alpha \times \vec{q} + \beta \times \frac{1}{\left| D_{r} \right|}\sum\nolimits_{\overrightarrow{d_{j}} \in D_{r}} \overrightarrow{d_{j}} - \gamma \times \max\nolimits_{\overrightarrow{d_{j}} \in D_{nr}} \overrightarrow{d_{j}} $$
(2)

where \( \max\nolimits_{\overrightarrow{d_{j}} \in D_{nr}} \overrightarrow{d_{j}} \) denotes the highest-ranked negative relevant vector. Both feedback algorithms rely heavily on positive relevance feedback; if the query topics are so few that there are few or no positive relevance vectors, these methods are difficult to apply.
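A corresponding sketch of the Ide dec-hi update in Eq. 2 is given below; ranking the negative feedback vectors by dot-product similarity to the original query is an assumption made purely for illustration.

```python
# Sketch of the Ide dec-hi variant (Eq. 2): only one negative feedback vector is subtracted.
import numpy as np

def ide_dec_hi(q, D_r, D_nr, alpha=1.0, beta=0.75, gamma=0.25):
    q = np.asarray(q, dtype=float)
    q_new = alpha * q + beta * np.mean(np.asarray(D_r, dtype=float), axis=0)
    # the negative vector ranked highest against the original query (assumed ranking criterion)
    top_neg = max(D_nr, key=lambda d: float(np.dot(q, np.asarray(d, dtype=float))))
    return q_new - gamma * np.asarray(top_neg, dtype=float)
```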

2.3 Negative Relevance Feedback Algorithm

Wang [17] proposed a negative relevance feedback algorithm for this extreme case in document retrieval. The user labels the negative relevant vectors, and the search system issues a positive query (whose query vector is the original query vector) and a negative query (whose query vector is built from the negative relevant vectors); the final similarity is obtained by combining them according to Eq. 3.

$$ S_{combine}\left( Q, D \right) = S\left( Q, D \right) - \beta \times S\left( Q_{neg}, D \right) $$
(3)

where Q is the original query vector, D is the document vector to be checked, \( Q_{neg} \) is the negative query vector generated from the negative vectors by some strategy, S(Q, D) is the similarity score of the positive query, \( S(Q_{neg}, D) \) is the similarity score of the negative query, and \( S_{combine}(Q, D) \) is the final similarity score. β controls the impact of \( S(Q_{neg}, D) \) on the final similarity.
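The combination in Eq. 3 amounts to a weighted subtraction of two similarity scores, as in the following sketch; cosine similarity is used here only as a placeholder for S(·,·), which the algorithm does not fix.

```python
# Sketch of the score combination in Eq. 3.
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_score(Q, Q_neg, D, beta=0.5, S=cosine):
    # positive-query score minus the down-weighted negative-query score
    return S(Q, D) - beta * S(Q_neg, D)
```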

3 Time Series Similarity Search Based on Positive and Negative Query

In this paper, we introduce the negative feedback strategy into time series similarity search and propose a similarity search method based on positive and negative queries. Among the query results, the user labels the positive and negative relevant sequences; the search system then issues a positive query and a negative query and combines their results to obtain the final similar sequences. These steps are repeated until the user is satisfied with the query results or abandons the search. The method mainly includes query sequence modification, the positive and negative queries, and the combination of the positive and negative queries.

3.1 Query Sequence Modification

Time series similarity search can also be seen as a kind of information retrieval. In this paper, the new query sequence is generated according to Eq. 4, which is based on the Rocchio algorithm. The negative relevant sequences are used only in the negative query and do not participate in the modification, so \( \gamma \) is set to 0.

$$ q_{new} = \alpha \times q + \beta \times \frac{1}{\left| S_{PR} \right|}\sum\nolimits_{s_{j} \in S_{PR}} s_{j} $$
(4)

where \( q_{new} \) is the new query sequence for the next query, q is the original query sequence, and \( S_{PR} \) is the set of positive relevance sequences.
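A minimal sketch of this modification step, assuming equal-length numeric sequences (the function name and default weights are ours):

```python
# Sketch of Eq. 4: original query plus the mean of the positive feedback sequences (gamma = 0).
import numpy as np

def modify_query(q, S_PR, alpha=1.0, beta=0.5):
    q = np.asarray(q, dtype=float)
    if len(S_PR) == 0:
        return alpha * q
    return alpha * q + beta * np.mean(np.asarray(S_PR, dtype=float), axis=0)
```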

3.2 Positive and Negative Query

During the query, we perform a positive query and a negative query for every sequence s in the data set. The positive query computes the similarity between s and \( q_{new} \), and the negative query computes the similarity between s and the negative relevant sequences.

Positive Query.

For every sequence s in the data set, a similarity measure, such as one based on Euclidean distance, is used to compute the similarity between s and \( q_{new} \), denoted \( Sim(q_{new}, s) \).
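The paper fixes only the underlying distance; one common way to turn a Euclidean distance into a similarity score is sketched below (the 1/(1 + distance) conversion is an assumption used for illustration).

```python
# Euclidean distance between equal-length sequences and an assumed distance-to-similarity conversion.
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def sim(a, b):
    return 1.0 / (1.0 + euclidean(a, b))
```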

Negative Query.

The main problem to be solved in our algorithm is how to determine the similarity between s and the negative relevance sequences, denoted \( Sim(q_{neg}, s) \). We present two strategies for this problem: the single negative relevance feedback model and the multi-negative relevance feedback model.

Single Negative Relevance Feedback Model.

All the negative relevance sequences are combined into a single weighted average sequence \( q_{neg} \) by Eq. 5.

$$ q_{neg} = \sum\nolimits_{s_{j} \in S_{NR}} w_{j} \times s_{j} $$
(5)

where \( S_{NR} \) is the set of negative relevance sequences and \( \sum\nolimits_{j=1}^{|S_{NR}|} w_{j} = 1 \). This strategy treats all the negative relevance sequences as one category and uses the weighted mean sequence to represent the characteristics of most of them. The weight \( w_{j} \) can be set according to the dissimilarity, judged subjectively by the user, between the query sequence and each negative relevance sequence. For example, if the user ranks all the negative feedback sequences by dissimilarity as \( s_{1}, s_{2}, s_{3}, \ldots, s_{n} \), \( n = |S_{NR}| \), then we can set \( w_{j} = (n - j + 1) / \sum\nolimits_{j=1}^{n} j \). If the user cannot rank the negative feedback sequences, the single average sequence \( q_{neg} \) is generated by Eq. 6.

$$ q_{neg} = \frac{1}{{\left| {S_{NR} } \right|}}\sum\nolimits_{{s_{j} \in S_{NR} }} {s_{j} } $$
(6)

That is, \( w_{j} = \frac{1}{{\left| {S_{NR} } \right|}} \).
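A small sketch covering both the rank-weighted case (Eq. 5, with the most dissimilar sequence ranked first) and the unranked case (Eq. 6); the function name is ours.

```python
# Sketch of the single negative relevance feedback model.
import numpy as np

def single_negative_representative(S_NR, ranked=False):
    S_NR = np.asarray(S_NR, dtype=float)
    n = len(S_NR)
    if ranked:
        # w_j = (n - j + 1) / (1 + 2 + ... + n), j = 1..n, most dissimilar sequence first
        weights = np.arange(n, 0, -1) / (n * (n + 1) / 2)
    else:
        # Eq. 6: plain mean, w_j = 1 / |S_NR|
        weights = np.full(n, 1.0 / n)
    return np.average(S_NR, axis=0, weights=weights)
```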

Multi-negative Relevance Feedback Model.

The negative relevance sequences may belong to multiple categories, and the single negative feedback model cannot reflect the differences between negative relevance sequences of different categories. The multi-negative relevance feedback model clusters the negative relevance sequences into n clusters and merges the sequences in each cluster into a mean sequence \( q_{neg\_i} \), i = 1, 2, 3, …, n, n = |S_{PR}|, which represents the characteristics of that cluster independently. For every sequence s in the data set, a similarity measure, such as one based on Euclidean distance, is used to compute the similarity between s and \( q_{neg\_i} \), denoted \( Sim(q_{neg\_i}, s) \). The similarity between s and the negative feedback sequences is then obtained by Eq. 7.

$$ Sim\left( {q_{neg} ,s} \right) = F\left( {Sim\left( {q_{neg\_i} ,s} \right)} \right) $$
(7)

where F is the MAX, MIN or AVG operation. \( MAX(Sim(q_{neg\_i}, s)) \) takes the maximum similarity between s and the n representative sequences as the final similarity, \( MIN(Sim(q_{neg\_i}, s)) \) takes the minimum, and \( AVG(Sim(q_{neg\_i}, s)) \) takes the average.
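A sketch of this model is given below. k-means is used here as one possible clustering choice (the experiments in Sect. 4 instead group the negative sequences by their known category labels), and `sim` stands for any similarity function such as the one sketched earlier in this section.

```python
# Sketch of the multi-negative relevance feedback model (Eq. 7).
import numpy as np
from sklearn.cluster import KMeans

def multi_negative_similarity(S_NR, s, n_clusters, sim, combine="max"):
    S_NR = np.asarray(S_NR, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(S_NR)
    reps = [S_NR[labels == c].mean(axis=0) for c in range(n_clusters)]  # the q_neg_i
    sims = [sim(rep, s) for rep in reps]
    agg = {"max": max, "min": min, "avg": lambda x: sum(x) / len(x)}
    return agg[combine](sims)
```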

3.3 The Combination of Positive and Negative Query

In our method, the final query results should be similar to the query sequence but dissimilar to the negative relevant sequences, so the final similarity is computed according to Eq. 8.

$$ S_{c} \left( {q_{new} ,s} \right) = Sim\left( {q_{new} ,s} \right) - \lambda \times Sim\left( {q_{neg} ,s} \right) $$
(8)

\( Sim(q_{new}, s) \) measures the similarity between \( q_{new} \) and the candidate sequence s in the positive query, \( Sim(q_{neg}, s) \) measures the similarity between the negative relevance sequences and s in the negative query, and λ controls the weight of the dissimilarity term. \( S_{c}(q_{new}, s) \) represents the final similarity between \( q_{new} \) and s. To make \( S_{c}(q_{new}, s) \) high, either \( Sim(q_{new}, s) \) must be high or \( Sim(q_{neg}, s) \) must be low; that is, s must be similar to \( q_{new} \) or dissimilar to \( q_{neg} \).
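Putting the pieces together, the final ranking step can be sketched as follows; `sim` and `neg_sim` stand for the positive-query and negative-query similarity functions of Sect. 3.2, and the function name and default values are ours.

```python
# Sketch of the final ranking based on Eq. 8: score every candidate and return the top-k indices.
def rank_candidates(q_new, data, sim, neg_sim, lam=0.5, k=10):
    scored = [(sim(q_new, s) - lam * neg_sim(s), i) for i, s in enumerate(data)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [i for _, i in scored[:k]]
```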

4 Experiments and Analysis

4.1 Experimental Data

We use 17 data sets from the UCR archive [22] in our experiments; every sequence in each data set has a category label. The information on each data set used in our experiments is shown in Table 1.

Table 1. Data set

4.2 Method

Five experiments are performed on each data set: similarity search with no feedback, with the Rocchio algorithm, with the Ide dec-hi algorithm, with the single negative relevance feedback model (labeled SNRF), and with the multi-negative relevance feedback model (labeled MNRF). Euclidean distance is used as the distance measure.

Usually the initial query sequence expresses most of the user's intent, and query sequence modification only performs fine-tuning. To ensure that the modified query sequence does not deviate too far from the original sequence, α in Eq. 4 is fixed to 1, and the other parameters are incremented from 0 to 1 in steps of 0.1 to find their optimal values. In our experiments, the negative relevant sequences are clustered directly according to their category labels in the multi-negative relevance feedback model, and Eq. 6 is used to generate the representative sequence in the single negative relevance feedback model.

To validate the proposed single negative and multi-negative relevance feedback similarity search models, kNN search and leave-one-out cross-validation are performed. The P-R value, which integrates precision and recall, is used to assess query quality, where P denotes precision and R denotes recall. For each query sequence, kNN search is performed until r relevant sequences are found, where r is called the recall number. The P-R value of a data set for recall number \( r_{j} \), denoted \( PR_{r_{j}} \), is calculated by Eq. 9.

$$ PR_{rj} = \frac{1}{{N_{q} }}\sum\nolimits_{i = 1}^{{N_{q} }} {\frac{{r_{j} }}{{k_{i} }}} $$
(9)

where \( N_{q} \) is the number of query sequences and \( k_{i} \) is the number of nearest neighbors needed for the i-th query sequence to retrieve \( r_{j} \) relevant sequences. The P-R average value of a data set is the average of all \( PR_{r_{j}} \) for \( r_{j} = 1, 2, \ldots, 10 \). In the similarity search with feedback, two rounds of feedback are carried out for each query sequence, and the optimal P-R average value and the corresponding parameters are retained.
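A sketch of this evaluation procedure is given below, under the assumption that in the leave-one-out setting the query sequence itself is excluded from `data`; `dist` is the Euclidean distance.

```python
# Sketch of the P-R computation in Eq. 9: expand the neighborhood of each query
# until r_j sequences of the query's class are retrieved; k_i neighbors are examined.
import numpy as np

def pr_value(queries, labels_q, data, labels_d, r_j, dist):
    ratios = []
    for q, lq in zip(queries, labels_q):
        order = np.argsort([dist(q, d) for d in data])  # nearest first
        hits, k_i = 0, 0
        for idx in order:
            k_i += 1
            if labels_d[idx] == lq:
                hits += 1
                if hits == r_j:
                    break
        ratios.append(r_j / k_i)
    return float(np.mean(ratios))
```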

4.3 Discussion and Comparison

Figure 1 shows the precision-recall curves (for the optimal P-R average values shown in Tables 2, 3, 4, 5 and 6) of some data sets. It can be seen that the precision-recall performance of the feedback-based methods is clearly better than that of no feedback. When the recall number is small, the precision of SNRF and MNRF is similar to that of Rocchio and Ide dec-hi; as the recall number increases, the precision of SNRF and MNRF becomes better than that of Rocchio and Ide dec-hi.

Fig. 1. Precision-recall

Table 2. P-R of no feedback model
Table 3. P-R of single negative relevance feedback model
Table 4. P-R of multi-negative relevance feedback model
Table 5. P-R of feedback based on Rocchio
Table 6. P-R average value of feedback based on Ide dec-hi

Tables 2, 3, 4, 5 and 6 show the optimal P-R average values of the methods used in our experiments, together with the corresponding parameters. Figure 2 shows the P-R average values of the 17 data sets. It can be seen that, compared with no feedback, similarity search with feedback clearly improves accuracy. The results of the single negative feedback model and the multi-negative feedback model are better than those of Rocchio and Ide dec-hi except on Two_Patterns and 50words, where the results of all five methods are very close. There are some differences between the single negative and multi-negative feedback models: at the 0.05 significance level, the Wilcoxon signed-rank test gives p = 0.9758 (two-sided), indicating that over the 17 data sets the two models perform essentially the same.

Fig. 2. P-R average value of data sets

However, Table 7 and Fig. 3 show that when the number of categories is relatively small, the single negative relevance feedback model outperforms the multi-negative relevance feedback model, while with more categories the multi-negative relevance feedback model outperforms the single negative relevance feedback model. This may be because the single average sequence \( q_{neg} \) cannot fully represent the characteristics of the negative relevance sequences when they are more dispersed, that is, when the number of categories is larger.

Table 7. P-R difference value between SNRF and MNRF
Fig. 3. P-R difference value and number of categories of each data set

5 Conclusion

Based on relevance feedback, we proposed a time series similarity search method that fully utilizes the positive and negative relevance sequences. The positive relevant sequences and negative relevant sequences are used independently to search for similar sequences, and the similarity scores of the positive and negative queries are then combined to obtain the final similarity. Experiments show that the proposed method improves query accuracy on some data sets of the UCR archive. In future work, we plan to study more principled strategies for using the negative feedback sequences to conduct constrained queries.