
Ranking episodes using a partition model

Abstract

One of the biggest setbacks in traditional frequent pattern mining is that an overwhelming proportion of the discovered patterns are redundant. A prototypical example of such redundancy is a freerider pattern, where the pattern contains a true pattern together with some additional noise events. A technique for filtering freerider patterns that has proved efficient in ranking itemsets is to use a partition model: a pattern is divided into two subpatterns, and the observed support is compared to the expected support under the assumption that the two subpatterns occur independently. In this paper we develop a partition model for episodes, patterns discovered from sequential data. An episode is essentially a set of events, with possible restrictions on the order of the events. Unlike in itemset mining, computing the expected support of an episode requires surprisingly sophisticated methods. In order to construct the model, we partition the episode into two subepisodes. We then model how likely the events in each subepisode are to occur close to each other. If this probability is high, which is often the case if the subepisode has a high support, then we can expect that when one event from the subepisode occurs, the remaining events also occur close by. This approach increases the expected support of the episode, and if this increase explains the observed support, we can deem the episode uninteresting. We demonstrate in our experiments that using the partition model can effectively and efficiently reduce the redundancy in episodes.

Notes

  1. Or, more rarely, if the \(t_i\) are small.

  2. Consequently, \(F\) contains at least two vertices from \(W\).

  3. http://www.gutenberg.org/etext/15.

  4. http://jmlr.csail.mit.edu/.

  5. http://www.bartleby.com/124/.

  6. The implementation is available at http://research.ics.aalto.fi/dmg/.

References

  • Achar A, Laxman S, Viswanathan R, Sastry PS (2012) Discovering injective episodes with general partial orders. Data Min Knowl Discov 25(1):67–108

  • Gwadera R, Atallah MJ, Szpankowski W (2005a) Markov models for identification of significant episodes. In: Proceedings of the 5th SIAM international conference on data mining (SDM), Newport Beach, CA, pp 404–414

  • Gwadera R, Atallah MJ, Szpankowski W (2005b) Reliable detection of episodes in event sequences. Knowl Inf Syst 7(4):415–437

  • Kullback S (1959) Information theory and statistics. Wiley, New York

  • Lam HT, Mörchen F, Fradkin D, Calders T (2014) Mining compressing sequential patterns. Stat Anal Data Min 7(1):34–52

  • Low-Kam C, Raïssi C, Kaytoue M, Pei J (2013) Mining statistically significant sequential patterns. In: Proceedings of the 13th IEEE international conference on data mining (ICDM), Dallas, TX, pp 488–497

  • Laxman S, Sastry PS, Unnikrishnan KP (2007) A fast algorithm for finding frequent episodes in event streams. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 410–419

  • Mannila H, Meek C (2000) Global partial orders from sequential data. In: Proceedings of the 6th ACM international conference on knowledge discovery and data mining (SIGKDD), Boston, MA, pp 161–168

  • Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289

  • Pei J, Wang H, Liu J, Wang K, Wang J, Yu PS (2006) Discovering frequent closed partial orders from strings. IEEE Trans Knowl Data Eng 18(11):1467–1481

  • Tatti N (2014) Discovering episodes with compact minimal windows. Data Min Knowl Discov 28(4):1046–1077

  • Tatti N, Cule B (2011) Mining closed episodes with simultaneous events. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, pp 1172–1180

  • Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66

  • Tatti N, Vreeken J (2012) The long and the short of it: summarising event sequences with serial episodes. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD), Beijing, China, pp 462–470

  • Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data engineering (ICDE), Boston, MA, pp 79–90

  • Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33

  • Webb GI (2010) Self-sufficient itemsets: an approach to screening potentially interesting associations between items. ACM Trans Knowl Discov Data 4(1):3

Author information

Correspondence to Nikolaj Tatti.

Additional information

Responsible editors: João Gama, Indrė Žliobaitė, Alípio Jorge and Concha Bielza.

Appendices

Appendix 1: Proof of Proposition 2

In order to prove the proposition we need the following lemma, which we state without proof.

Lemma 2

Assume that a sequence \(S = s_1, \ldots , s_n\) covers an episode \(G\). If there is a source vertex \(v\) such that \(s_1 = { lab }\mathopen {}\left( v\right) \), then \(s_2, \ldots , s_n\) covers \(G \setminus v\). Otherwise, \(s_2, \ldots , s_n\) covers \(G\).

Proof

(of Proposition 2) We only need to prove the “only if” direction. Assume that \(S = s_1, \ldots , s_n\) covers an episode \(G\).

We will prove the proposition by induction over \(n\). Obviously, the result holds for \(n = 0\). Write \(S' = s_2, \ldots , s_n\).

If there is no source vertex in \(G\) with a label \(s_1\), then \( { gr }\mathopen {}\left( M, S\right) = { gr }\mathopen {}\left( M, S'\right) \). Now the lemma implies that \(S'\) covers \(G\) and the induction assumption implies that \( { gr }\mathopen {}\left( M, S'\right) = G\).

If there is a source vertex \(v\) in \(G\) such that \( { lab }\mathopen {}\left( v\right) = s_1\), then \( { gr }\mathopen {}\left( M, S\right) = { gr }\mathopen {}\left( M, S', G(v)\right) \). Note that \(G(v)\) and its descendants form exactly \( { M }\mathopen {}\left( H\right) \), where \(H = G \setminus v\). That is, \( { gr }\mathopen {}\left( M, S\right) = G\) if and only if \( { gr }\mathopen {}\left( M(H), S'\right) = H\). The lemma implies that \(S'\) covers \(H\), and the induction assumption implies that \( { gr }\mathopen {}\left( M(H), S'\right) = H\), which proves the proposition. \(\square \)

Appendix 2: Proof of Proposition 5

In order to prove the proposition we need the following auxiliary proposition, which essentially describes the properties of the log-likelihood of a log-linear model. The proof of this proposition can be found, for example, in Kullback (1959).

Proposition 8

Assume that we are given a set of \(k\) functions \({T_i}:{\Omega } \rightarrow {{\mathbb {R}}}\), mapping an object from some space \(\Omega \) to a real number. For \(k\) real numbers \(r_1, \ldots , r_k\), define

$$\begin{aligned} Z(r_1, \ldots , r_k) = \sum _{\omega \in \Omega } \exp {\sum _{i = 1}^k r_iT_i(\omega )}. \end{aligned}$$

Define a distribution

$$\begin{aligned} p(\omega ) = \frac{\exp {\sum _{i = 1}^k r_iT_i(\omega )}}{Z(r_1, \ldots , r_k)}. \end{aligned}$$

Let \(X\) be a multiset of events from \(\Omega \). Define

$$\begin{aligned} c(r_1, \ldots , r_k) = \sum _{\omega \in X} \log p(\omega ). \end{aligned}$$

Then \(c\) is a concave function of \(r_1, \ldots , r_k\). In fact

$$\begin{aligned} \frac{\partial c}{\partial r_i} = \sum _{\omega \in X} (T_i(\omega ) - {\text {E}}_{p}\mathopen {}\left[ T_i\right] ) \end{aligned}$$

and

$$\begin{aligned} \frac{\partial ^2 c}{\partial r_i \partial r_j} = {\left| X\right| }({\text {E}}_{p}\mathopen {}\left[ T_i\right] {\text {E}}_{p}\mathopen {}\left[ T_j\right] - {\text {E}}_{p}\mathopen {}\left[ T_i T_j\right] ). \end{aligned}$$
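
As a concrete illustration, consider the simplest case \(k = 1\) with \(\Omega = \{0, 1\}\) and \(T_1(\omega ) = \omega \). Then

$$\begin{aligned} Z(r_1) = 1 + e^{r_1}, \qquad p(1) = \frac{e^{r_1}}{1 + e^{r_1}}, \qquad c(r_1) = n_1 r_1 - {\left| X\right| }\log (1 + e^{r_1}), \end{aligned}$$

where \(n_1\) is the number of elements of \(X\) equal to \(1\). Differentiating gives

$$\begin{aligned} \frac{\partial c}{\partial r_1} = n_1 - {\left| X\right| }\,p(1) \quad \mathrm{and}\quad \frac{\partial ^2 c}{\partial r_1^2} = -{\left| X\right| }\,p(1)(1 - p(1)), \end{aligned}$$

which agrees with the formulas above, since \({\text {E}}_{p}\mathopen {}\left[ T_1\right] = {\text {E}}_{p}\mathopen {}\left[ T_1^2\right] = p(1)\).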

Proof

(of Proposition 5) In order to prove the result we need to rearrange the terms in \(\log p(\mathcal {S})\) based on the current state. In order to do that, let us define \(L_H\) to be the multiset of labels that occur in \(\mathcal {S}\) while the current state is \(H\), that is,

$$\begin{aligned} L_H = {\mathop {\mathop {\bigcup }\limits _{s_1, \ldots , s_n = S_i}}\limits _{i = 1, \ldots , m}} \left\{ s_j \mid { gr }\mathopen {}\left( M, s_1, \ldots , s_{j - 1}\right) = H\right\} . \end{aligned}$$

We can now rewrite the log-likelihood as

$$\begin{aligned} \log p(\mathcal {S}) = \sum _{H \in V(M)} \sum _{l \in L_H} \log p(l \mid H). \end{aligned}$$
(2)

All we need to show now is that each term can be expressed in the form given in Proposition 8. In order to do that, define for each label \(l\) an indicator function

$$\begin{aligned} T_l(s) = {\left\{ \begin{array}{ll} 1, &{} {\text {if }} l = s, \\ 0, &{} {\text {otherwise}}. \end{array}\right. } \end{aligned}$$

Also, define indicator functions that tell whether the transition is in \(C_1\) or \(C_2\); that is, define \(T_1\) and \(T_2\) as

$$\begin{aligned} T_i(s) = {\left\{ \begin{array}{ll} 1, &{} {\text {if there is }} (H, F) \in C_i {\text { with }} { lab }\mathopen {}\left( H, F\right) = s , \\ 0, &{} {\text {otherwise}}. \end{array}\right. } \end{aligned}$$

We now have

$$\begin{aligned} p(l \mid H) = \frac{1}{Z_H}\exp \bigg (t_1T_1(l) + t_2T_2(l) + \sum _{s \in \Sigma } u_sT_s(l)\bigg ). \end{aligned}$$
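
Here, since \(\sum _{s \in \Sigma } u_sT_s(l) = u_l\), the normalization constant is explicitly

$$\begin{aligned} Z_H = \sum _{l \in \Sigma } \exp \bigl (t_1T_1(l) + t_2T_2(l) + u_l\bigr ). \end{aligned}$$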

Since \(Z_H\) corresponds exactly to the normalization constant in Proposition 8, we have shown that

$$\begin{aligned} \sum _{l \in L_H} \log p(l \mid H) \end{aligned}$$

is a concave function. The sum of concave functions is concave, proving the result. \(\square \)

The proof also reveals how to compute the gradient and the Hessian matrix. These are needed when we optimize \(\log p(\mathcal {S})\). Since \(\log p(\mathcal {S})\) is a sum of functions of the form given in Proposition 8, the gradient and the Hessian matrix of \(\log p(\mathcal {S})\) can be obtained by summing the gradients and Hessian matrices of the individual terms of Eq. 2.

Appendix 3: Proof of Proposition 4

Proof

Let \(F = { gr }\mathopen {}\left( M, s_1, \ldots , s_{n - 1}\right) \).

If \(F = H\), then we remain in \(H\) only if \(s_n\) is not a label of an outgoing edge. The probability of this is equal to \(q\).

If \(F \ne H\), the only way to have \( { gr }\mathopen {}\left( M, S\right) = H\) is that \(F\) is a parent of \(H\) and the label of the edge connecting \(F\) to \(H\) is equal to \(s_n\). This gives us the result. \(\square \)

Appendix 4: Computing gradient descent

We use the Newton–Raphson method to fit the model. In order to do this, we need to compute the gradient and the Hessian matrix with respect to the parameters. This can be done efficiently, as described by the following proposition.

Proposition 9

Let \(G\) be an episode and let \(M = { M }\mathopen {}\left( G\right) \). Let \(H\) be a state in \(M\).

Let \(C_1\) and \(C_2\) be two disjoint subsets of \(E(M)\). Define \(L_i\) to be the set of labels such that \(l \in L_i\) if and only if there is an edge \((H, F) \in C_i\) labelled as \(l\). Let \(J\) be a matrix of size \(2 \times {\left| \Sigma \right| }\) such that \(J_{il} = 1\) if \(l \in L_i\), and \(0\) otherwise.

Let \(v\) be a vector of length \({\left| \Sigma \right| }\) with \(v_l = p(l \mid H)\), the probability of generating label \(l\) while in state \(H\). Define \(w = Jv\).

Let \(c\) be the count of how often we stay in \(H\),

$$\begin{aligned} c = {\left| \left\{ (i, j) \mid s = S_j, H = { gr }\mathopen {}\left( M, (s_1, \ldots , s_i)\right) \right\} \right| }. \end{aligned}$$

Let \(n\) be a vector of length \({\left| \Sigma \right| }\),

$$\begin{aligned} n_l = {\left| \left\{ (i, j) \mid s = S_j, H = { gr }\mathopen {}\left( M, (s_1, \ldots , s_{i - 1})\right) , s_i = l\right\} \right| }, \end{aligned}$$

that is, \(n_l\) is the number of symbols labelled \(l\) encountered in \(\mathcal {S}\) while the current state is \(H\).

Let \(V = {diag}\mathopen {}\left( v\right) \) and let \(W = {diag}\mathopen {}\left( w\right) \). Define

$$\begin{aligned} d_H = \left[ \begin{matrix} n - cv \\ J(n - cv) \\ \end{matrix}\right] \quad \mathrm{and}\quad B_H = c\left[ \begin{matrix} V - vv^T&{} VJ^T - vw^T \\ JV - wv^T &{} W - ww^T\\ \end{matrix}\right] . \end{aligned}$$

Then the gradient and the Hessian of \(\log p(\mathcal {S})\) with respect to \(\left\{ u_i\right\} \), \(t_1\), and \(t_2\) are equal to

$$\begin{aligned} d = \sum _{H \in V(M)} d_H \quad \mathrm{and}\quad B = \sum _{H \in V(M)} B_H. \end{aligned}$$

Proof

Proposition 8 and Proposition 5 imply that the gradient of \(q_H = \sum _{l \in L_H} \log p(l \mid H)\) is equal to \(d_H\) and the Hessian is equal to \(B_H\). Since \(\log p(\mathcal {S}) = \sum _{H \in V(M)} q_H\), the result follows. \(\square \)

In order to obtain additional speed-ups, first notice that Proposition 9 implies that we do not need to scan the original sequences every time. Instead, it is enough to compute the vector \(n\) and the scalar \(c\) for each state \(H\). Moreover, for a fixed episode \(G\), the rank does not depend on the probabilities of individual labels that do not occur in \(G\). In other words, we can treat all labels that do not occur in \(G\) as a single label. This reduces the length of the gradient and the size of the Hessian from \({\left| \Sigma \right| } + 2\) to at most \({\left| V(G)\right| } + 3\). These speed-ups make solving the model very fast in practice.
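
As a rough illustration of how Proposition 9 and these speed-ups translate into code, the following Python sketch performs a single Newton–Raphson update from precomputed per-state statistics. The input format and the name newton_step are hypothetical, not the author's implementation (which is available at http://research.ics.aalto.fi/dmg/, see footnote 6): each entry of stats holds, for one state \(H\), the scalar \(c\), the count vector \(n\), and the \(2 \times {\left| \Sigma \right| }\) indicator matrix \(J\) of Proposition 9.

```python
import numpy as np

def newton_step(stats, u, t1, t2):
    """One Newton-Raphson update for the parameters {u_s}, t_1, t_2.

    stats: list of per-state tuples (c, n, J) as in Proposition 9, where
           c is the number of symbols emitted while in state H,
           n is the per-label count vector (length |Sigma|), and
           J is the 2 x |Sigma| 0/1 matrix whose row i marks the labels
           of the edges leaving H that belong to C_{i+1}.
    """
    k = len(u)
    d = np.zeros(k + 2)              # accumulated gradient,  d = sum_H d_H
    B = np.zeros((k + 2, k + 2))     # accumulated matrix,    B = sum_H B_H
    for c, n, J in stats:
        # v_l = p(l | H) is a softmax of the log-linear weights.
        theta = u + t1 * J[0] + t2 * J[1]
        v = np.exp(theta - theta.max())
        v /= v.sum()
        w = J @ v
        d += np.concatenate([n - c * v, J @ (n - c * v)])
        V, W = np.diag(v), np.diag(w)
        B += c * np.block([[V - np.outer(v, v), V @ J.T - np.outer(v, w)],
                           [J @ V - np.outer(w, v), W - np.outer(w, w)]])
    # B is positive semidefinite and rank-deficient (the parameters are
    # over-specified), so use a pseudo-inverse; the resulting step is an
    # ascent direction for the concave log-likelihood log p(S).
    step = np.linalg.pinv(B) @ d
    return u + step[:k], t1 + step[k], t2 + step[k + 1]
```

In practice this step would be iterated, with damping or a line search if needed, until the gradient norm is small; the label-collapsing trick above keeps the dimension at most \({\left| V(G)\right| } + 3\), so each iteration is cheap.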

Cite this article

Tatti, N. Ranking episodes using a partition model. Data Min Knowl Disc 29, 1312–1342 (2015). https://doi.org/10.1007/s10618-015-0419-9
