
Ranking episodes using a partition model

Abstract

One of the biggest setbacks in traditional frequent pattern mining is that an overwhelming proportion of the discovered patterns are redundant. A prototypical example of such redundancy is a freerider pattern, where the pattern contains a true pattern together with some additional noise events. A technique for filtering freerider patterns that has proved efficient in ranking itemsets is to use a partition model: a pattern is divided into two subpatterns, and the observed support is compared to the expected support under the assumption that the two subpatterns occur independently. In this paper we develop a partition model for episodes, patterns discovered from sequential data. An episode is essentially a set of events, with possible restrictions on the order of the events. Unlike in itemset mining, computing the expected support of an episode requires surprisingly sophisticated methods. In order to construct the model, we partition the episode into two subepisodes. We then model how likely the events in each subepisode are to occur close to each other. If this probability is high, which is often the case if the subepisode has a high support, then we can expect that when one event from the subepisode occurs, the remaining events also occur close by. This approach increases the expected support of the episode, and if this increase explains the observed support, we can deem the episode uninteresting. We demonstrate in our experiments that using the partition model can effectively and efficiently reduce the redundancy in episodes.

Notes

  1. Or, more rarely, if the \(t_i\) are small.

  2. Consequently, \(F\) contains at least two vertices from \(W\).

  3. http://www.gutenberg.org/etext/15.

  4. http://jmlr.csail.mit.edu/.

  5. http://www.bartleby.com/124/.

  6. The implementation is available at http://research.ics.aalto.fi/dmg/.

References

  • Achar A, Laxman S, Viswanathan R, Sastry PS (2012) Discovering injective episodes with general partial orders. Data Min Knowl Discov 25(1):67–108

  • Gwadera R, Atallah MJ, Szpankowski W (2005a) Markov models for identification of significant episodes. In: Proceedings of the 5th SIAM international conference on data mining (SDM), Newport Beach, CA, pp 404–414

  • Gwadera R, Atallah MJ, Szpankowski W (2005b) Reliable detection of episodes in event sequences. Knowl Inf Syst 7(4):415–437

  • Kullback S (1959) Information theory and statistics. Wiley, New York

  • Lam HT, Mörchen F, Fradkin D, Calders T (2014) Mining compressing sequential patterns. Stat Anal Data Min 7(1):34–52

  • Low-Kam C, Raïssi C, Kaytoue M, Pei J (2013) Mining statistically significant sequential patterns. In: Proceedings of the 13th IEEE international conference on data mining (ICDM), Dallas, TX, pp 488–497

  • Laxman S, Sastry PS, Unnikrishnan KP (2007) A fast algorithm for finding frequent episodes in event streams. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 410–419

  • Mannila H, Meek C (2000) Global partial orders from sequential data. In: Proceedings of the 6th ACM international conference on knowledge discovery and data mining (SIGKDD), Boston, MA, pp 161–168

  • Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289

  • Pei J, Wang H, Liu J, Wang K, Wang J, Yu PS (2006) Discovering frequent closed partial orders from strings. IEEE Trans Knowl Data Eng 18(11):1467–1481

  • Tatti N (2014) Discovering episodes with compact minimal windows. Data Min Knowl Discov 28(4):1046–1077

  • Tatti N, Cule B (2011) Mining closed episodes with simultaneous events. In: Proceedings of the 17th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, pp 1172–1180

  • Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66

  • Tatti N, Vreeken J (2012) The long and the short of it: summarising event sequences with serial episodes. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD), Beijing, China, pp 462–470

  • Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data engineering (ICDE), Boston, MA, pp 79–90

  • Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33

  • Webb GI (2010) Self-sufficient itemsets: an approach to screening potentially interesting associations between items. ACM Trans Knowl Discov Data 4(1):3

Author information

Correspondence to Nikolaj Tatti.

Additional information

Responsible editors: João Gama, Indrė Žliobaitė, Alípio Jorge and Concha Bielza.

Appendices

Appendix 1: Proof of Proposition 2

In order to prove the proposition we need the following lemma, which we state without proof.

Lemma 2

Assume that a sequence \(S = s_1, \ldots , s_n\) covers an episode \(G\). If there is a source vertex \(v\) such that \(s_1 = { lab }\mathopen {}\left( v\right) \), then \(s_2, \ldots , s_n\) covers \(G \setminus v\). Otherwise, \(s_2, \ldots , s_n\) covers \(G\).

Proof

(of Proposition 2) We only need to prove the “only if” direction. Assume that \(S = s_1, \ldots , s_n\) covers an episode \(G\).

We will prove the proposition by induction over \(n\). Obviously, the result holds for \(n = 0\). Write \(S' = s_2, \ldots , s_n\).

If there is no source vertex in \(G\) with a label \(s_1\), then \( { gr }\mathopen {}\left( M, S\right) = { gr }\mathopen {}\left( M, S'\right) \). Now the lemma implies that \(S'\) covers \(G\) and the induction assumption implies that \( { gr }\mathopen {}\left( M, S'\right) = G\).

If there is a source vertex \(v\) in \(G\) such that \( { lab }\mathopen {}\left( v\right) = s_1\), then \( { gr }\mathopen {}\left( M, S\right) = { gr }\mathopen {}\left( M, S', G(v)\right) \). Note that \(G(v)\) and its descendants form exactly \( { M }\mathopen {}\left( H\right) \), where \(H = G \setminus v\). That is, \( { gr }\mathopen {}\left( M, S\right) = G\) if and only if \( { gr }\mathopen {}\left( M(H), S'\right) = H\). The lemma implies that \(S'\) covers \(H\), and the induction assumption implies that \( { gr }\mathopen {}\left( M(H), S'\right) = H\), which proves the proposition. \(\square \)

Appendix 2: Proof of Proposition 5

In order to prove the proposition we need the following auxiliary proposition, which essentially describes the properties of the log-likelihood of a log-linear model. The proof of this proposition can be found, for example, in Kullback (1959).

Proposition 8

Assume that we are given a set of \(k\) functions \({T_i}:{\Omega } \rightarrow {{\mathbb {R}}}\), mapping an object from some space \(\Omega \) to a real number. For \(k\) real numbers \(r_1, \ldots , r_k\), define

$$\begin{aligned} Z(r_1, \ldots , r_k) = \sum _{\omega \in \Omega } \exp {\sum _{i = 1}^k r_iT_i(\omega )}. \end{aligned}$$

Define a distribution

$$\begin{aligned} p(\omega ) = \frac{\exp {\sum _{i = 1}^k r_iT_i(\omega )}}{Z(r_1, \ldots , r_k)}. \end{aligned}$$

Let \(X\) be a multiset of events from \(\Omega \). Define

$$\begin{aligned} c(r_1, \ldots , r_k) = \sum _{\omega \in X} \log p(\omega ). \end{aligned}$$

Then \(c\) is a concave function of \(r_1, \ldots , r_k\). In fact

$$\begin{aligned} \frac{\partial c}{\partial r_i} = \sum _{\omega \in X} (T_i(\omega ) - {\text {E}}_{p}\mathopen {}\left[ T_i\right] ) \end{aligned}$$

and

$$\begin{aligned} \frac{\partial ^2 c}{\partial r_i \partial r_j} = {\left| X\right| }({\text {E}}_{p}\mathopen {}\left[ T_i\right] {\text {E}}_{p}\mathopen {}\left[ T_j\right] - {\text {E}}_{p}\mathopen {}\left[ T_i T_j\right] ). \end{aligned}$$
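
As a concrete illustration, consider the simplest case \(k = 1\) with \(\Omega = \{0, 1\}\) and \(T_1(\omega ) = \omega \). Then

$$\begin{aligned} Z(r_1) = 1 + e^{r_1}, \qquad p(1) = \frac{e^{r_1}}{1 + e^{r_1}}, \qquad c(r_1) = n_1 r_1 - {\left| X\right| }\log (1 + e^{r_1}), \end{aligned}$$

where \(n_1\) is the number of elements of \(X\) equal to \(1\). Differentiating gives

$$\begin{aligned} \frac{\partial c}{\partial r_1} = n_1 - {\left| X\right| }\,p(1) \quad \mathrm{and}\quad \frac{\partial ^2 c}{\partial r_1^2} = -{\left| X\right| }\,p(1)(1 - p(1)), \end{aligned}$$

which agrees with the formulas above, since \({\text {E}}_{p}\mathopen {}\left[ T_1\right] = {\text {E}}_{p}\mathopen {}\left[ T_1^2\right] = p(1)\).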

Proof

(of Proposition 5) In order to prove the result we need to rearrange the terms in \(\log p(\mathcal {S})\) based on the current state. In order to do that, let us define \(L_H\) to be the multiset of labels that occur in \(\mathcal {S}\) while the current state is \(H\), that is,

$$\begin{aligned} L_H = {\mathop {\mathop {\bigcup }\limits _{s_1, \ldots , s_n = S_i}}\limits _{i = 1, \ldots , m}} \left\{ s_j \mid { gr }\mathopen {}\left( M, s_1, \ldots , s_{j - 1}\right) = H\right\} . \end{aligned}$$

We can now rewrite the log-likelihood as

$$\begin{aligned} \log p(\mathcal {S}) = \sum _{H \in V(M)} \sum _{l \in L_H} \log p(l \mid H). \end{aligned}$$
(2)

All we need to show now is that each term can be expressed in the form given in Proposition 8. In order to do that, define for each label \(l\) an indicator function

$$\begin{aligned} T_l(s) = {\left\{ \begin{array}{ll} 1, &{} {\text {if }} l = s, \\ 0, &{} {\text {otherwise}}. \end{array}\right. } \end{aligned}$$

Also, define indicator functions that tell whether the transition is in \(C_1\) or \(C_2\); that is, define \(T_1\) and \(T_2\) as

$$\begin{aligned} T_i(s) = {\left\{ \begin{array}{ll} 1, &{} {\text {if there is }} (H, F) \in C_i {\text { with }} { lab }\mathopen {}\left( H, F\right) = s , \\ 0, &{} {\text {otherwise}}. \end{array}\right. } \end{aligned}$$

We now have

$$\begin{aligned} p(l \mid H) = \frac{1}{Z_H}\exp \bigg (t_1T_1(l) + t_2T_2(l) + \sum _{s \in \Sigma } u_sT_s(l)\bigg ). \end{aligned}$$
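
Here, since \(\sum _{s \in \Sigma } u_sT_s(l) = u_l\), the normalization constant is explicitly

$$\begin{aligned} Z_H = \sum _{l \in \Sigma } \exp \bigl (t_1T_1(l) + t_2T_2(l) + u_l\bigr ). \end{aligned}$$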

Since \(Z_H\) corresponds exactly to the normalization constant in Proposition 8, we have shown that

$$\begin{aligned} \sum _{l \in L_H} \log p(l \mid H) \end{aligned}$$

is a concave function. The sum of concave functions is concave, proving the result. \(\square \)

The proof also reveals how to compute the gradient and the Hessian matrix. These are needed when we optimize \(\log p(\mathcal {S})\). Since \(\log p(\mathcal {S})\) is a sum of functions of the form given in Proposition 8, the gradient and the Hessian matrix of \(\log p(\mathcal {S})\) can be obtained by summing the gradients and Hessian matrices of the individual terms of Eq. 2.

Appendix 3: Proof of Proposition 4

Proof

Let \(F = { gr }\mathopen {}\left( M, s_1, \ldots , s_{n - 1}\right) \).

If \(F = H\), then we remain in \(H\) only if \(s_n\) is not a label of an outgoing edge. The probability of this is equal to \(q\).

If \(F \ne H\), the only way to have \( { gr }\mathopen {}\left( M, S\right) = H\) is that \(F\) is a parent of \(H\) and the label of the edge connecting \(F\) to \(H\) is equal to \(s_n\). This gives us the result. \(\square \)

Appendix 4: Computing gradient descent

We use the Newton–Raphson method to fit the model. In order to do this, we need to compute the gradient and the Hessian matrix with respect to the parameters. This can be done efficiently, as described by the following proposition.

Proposition 9

Let \(G\) be an episode and let \(M = { M }\mathopen {}\left( G\right) \). Let \(H\) be a state in \(M\).

Let \(C_1\) and \(C_2\) be two disjoint subsets of \(E(M)\). Define \(L_i\) to be the set of labels such that \(l \in L_i\) if and only if there is an edge \((H, F) \in C_i\) labelled as \(l\). Let \(J\) be a matrix of size \(2 \times {\left| \Sigma \right| }\) such that \(J_{il} = 1\) if \(l \in L_i\), and \(0\) otherwise.

Let \(v\) be a vector of length \({\left| \Sigma \right| }\) with \(v_l = p(l \mid H)\), the probability of generating label \(l\) while in state \(H\). Define \(w = Jv\).

Let \(c\) be the count of how often we stay in \(H\),

$$\begin{aligned} c = {\left| \left\{ (i, j) \mid s = S_j, H = { gr }\mathopen {}\left( M, (s_1, \ldots , s_i)\right) \right\} \right| }. \end{aligned}$$

Let \(n\) be a vector of length \({\left| \Sigma \right| }\),

$$\begin{aligned} n_l = {\left| \left\{ (i, j) \mid s = S_j, H = { gr }\mathopen {}\left( M, (s_1, \ldots , s_{i - 1})\right) , s_i = l\right\} \right| }, \end{aligned}$$

that is, \(n_l\) is the number of symbols labelled \(l\) encountered in \(\mathcal {S}\) while the current state is \(H\).

Let \(V = {diag}\mathopen {}\left( v\right) \) and let \(W = {diag}\mathopen {}\left( w\right) \). Define

$$\begin{aligned} d_H = \left[ \begin{matrix} n - cv \\ J(n - cv) \\ \end{matrix}\right] \quad \mathrm{and}\quad B_H = c\left[ \begin{matrix} V - vv^T&{} VJ^T - vw^T \\ JV - wv^T &{} W - ww^T\\ \end{matrix}\right] . \end{aligned}$$

Then the gradient and the Hessian of \(\log p(\mathcal {S})\) with respect to \(\left\{ u_i\right\} \), \(t_1\), and \(t_2\) are equal to

$$\begin{aligned} d = \sum _{H \in V(M)} d_H \quad \mathrm{and}\quad B = \sum _{H \in V(M)} B_H. \end{aligned}$$

Proof

Proposition 8 and Proposition 5 imply that the gradient of \(q_H = \sum _{l \in L_H} \log p(l \mid H)\) is equal to \(d_H\) and the Hessian is equal to \(B_H\). Since \(\log p(\mathcal {S}) = \sum _{H \in V(M)} q_H\), the result follows. \(\square \)

In order to obtain additional speed-ups, first notice that Proposition 9 implies that we do not need to scan the original sequences every time. Instead, it is enough to compute the vector \(n\) and the scalar \(c\) for each state \(H\). Moreover, for a fixed episode \(G\), the rank does not depend on the probabilities of individual labels that do not occur in \(G\). In other words, we can treat all labels that do not occur in \(G\) as a single label. This reduces the length of the gradient and the size of the Hessian from \({\left| \Sigma \right| } + 2\) to at most \({\left| V(G)\right| } + 3\). These speed-ups make solving the model very fast in practice.
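
As a rough illustration of how Proposition 9 and these speed-ups translate into code, the following Python sketch performs a single Newton–Raphson update from precomputed per-state statistics. The input format and the name newton_step are hypothetical, not the author's implementation (which is available at http://research.ics.aalto.fi/dmg/, see footnote 6): each entry of stats holds, for one state \(H\), the scalar \(c\), the count vector \(n\), and the \(2 \times {\left| \Sigma \right| }\) indicator matrix \(J\) of Proposition 9.

```python
import numpy as np

def newton_step(stats, u, t1, t2):
    """One Newton-Raphson update for the parameters {u_s}, t_1, t_2.

    stats: list of per-state tuples (c, n, J) as in Proposition 9, where
           c is the number of symbols emitted while in state H,
           n is the per-label count vector (length |Sigma|), and
           J is the 2 x |Sigma| 0/1 matrix whose row i marks the labels
           of the edges leaving H that belong to C_{i+1}.
    """
    k = len(u)
    d = np.zeros(k + 2)              # accumulated gradient,  d = sum_H d_H
    B = np.zeros((k + 2, k + 2))     # accumulated matrix,    B = sum_H B_H
    for c, n, J in stats:
        # v_l = p(l | H) is a softmax of the log-linear weights.
        theta = u + t1 * J[0] + t2 * J[1]
        v = np.exp(theta - theta.max())
        v /= v.sum()
        w = J @ v
        d += np.concatenate([n - c * v, J @ (n - c * v)])
        V, W = np.diag(v), np.diag(w)
        B += c * np.block([[V - np.outer(v, v), V @ J.T - np.outer(v, w)],
                           [J @ V - np.outer(w, v), W - np.outer(w, w)]])
    # B is positive semidefinite and rank-deficient (the parameters are
    # over-specified), so use a pseudo-inverse; the resulting step is an
    # ascent direction for the concave log-likelihood log p(S).
    step = np.linalg.pinv(B) @ d
    return u + step[:k], t1 + step[k], t2 + step[k + 1]
```

In practice this step would be iterated, with damping or a line search if needed, until the gradient norm is small; the label-collapsing trick above keeps the dimension at most \({\left| V(G)\right| } + 3\), so each iteration is cheap.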

Cite this article

Tatti, N. Ranking episodes using a partition model. Data Min Knowl Disc 29, 1312–1342 (2015). https://doi.org/10.1007/s10618-015-0419-9
