NegPSpan: efficient extraction of negative sequential patterns with embedding constraints

Abstract

Sequential pattern mining is concerned with the extraction of frequent or recurrent behaviors, modeled as subsequences, from a sequence dataset. Such patterns inform about which events are frequently observed in sequences, i.e. events that really happen. Sometimes, knowing that some specific event does not happen is more informative than extracting observed events. Negative sequential patterns (NSPs) capture recurrent behaviors by patterns having the form of sequences mentioning both observed events and absence of events. Few approaches have been proposed to mine such NSPs. In addition, the syntax and semantics of NSPs differ in the different methods which makes it difficult to compare them. This article provides a unified framework for the formulation of the syntax and the semantics of NSPs. Then, we introduce a new algorithm, NegPSpan, that extracts NSPs using a prefix-based depth-first scheme, enabling maxgap constraints that other approaches do not take into account. The formal framework highlights the differences between the proposed approach and methods from the literature, especially against the state of the art approach eNSP. Intensive experiments on synthetic and real datasets show that NegPSpan can extract meaningful NSPs and that it can process bigger datasets than eNSP thanks to significantly lower memory requirements and better computation times.

Notes

  1. Called the maximal positive subsequence in PNSP (Hsueh et al. 2008) and NegGSP (Zheng et al. 2009) or the positive element id-set in eNSP.

  2. Actually, though not clearly stated, it seems that the negative elements of NegGSP patterns consist of items rather than itemsets. In this case, total and partial inclusion are equivalent (see Proposition 6).

  3. Code, data generator and synthetic benchmark datasets can be tested online and downloaded here: http://people.irisa.fr/Thomas.Guyet/negativepatterns/.

  4. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php.

  5. The Instacart dataset is available on Kaggle: https://www.kaggle.com/c/instacart-market-basket-analysis.

  6. ATC: Anatomical Therapeutic Chemical Classification System is a drug classification system that classifies the active ingredients of drugs.

  7. Note that we proved that this path is actually unique.

References

  1. Bosc G, Boulicaut JF, Raïssi C, Kaytoue M (2018) Anytime discovery of a diverse set of patterns with monte carlo tree search. Data Min Knowl Discov 32(3):604–650

  2. Cao L, Yu PS, Kumar V (2015) Nonoccurring behavior analytics: a new area. Intell Syst 30(6):4–11

  3. Cao L, Dong X, Zheng Z (2016) e-NSP: efficient negative sequential pattern mining. Artif Intell 235:156–182

  4. Dauxais Y, Guyet T, Gross-Amblard D, Happe A (2017) Discriminant chronicles mining—application to care pathways analytics. In: Proceedings of 16th conference on artificial intelligence in medicine (AIME), pp 234–244

  5. Dong X, Gong Y, Cao L (2018a) e-RNSP: an efficient method for mining repetition negative sequential patterns. IEEE Trans Cybern. https://doi.org/10.1109/TCYB.2018.2869907

  6. Dong X, Gong Y, Cao L (2018b) F-NSP\(+\): a fast negative sequential patterns mining method with self-adaptive data storage. Pattern Recognit 84:13–27

  7. Giannotti F, Nanni M, Pedreschi D (2006) Efficient mining of temporally annotated sequences. In: Proceedings of the SIAM international conference on data mining, pp 348–359

  8. Gong Y, Xu T, Dong X, Lv G (2017) e-NSPFI: efficient mining negative sequential pattern from both frequent and infrequent positive sequential patterns. Int J Pattern Recognit Artif Intell 31(02):1750002

  9. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC (2000) FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the sixth international conference on knowledge discovery and data mining (SIGKDD), pp 355–359

  10. Hsueh SC, Lin MY, Chen CL (2008) Mining negative sequential patterns for e-commerce recommendations. In: Proceedings of Asia-Pacific services computing conference, pp 1213–1218

  11. Kamepalli S, Sekhara R, Kurra R (2014) Frequent negative sequential patterns - a survey. Int J Comput Eng Technol 5(3):115–121

  12. Lin JCW, Fournier-Viger P, Gan W (2016) FHN: an efficient algorithm for mining high-utility itemsets with negative unit profits. Knowl Based Syst 111:283–298

  13. Liu C, Dong X, Li C, Li Y (2015) SAPNSP: select actionable positive and negative sequential patterns based on a contribution metric. In: Proceedings of the 12th international conference on fuzzy systems and knowledge discovery, pp 811–815

  14. Mallick B, Garg D, Grover PS (2013) CRM customer value based on constrained sequential pattern mining. Int J Comput Appl 64(9):21–29

  15. Mooney CH, Roddick JF (2013) Sequential pattern mining—approaches and algorithms. ACM Comput Surv 45(2):1–39

  16. Moulis G, Lapeyre-Mestre M, Palmaro A, Pugnet G, Montastruc JL, Sailler L (2015) French health insurance databases: what interest for medical research? La Revue de Médecine Interne 36:411–417

  17. Negrevergne B, Guns T (2015) Constraint-based sequence mining using constraint programming. In: Michel L (ed) Proceedings of the conference on integration of AI and OR techniques in constraint programming (CPAIOR). Springer International Publishing, Cham, pp 288–305

  18. Ngai E, Xiu L, Chau D (2009) Application of data mining techniques in customer relationship management: a literature review and classification. Expert Syst Appl 36(2, Part 2):2592–2602

  19. Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440

  20. Pei J, Han J, Wang W (2007) Constraint-based sequential pattern mining: the pattern-growth methods. J Intell Inf Syst 28(2):133–160

  21. Polard E, Nowak E, Happe A, Biraben A, Oger E (2015) Brand name to generic substitution of antiepileptic drugs does not lead to seizure-related hospitalization: a population-based case-crossover study. Pharmacoepidemiol Drug Saf 24:1161–1169

  22. Qiu P, Zhao L, Dong X (2017) NegI-NSP: negative sequential pattern mining based on loose constraints. In: Proceedings of the 43rd annual conference of the IEEE industrial electronics society (IECON), pp 3419–3425

  23. Srikant R, Agrawal R (1996) Mining sequential patterns: Generalizations and performance improvements. In: Proceedings of the international conference on extending database technology (EDBT). Springer, pp 1–17

  24. Wang W, Cao L (2019) Negative sequences analysis: a review. ACM Comput Surv 52(2):1–39

  25. Xu T, Dong X, Xu J, Dong X (2017a) Mining high utility sequential patterns with negative item values. Int J Pattern Recognit Artif Intell 31(10):1750035

  26. Xu T, Dong X, Xu J, Gong Y (2017b) E-msNSP: efficient negative sequential patterns mining based on multiple minimum supports. Int J Pattern Recognit Artif Intell 31(02):1750003

  27. Xu T, Li T, Dong X (2018) Efficient high utility negative sequential patterns mining in smart campus. IEEE Access 6:23839–23847

  28. Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Mach Learn 42(1/2):31–60

  29. Zheng Z, Zhao Y, Zuo Z, Cao L (2009) Negative-GSP: an efficient method for mining negative sequential patterns. In: Proceedings of the Australasian data mining conference, pp 63–67

  30. Zheng Z, Zhao Y, Zuo Z, Cao L (2010) An efficient GA-based algorithm for mining negative sequential patterns. In: Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD). Springer, pp 262–273

Acknowledgements

The authors would like to thank the REPERES Team from Rennes University Hospital for taking the time to discuss our case study results. We would also like to thank M. Boumghar, L. Pierre and D. Lagarde for raising interesting issues and providing the dataset about customer relationship management. Finally, we would like to thank the reviewers for their insightful comments.

Author information

Correspondence to Thomas Guyet.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Responsible editor: Panagiotis Papapetrou.

Appendices

Proofs

Proof of Proposition 2

Let \(\varvec{s}=\langle s_1\ \dots \ s_n\rangle \) be a sequence and \(\varvec{p}=\langle p_1\ \dots \ p_m\rangle \) be a negative sequential pattern. Let \(\varvec{e}=(e_i)_{i\in [m]}\in [n]^m\) be a soft-embedding of pattern \(\varvec{p}\) in sequence \(\varvec{s}\). Then, the definition matches the one for strict-embedding if \(p_i\) is positive. If \(p_i\) is negative then \(\forall j\in [e_{i-1}+1,e_{i+1}-1],\; p_i \nsubseteq _Ds_j\), i.e. \(\forall j\in [e_{i-1}+1,e_{i+1}-1],\; \forall \alpha \in p_i,\; \alpha \notin s_j\) and then \(\forall \alpha \in p_i,\; \forall j\in [e_{i-1}+1,e_{i+1}-1],\; \alpha \notin s_j\). Thus, it implies that \(\forall \alpha \in p_i,\; \alpha \notin \bigcup _{j\in [e_{i-1}+1,e_{i+1}-1]} s_j\), i.e. by definition, \(p_i \nsubseteq _D\bigcup _{j\in [e_{i-1}+1,e_{i+1}-1]} s_j\).

The same reasoning applies in the reverse direction, which proves the equivalence. \(\square \)
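The equivalence can also be checked mechanically on small cases. The following Python sketch is only an illustration, not part of the original proof: the function names and the set-based encoding of itemsets are assumptions. It encodes total non-inclusion \(\nsubseteq _D\) as used in the proof, and compares the itemset-by-itemset test (soft-embedding) with the test on the union of the window (strict-embedding) over all small windows.

```python
from itertools import combinations, product

def totally_not_included(p, s):
    """p is totally not included in s: no item of p occurs in s."""
    return all(item not in s for item in p)

def absent_soft(p, window):
    """Soft-embedding test: p is totally not included in each itemset."""
    return all(totally_not_included(p, s) for s in window)

def absent_strict(p, window):
    """Strict-embedding test: p is totally not included in the union."""
    union = set().union(*window)
    return totally_not_included(p, union)

def subsets(vocab):
    """All subsets of the vocabulary, including the empty itemset."""
    return [set(c) for r in range(len(vocab) + 1)
            for c in combinations(vocab, r)]

def equivalent_on_all_small_cases(vocab="abc", max_len=2):
    """Exhaustively compare both tests on windows of at most max_len itemsets."""
    itemsets = subsets(vocab)
    for p in itemsets:
        if not p:
            continue  # negative itemsets are assumed non-empty here
        for n in range(max_len + 1):
            for window in product(itemsets, repeat=n):
                if absent_soft(p, window) != absent_strict(p, window):
                    return False
    return True
```

The exhaustive check is of course not a proof; it only confirms, on a three-item vocabulary, the swap of the two universal quantifiers used in the argument above.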

Proof of Proposition 3 (Anti-monotonicity of NSP)

Let \(\varvec{p} = \langle p_1\ \lnot q_1\ p_2\ \lnot q_2\ \dots \ p_{k-1}\ \lnot q_{k-1}\ p_{k}\rangle \) and \(\varvec{p}' =\langle p'_1\ \lnot q'_1\ p'_2\ \lnot q'_2\ \dots \ p'_{k'-1}\ \lnot q'_{k'-1}\ p'_{k'}\rangle \) be two NSPs s.t. \(\varvec{p}\lhd \varvec{p}'\). By definition, we have \(k\le k'\).

To prove the anti-monotonicity, we prove that any embedding \((e_i)_{i\in [k']}\) of \(\varvec{p}'\) in a sequence \(\varvec{s}\) generates an embedding of \(\varvec{p}\) in \(\varvec{s}\).

Let \(\varvec{s}=\langle s_1\ \dots \ s_n\rangle \) be a sequence s.t. \( \varvec{p}'\preceq \varvec{s}\), i.e. there exists an embedding \((e_i)_{i\in [k']}\):

  • \(\forall i,\; e_{i+1}>e_i\),

  • \(\forall i,\; p'_i\subseteq s_{e_i}\),

  • \(\forall i,\;\forall j\in [e_i+1,e_{i+1}-1],\; q'_i\nsubseteq _Ds_j\)

By definitions of \(\lhd \) and embedding,

  1. (i)

    \(\forall i\in [k],\; p_i\subseteq p'_i\subseteq s_{e_i}\), (\(e_i\) exists because \(k\le k'\))

  2. (ii)

    \(\forall i\in [k-1],\;\forall j\in [e_i+1,e_{i+1}-1],\; q'_i\nsubseteq _Ds_j\), and thus \(q_i\nsubseteq _Ds_j\) (because \(\nsubseteq _D\) is anti-monotone and \(q_i\subseteq q'_i\))

This means that \((e_i)_{i\in [k]}\) is an embedding of \( \varvec{p}\) in \(\varvec{s}\). \(\square \)
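As an illustration of this property, the sketch below counts supports on a toy dataset. It is a minimal sketch under stated assumptions, not the NegPSpan implementation: a pattern is encoded as a list `pos` of positive itemsets and a list `neg` of negative itemsets (`neg[i]` sits between `pos[i]` and `pos[i+1]`), negation uses total non-inclusion, no gap constraint is applied, and the dataset and all names are hypothetical.

```python
def occurs(pos, neg, seq, start=0, i=0):
    """True if the pattern <pos[0] ¬neg[0] pos[1] ... pos[k-1]> has an
    embedding in seq, placing pos[i] at some position >= start."""
    if i == len(pos):
        return True
    for e in range(start, len(seq)):
        # Try to place the i-th positive itemset at position e.
        if pos[i] <= seq[e] and occurs(pos, neg, seq, e + 1, i + 1):
            return True
        # Otherwise position e falls inside the gap before pos[i]:
        # it must not contain any item of the preceding negative itemset.
        if i > 0 and neg[i - 1] & seq[e]:
            return False
    return False

def support(pos, neg, dataset):
    """Number of sequences of the dataset in which the pattern occurs."""
    return sum(occurs(pos, neg, seq) for seq in dataset)

# Hypothetical toy dataset: each sequence is a list of itemsets.
DATASET = [
    [{"a"}, {"x"}, {"d", "e"}],
    [{"a"}, {"c"}, {"d"}],
    [{"a"}, {"b"}, {"d", "e"}],
]

# p' = <a ¬(b c) (d e)> and its sub-pattern p = <a ¬b d>.
P_SUP = ([{"a"}, {"d", "e"}], [{"b", "c"}])
P_SUB = ([{"a"}, {"d"}], [{"b"}])

print(support(*P_SUB, DATASET), support(*P_SUP, DATASET))
```

On this dataset the sub-pattern is supported by two sequences and the larger pattern by one, which is consistent with the anti-monotonicity stated by Proposition 3.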

Proof of Proposition 4 (Anti-monotonicity of the support of Constrained NSP)

The proof of this property is similar to the proof of Proposition 3.

Let \(\varvec{p} = \langle p_1\ \lnot q_1\ p_2\ \lnot q_2\ \dots \ p_{k-1}\ \lnot q_{k-1}\ p_{k}\rangle \) and \(\varvec{p}' =\langle p'_1\ \lnot q'_1\ p'_2\ \lnot q'_2\ \dots \ p'_{k'-1}\ \lnot q'_{k'-1}\ p'_{k'}\rangle \) be two NSPs s.t. \(\varvec{p}\lhd \varvec{p}'\). Let \(\varvec{s}=\langle s_1\ \dots \ s_n\rangle \) be a sequence s.t. \( \varvec{p}'\preceq \varvec{s}\), i.e. there exists an embedding \((e_i)_{i\in [k']}\):

  • \(\forall i,\; e_{i+1}>e_i\) (embedding), \(e_{i+1}-e_i\le \theta \) (maxgap) and \(e_{k'}-e_1\le \tau \) (maxspan),

  • \(\forall i,\; p'_i\subseteq s_{e_i}\),

  • \(\forall i,\;\forall j\in [e_i+1,e_{i+1}-1],\; q'_i\nsubseteq _Ds_j\)

To prove that \(\varvec{p}\preceq \varvec{s}\), we prove that \((e_i)_{i\in [k]}\) is an embedding of \( \varvec{p}\) in \(\varvec{s}\).

Let us first consider that \(k=k'\), then by definitions of \(\lhd \) and the embedding,

  1. (i)

    \(\forall i\in [k],\; p_i\subseteq p'_i\subseteq s_{e_i}\),

  2. (ii)

    \(\forall i\in [k-1],\;\forall j\in [e_i+1,e_{i+1}-1],\; q'_i\nsubseteq _Ds_j\), and thus \(q_i\nsubseteq _Ds_j\) (because of the anti-monotonicity of \(\nsubseteq _D\) and \(q_i\subseteq q'_i\))

In addition, we know that the maxgap and maxspan constraints are satisfied by the embedding, i.e.

  1. (iii)

    \(\forall i\in [k-1],\;e_{i+1}-e_i\le \theta \)

  2. (iv)

    \(e_{k}-e_1=e_{k'}-e_1\le \tau \)

Let us now consider that \(k'>k\). Then (i), (ii) and (iii) still hold, and since \(e_{k}<e_{k'}\) (embedding property), we have \(e_{k}-e_1<e_{k'}-e_1\le \tau \), so the maxspan constraint is also satisfied.

This means that \((e_i)_{i\in [k]}\) is an embedding of \(\varvec{p}\) in \(\varvec{s}\) that satisfies gap constraints. \(\square \)
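The argument can be illustrated by checking a given embedding and its truncation. The sketch below is only an illustration under assumptions: positions are 0-based, `maxgap` and `maxspan` play the roles of \(\theta \) and \(\tau \), negation uses total non-inclusion, and the function name, pattern encoding and example sequence are hypothetical.

```python
def is_embedding(e, pos, neg, seq, maxgap, maxspan):
    """Check that positions e (0-based) embed <pos[0] ¬neg[0] pos[1] ...>
    in seq under the maxgap and maxspan constraints."""
    k = len(pos)
    if len(e) != k or any(not 0 <= x < len(seq) for x in e):
        return False
    # Each positive itemset must be included in the itemset it is mapped to.
    if any(not pos[i] <= seq[e[i]] for i in range(k)):
        return False
    for i in range(k - 1):
        # Positions are strictly increasing and respect the maxgap.
        if not 0 < e[i + 1] - e[i] <= maxgap:
            return False
        # No item of the negative itemset occurs strictly between them.
        if any(neg[i] & seq[j] for j in range(e[i] + 1, e[i + 1])):
            return False
    # The whole occurrence must fit within the maxspan.
    return k == 0 or e[-1] - e[0] <= maxspan

SEQ = [{"a"}, {"x"}, {"b", "c"}, {"y"}, {"d"}]

# p' = <a ¬z (b c) ¬w d> with embedding (0, 2, 4).
POS_SUP, NEG_SUP, E_SUP = [{"a"}, {"b", "c"}, {"d"}], [{"z"}, {"w"}], (0, 2, 4)
# Sub-pattern p = <a b>, checked with the truncated embedding (0, 2).
POS_SUB, NEG_SUB, E_SUB = [{"a"}, {"b"}], [set()], E_SUP[:2]

print(is_embedding(E_SUP, POS_SUP, NEG_SUP, SEQ, maxgap=2, maxspan=4))
print(is_embedding(E_SUB, POS_SUB, NEG_SUB, SEQ, maxgap=2, maxspan=4))
```

Truncating the embedding can only shorten the span and drops gap checks without adding any, which is why the truncated positions remain valid under the same constraints.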

Proof of Proposition 5

(Complete and correct algorithm) The correctness of the algorithm is given by lines 2–3 of Algorithm 1: a pattern is output only if it is frequent (line 2).

We now prove the completeness of the algorithm. First of all, we have to prove that any pattern can be reached using a path of elementary transformations (\(\leadsto \in \{\leadsto _n, \leadsto _s, \leadsto _c\}\)). Let \(\varvec{p}'=\langle p'_1 \dots p'_{m}\rangle \) be a pattern with a total amount of n items, \(n>0\), then it is possible to define \(\varvec{p}\) such that \(\varvec{p} \leadsto \varvec{p}'\) where \(\leadsto \in \{\leadsto _n, \leadsto _s, \leadsto _c\}\), and \(\varvec{p}\) will have exactly \(n-1\) items:

  • if the last itemset of \(\varvec{p}'\) is such that \(|p'_{m}|>1\) we define \(\varvec{p}=\langle p'_1 \dots p'_{m-1}\ p_{m}\rangle \) as the pattern with the same prefix as \(\varvec{p}'\) and an additional itemset, \(p_{m}\) such that \(|p_{m}|=|p_{m}'|-1\) and \(p_{m} \subset p_{m}'\): then \(\varvec{p} \leadsto _c \varvec{p}'\)

  • if the last itemset of \(\varvec{p}'\) is such that \(|p'_{m}|=1\) and \(p'_{m-1}\) is positive then we define \(\varvec{p}=\langle p'_1 \dots p'_{m-2}p'_{m-1}\rangle \): then \(\varvec{p} \leadsto _s \varvec{p}'\)

  • if the last itemset of \(\varvec{p}'\) is such that \(|p'_{m}|=1\) and \(p'_{m-1}\) is negative (non-empty) then we define \(\varvec{p}=\langle p'_1 \dots p_{m-1}\ p'_{m}\rangle \) where \(p_{m-1}\) is such that \(|p_{m-1}|=|p_{m-1}'|-1\) and \(p_{m-1} \subset p_{m-1}'\): then \(\varvec{p} \leadsto _n \varvec{p}'\)

Applying these rules recursively, for any pattern \(\varvec{p}\) there is a path from the empty sequence to \(\varvec{p}\): \(\emptyset \leadsto ^* \varvec{p}\). Also, exactly one of the three extensions can be used at any step, meaning that this path is unique. This proves that our algorithm is not redundant.
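The three cases above can be sketched as a predecessor function. This is a hedged illustration rather than the authors' code: a pattern is encoded as a list of `(is_negative, items)` pairs with `items` a sorted tuple, and the item removed in the composition and negative cases is assumed to be the last one in that order (the text only requires some subset with one item less).

```python
def predecessor(pattern):
    """Return (p, rule) such that p leads to `pattern` by one elementary
    transformation: 'c' (composition), 's' (sequence) or 'n' (negative)."""
    if not pattern:
        raise ValueError("the empty pattern has no predecessor")
    is_neg, last = pattern[-1]
    assert not is_neg, "a pattern is assumed to end with a positive itemset"
    if len(last) > 1:
        # Case 1: shrink the last positive itemset by one item.
        return pattern[:-1] + [(False, last[:-1])], "c"
    if len(pattern) >= 2 and pattern[-2][0]:
        # Case 3: shrink the negative itemset preceding the last itemset;
        # it disappears when it becomes empty.
        shrunk = pattern[-2][1][:-1]
        middle = [(True, shrunk)] if shrunk else []
        return pattern[:-2] + middle + [pattern[-1]], "n"
    # Case 2: drop the last (singleton) positive itemset.
    return pattern[:-1], "s"

def path_to_empty(pattern):
    """Rules applied when walking back from `pattern` to the empty pattern."""
    rules = []
    while pattern:
        pattern, rule = predecessor(pattern)
        rules.append(rule)
    return rules

# <a ¬(b c) d (e f)>: 6 items in total, hence a path of 6 steps.
EXAMPLE = [(False, ("a",)), (True, ("b", "c")), (False, ("d",)), (False, ("e", "f"))]
print(path_to_empty(EXAMPLE))
```

Each step removes exactly one item, so the length of the path equals the number of items of the pattern, as stated in the proof.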

Second, the pruning strategy is correct, so, no frequent pattern will be missed. This comes from the anti-monotonicity property.

Let \(\varvec{p}\) and \(\varvec{p}'\) be two patterns such that \(\varvec{p} \leadsto \varvec{p}'\) where \(\leadsto \in \{\leadsto _n, \leadsto _s, \leadsto _c\}\), then it follows directly from the definitions that \(\varvec{p} \lhd \varvec{p}'\). Let us now consider a path \(\varvec{p} \leadsto ^* \varvec{p}'\) from \(\varvec{p}\) to \(\varvec{p}'\); then, by transitivity of \(\lhd \), we also have \(\varvec{p} \lhd \varvec{p}'\), and by anti-monotonicity of the support, \(supp(\varvec{p}) \ge supp(\varvec{p}') \).

Let us now proceed by contradiction and consider that \(\varvec{p}'\) is a pattern with support \(supp(\varvec{p}')\ge \sigma \) that was not found by the algorithm. This means that for all paths (see Note 7) \(\emptyset \leadsto ^* \varvec{p}'\) there exists some pattern \(\varvec{p}\) such that \(\emptyset \leadsto ^* \varvec{p} \leadsto ^* \varvec{p}'\) with \(supp(\varvec{p})<\sigma \). \(\varvec{p}\) is the pattern that has been used to prune the search for this path to \(\varvec{p}'\). This is not possible considering that \(\varvec{p} \leadsto ^* \varvec{p}'\) and thus \(supp(\varvec{p}) \ge supp(\varvec{p}') \ge \sigma \). \(\square \)

NegPSpan extracts a superset of eNSP

Proposition 6

Soft-embedding \(\implies \) strict-embedding for patterns consisting of items.

Proof

Let \(\varvec{s}=\langle s_1 \dots s_n\rangle \) be a sequence and \(\varvec{p}=\langle p_1 \dots p_m\rangle \) be an NSP s.t. \(|p_i|=1\) for all \(i\in [m]\) and \(\varvec{p}\) occurs in \(\varvec{s}\) according to the soft-embedding semantics.

There exists \(\epsilon =(e_i)_{i\in [m]}\in [n]^m\) s.t. for all \(i\in [m]\), \(p_i\) is positive implies \(p_i\in s_{e_i}\) and \(p_i\) is negative implies that for all \(j\in [e_{i-1}+1,e_{i+1}-1],\; p_i\notin s_j\) (items only). Then \(p_i\notin \bigcup _{j\in [e_{i-1}+1,e_{i+1}-1]}{s_j}\), i.e. \(p_i \nsubseteq _{*}\bigcup _{j\in [e_{i-1}+1,e_{i+1}-1]}{s_j}\) (for either \(\nsubseteq _G\) or \(\nsubseteq _D\)). As a consequence, \(\epsilon \) is a strict-embedding of \(\varvec{p}\). \(\square \)
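The restriction to items matters, as the following sketch suggests. It is an illustration under assumptions: \(\nsubseteq _D\) is total non-inclusion as in the proof of Proposition 2, while \(\nsubseteq _G\) is assumed here to be partial non-inclusion (at least one item of the negative itemset is missing), a definition that is not restated in this appendix. All function names are hypothetical.

```python
from itertools import combinations, product

def not_incl_total(p, s):
    """Total non-inclusion: no item of p occurs in s."""
    return all(x not in s for x in p)

def not_incl_partial(p, s):
    """Partial non-inclusion (assumed definition): some item of p is missing from s."""
    return any(x not in s for x in p)

def soft_absent(p, window, not_incl):
    """Soft-embedding: the non-inclusion holds for every itemset of the window."""
    return all(not_incl(p, s) for s in window)

def strict_absent(p, window, not_incl):
    """Strict-embedding: the non-inclusion holds for the union of the window."""
    return not_incl(p, set().union(*window))

def soft_implies_strict_for_items(vocab="abc", max_len=3):
    """Exhaustive check of the implication for singleton negative itemsets."""
    itemsets = [set(c) for r in range(len(vocab) + 1)
                for c in combinations(vocab, r)]
    for item in vocab:
        for n in range(max_len + 1):
            for window in product(itemsets, repeat=n):
                for rel in (not_incl_total, not_incl_partial):
                    if soft_absent({item}, window, rel) and not strict_absent({item}, window, rel):
                        return False
    return True

# With a negative itemset of size 2 and partial non-inclusion, the
# implication no longer holds on the window <a b>.
WINDOW = [{"a"}, {"b"}]
print(soft_absent({"a", "b"}, WINDOW, not_incl_partial))
print(strict_absent({"a", "b"}, WINDOW, not_incl_partial))
```

Under these assumed definitions, the two semantics coincide for singleton negative itemsets, whereas the window \(\langle a\ b\rangle \) separates them for the negative itemset \((a\ b)\) with partial non-inclusion.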

Proposition 7

Let \({\mathcal {D}}\) be a dataset containing sequences made of items and \(\varvec{p}=\langle p_1\ \dots \ p_m\rangle \) be a sequential pattern extracted by eNSP. Then, without embedding constraints \(\varvec{p}\) is extracted by NegPSpan with the same minimum support.

Proof

If \(\varvec{p}\) is extracted by eNSP, then its positive partner is frequent in the dataset \({\mathcal {D}}\). As a consequence, each \(p_i\), \(i\in [m]\) is a singleton itemset.

According to the search space of NegPSpan defined by \(\lhd \) if \(\varvec{p}\) is frequent then it will be reached by the depth-first search. Then it is sufficient to prove that for any sequence \(\varvec{s}=\langle s_1 \dots s_n\rangle \in {\mathcal {D}}\) such that \(\varvec{p}\) occurs in \(\varvec{s}\) according to eNSP semantics (strict-embedding, strong absence), then \(\varvec{p}\) also occurs in \(\varvec{s}\) according to the NegPSpan semantics (soft-embedding, weak absence). Consequently, considering the same minimum support threshold, \(\varvec{p}\) is frequent according to NegPSpan. Proposition 6 gives this result. \(\square \)

Then we conclude that NegPSpan extracts more patterns than eNSP on sequences of items. In fact, NegPSpan can extract patterns with negative itemsets larger than 2.

eNSP extracts patterns that are not extracted by NegPSpan on sequences of itemsets. Practically, NegPSpan uses a size limit for negative itemsets \(\nu \ge 1\). eNSP extracts patterns whose positive partners are frequent. The positive partner, extracted by PrefixSpan may hold itemsets larger than \(\nu \), and if the pattern with negated itemset is also frequent, then this pattern is extracted by eNSP, but not by NegPSpan.

Additional experiments

Influence of vocabulary size

Fig. 6

Comparison of eNSP and NegPSpan computation time (left) and memory consumption (right) wrt vocabulary size. The dashed line shows the limit for timeout executions

Figure 6 shows the computation time and the memory consumption with respect to vocabulary size: eNSP is run with different values of \(\varsigma \), the minimal frequency of the positive partner of negative patterns (\(100\%\), \(80\%\) and \(20\%\) of the minimal frequency threshold) and NegPSpan is run with a maxgap of 10 or without. The timeout is set to 5 min.

Similarly to the experiments of Sect. 5, with no constraint, NegPSpan is less time-efficient than eNSP but it becomes more time-efficient with a gap constraint whatever the vocabulary size. Memory consumption curves show that NegPSpan requires significantly less memory than eNSP.

Figure 6 clearly shows that the smaller the vocabulary, the more frequent patterns there are, and thus the higher the memory and time requirements. Indeed, the smaller the vocabulary (for a given sequence length), the higher the probability of extracting some sequential pattern. This is the case for positive patterns as well as for negative patterns, considering that the positive part of a negative pattern is more likely to occur in a row (and may necessarily satisfy the negative constraints). The generated datasets then have more positive and negative patterns to extract.

More specifically, there are more positive patterns to extract and thus the memory required by eNSP increases because it has to store all positive patterns (with a support above \(\varsigma \)). We can see that when the vocabulary size decreases, the memory required by eNSP increases very quickly (faster than exponential growth), while NegPSpan requires almost the same amount of memory for any vocabulary size.

The use of a maxgap constraint (\(\tau =10\)) makes NegPSpan several orders of magnitude more time-efficient for small vocabulary sizes. With very small vocabulary sizes (\(<20\)), the number of negative patterns extracted by NegPSpan explodes and the execution time exceeds the timeout of 5 min (300 s). For larger vocabulary sizes, the differences between algorithms disappear. NegPSpan (\(\tau =\infty \)) is more efficient than eNSP for large vocabulary sizes because only a few frequent negative patterns have to be extracted, but there are many positive patterns: in this case eNSP has to evaluate the support of potential negative sequential patterns on the basis of the positive patterns while NegPSpan stops the exploration as soon as an infrequent negative pattern is found. Thus, eNSP with \(\varsigma =0.8\sigma \), which explores more positive patterns than eNSP with \(\varsigma =\sigma \), is less time-efficient.

Influence of average sequence length

Fig. 7

Comparison of computation time (left) and memory consumption (right) between eNSP and NegPSpan wrt average sequence length. The dashed line shows the limit for timeout executions

Figure 7 shows the computation time and memory consumption with respect to average length of sequences with a minimal support \(\sigma =20\%\). eNSP is run with different values for \(\varsigma \), the minimal frequency of the positive partner of negative patterns (\(100\%\), \(80\%\) and \(20\%\) of the minimal frequency threshold) and NegPSpan is run with a maxgap \(\tau =10\) or without maxgap constraint. The timeout is set to 5 min.

Computation times and memory consumptions are exponential with respect to the average sequence length. Curves differ by their factors of exponential growth. In the remainder of this section \(\alpha \) represents the exponential growth.

Figure 7 on the left compares the computation times. The exponential growth of NegPSpan without maxgap (\(\alpha \approx 10^{-11}\)) is high and the timeout is reached for datasets with an average sequence length of about 30 itemsets. eNSP is one order of magnitude more time-efficient and can analyze datasets with an average sequence length of about 45 itemsets. However, parameter \(\varsigma \) does not change the computation time significantly. Indeed, the exponential growths are close to each other (\(\alpha _{\varsigma =\sigma }\approx 5.08\times 10^{-12}\), \(\alpha _{\varsigma =0.8\sigma }\approx 2.42\times 10^{-12}\), \(\alpha _{\varsigma =0.5\sigma }\approx 2.72\times 10^{-12}\)). In contrast, the use of the maxgap constraint (\(\tau =10\)) significantly changes the difficulty of the task: the exponential growth is significantly lower (\(\alpha \approx 10^{-53}\)) and the timeout is not reached even for sequences containing about 120 itemsets. This result was expected considering that the maxgap constraint avoids exploring the full sequence to evaluate the pattern support. Using NegPSpan with maxgap constraints is thus very time-efficient to mine negative patterns in long sequences.

Figure 7 on the right compares the memory consumption of the two algorithms. The curves look very similar to the computation time curves, but it is important to note that the memory requirement increases significantly more slowly with NegPSpan than with eNSP, with or without the maxgap constraint. The explanation is similar to that of the previous benchmark results: the depth-first search strategy does not store patterns, while eNSP has to store them to evaluate the support of negative patterns.

Computational performances on case study datasets

This appendix presents the computational performances of eNSP and NegPSpan on two datasets of the case studies (see Sect.  6): instacart and care pathway analysis.

Instacart data

Fig. 8

Computation times (on top), memory requirements (in the middle) and numbers of NSPs (at the bottom) extracted from the Instacart dataset with respect to the gap constraint \(\tau \) for NegPSpan (on the left) and with respect to frequent positive ratio (\(\varsigma \)) for eNSP (on the right)

Figure 8 shows the computation times, the memory requirements and the number of NSPs extracted by both algorithms on the Instacart dataset (see Sect. 6.1). On this dataset, we can first note that the number of patterns extracted by NegPSpan is about two orders of magnitude larger. As a consequence, the computation time is higher even with strong gap constraints: NegPSpan takes about 1000 s with \(\tau =2\) while eNSP always takes less than 500 s to extract \(1\%\) NSPs (whatever \(\varsigma \)). These results can be explained by the dataset features. With a large vocabulary, the support of patterns decreases rapidly when the pattern length increases. This means that eNSP prunes a lot of patterns while extracting the positive partners, which explains why few patterns are extracted. In contrast with this strong pruning strategy, NegPSpan explores many potential negative extensions because most of them could be frequent due to the relatively low frequency of each item. eNSP seems more efficient on this dataset, but we recall that it failed to explore large datasets, not because of computation time, but because of heavy memory requirements. Figure 8 (middle) shows this limitation well: the memory requirement is several orders of magnitude larger for eNSP than for NegPSpan and it increases exponentially when \(\varsigma \) decreases.

Care pathway analysis

Figure 9 gives a comparison of the computation performances (time and memory usage) between eNSP and NegPSpan with respect to the minimal frequency threshold (\(\sigma \in [0.08,0.3]\)) on the care pathway dataset (see Sect. 6.2). Each algorithm is run with different settings: the maxgap constraint \(\tau \in \{3,5,8\}\) for NegPSpan and the minimal support of positive partners \(\varsigma \in \{0.4\sigma ,0.8\sigma ,\sigma \}\) for eNSP. The maximal pattern length is set to \(l=5\). The results obtained on real data confirm the results obtained in Sect. 5 on synthetic data. On the one hand, NegPSpan requires several orders of magnitude less memory than eNSP. eNSP does not terminate with the lowest \(\sigma \) and \(\varsigma \) values: its memory requirement exceeds the computer capacity (8 GB). On the other hand, this heavy memory requirement has consequences on computation times and NegPSpan is several orders of magnitude more time-efficient than eNSP whatever the settings. We observe that the computation time increases exponentially when the frequency threshold (\(\sigma \)) decreases, and that the lower the maxgap, the lower the computation time. This is mainly due to the number of extracted patterns, which also grows exponentially when the frequency threshold decreases.

Fig. 9

Time, memory usage and number of extracted patterns for eNSP and NegPSpan with respect to the minimal frequency support, and different algorithm settings (care pathways dataset, see Sect. 6.2)

List of extracted patterns for the CRM dataset

This Appendix provides the complete list of negative sequential patterns involving 3SER or 2VIE negative items extracted by eNSP or NegPSpan (Tables 9, 10, 11).

Table 9 NSP extracted by NegPSpan but not by eNSP
Table 10 NSP extracted by eNSP but not by NegPSpan
Table 11 NSP extracted by both eNSP and NegPSpan

About this article

Cite this article

Guyet, T., Quiniou, R. NegPSpan: efficient extraction of negative sequential patterns with embedding constraints. Data Min Knowl Disc 34, 563–609 (2020). https://doi.org/10.1007/s10618-019-00672-w

Keywords

  • Sequential patterns mining
  • Pattern semantics
  • Absence modeling
  • Negative containment