Efficient mining of the most significant patterns with permutation testing

Abstract

The extraction of patterns displaying significant association with a class label is a key data mining task with wide application in many domains. We introduce and study a variant of the problem that requires to mine the top-k statistically significant patterns, thus providing tight control on the number of patterns reported in output. We develop TopKWY, the first algorithm to mine the top-k significant patterns while rigorously controlling the family-wise error rate of the output, and provide theoretical evidence of its effectiveness. TopKWY crucially relies on a novel strategy to explore statistically significant patterns and on several key implementation choices, which may be of independent interest. Our extensive experimental evaluation shows that TopKWY enables the extraction of the most significant patterns from large datasets which could not be analyzed by the state-of-the-art. In addition, TopKWY improves over the state-of-the-art even for the extraction of all significant patterns.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10

Notes

  1. 1.

    This assumes that the search tree for patterns has the property that the children of a node have support not greater than the node itself, which is a usual property of pattern mining algorithms (Han et al. 2007; Uno et al. 2005; Nijssen and Kok 2004) and is required by WYlight as well.

  2. 2.

    http://fimi.ua.ac.be.

  3. 3.

    https://archive.ics.uci.edu/ml/index.php.

  4. 4.

    https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

  5. 5.

    http://www.philippe-fournier-viger.com/spmf.

  6. 6.

    https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

References

  1. Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. SIGMOD Rec 22:207–216

    Article  Google Scholar 

  2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international confereence on very large data bases (VLDB ’94), San Francisco, CA, USA. Morgan Kaufmann Publishers Inc, pp 487–499

  3. Atzmueller M (2015) Subgroup discovery. Wiley Interdiscip Rev Data Min Knowl Discov 5(1):35–49

    Article  Google Scholar 

  4. Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. ACM Sigmod Rec 27(2):85–93

    Article  Google Scholar 

  5. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Ro Stat Soc Ser B (Methodol) 57:289–300

    MathSciNet  MATH  Google Scholar 

  6. Bonferroni C (1936) Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8:3–62

    MATH  Google Scholar 

  7. Dong G, Bailey J (2012) Contrast data mining: concepts, algorithms, and applications. CRC Press, Boca Raton

    Google Scholar 

  8. Duivesteijn W, Knobbe A (2011) Exploiting false discoveries–statistical validation of patterns and quality measures in subgroup discovery. In: 2011 IEEE 11th international conference on data mining. IEEE, pp 151–160

  9. Fisher RA (1922) On the interpretation of \(\chi \) 2 from contingency tables, and the calculation of p. J R Stat Soc 85(1):87–94

    Article  Google Scholar 

  10. Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data (TKDD) 1(3):14

    Article  Google Scholar 

  11. Hämäläinen W (2012) Kingfisher: an efficient algorithm for searching for both positive and negative dependency rules with statistical significance measures. Knowl Inf Syst 32(2):383–414

    Article  Google Scholar 

  12. Hämäläinen W, Webb GI (2019) A tutorial on statistically sound pattern discovery. Data Min Knowl Disc 33(2):325–377

    MathSciNet  Article  Google Scholar 

  13. Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: Chen W, Naughton JF, Bernstein PA (eds) SIGMOD conference. ACM, New YorkD, pp 1–12

    Google Scholar 

  14. Han J, Wang J, Lu Y, Tzvetkov P (2002) Mining top-k frequent closed patterns without minimum support. In: Proceedings 2002 IEEE international conference on data mining, 2002. ICDM 2003. IEEE, pp 211–218

  15. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Mining Knowl Discov 15:55–86

    MathSciNet  Article  Google Scholar 

  16. Herrera F, Carmona CJ, González P, Del Jesus MJ (2011) An overview on subgroup discovery: foundations and applications. Knowl Inf Syst 29(3):495–525

    Article  Google Scholar 

  17. Komiyama J, Ishihata M, Arimura H, Nishibayashi T, Minato S-I (2017) Statistical emerging pattern mining with multiple testing correction. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 897–906

  18. Lehmann EL, Romano JP (2012) Generalizations of the familywise error rate. In: Selected works of EL Lehmann. Springer, pp 719–735

  19. Li J, Liu J, Toivonen H, Satou K, Sun Y, Sun B (2014) Discovering statistically non-redundant subgroups. Knowl-Based Syst 67:315–327

    Article  Google Scholar 

  20. Llinares-López F, Sugiyama M, Papaxanthos L, Borgwardt K (2015) Fast and memory-efficient significant pattern mining via permutation testing. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 725–734

  21. Minato S-i, Uno T, Tsuda K, Terada A, Sese J (2014) A fast method of statistical assessment for combinatorial hypotheses based on frequent itemset enumeration. In: Joint European conference on machine learning and knowledge discovery in databases. Springer, pp 422–436

  22. Nijssen S, Kok JN (2004) A quickstart in frequent structure mining can make a difference. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 647–652

  23. Nijssen S, Kok JN (2006) Frequent subgraph miners: runtimes don’t say everything. In: MLG 2006, p 173

  24. Papaxanthos L, Llinares-López F, Bodenham D, Borgwardt K (2016) Finding significant combinations of features in the presence of categorical covariates. In: Advances in neural information processing systems, pp 2279–2287

  25. Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: International conference on database theory. Springer, pp 398–416

  26. Pietracaprina A, Vandin F (2007) Efficient incremental mining of top-K frequent closed itemsets. In: Discovery science, volume 4755 of lecture notes in computer science. Springer, Berlin Heidelberg, pp 275–280

  27. Pellegrina L, Vandin F (2018) Efficient mining of the most significant patterns with permutation testing. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, pp 2070–2079

  28. Pellegrina L, Riondato M, Vandin F (2019a) Hypothesis testing and statistically-sound pattern mining. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. ACM, pp 3215–3216

  29. Pellegrina L, Riondato M, Vandin F (2019b) Spumante: Significant pattern mining with unconditional testing. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining-KDD, vol 19

  30. Tarone R (1990) A modified bonferroni method for discrete data. Biometrics 515–522

  31. Terada A, Okada-Hatakeyama M, Tsuda K, Sese J (2013a) Statistical significance of combinatorial regulations. Proc Nat Acad Sci 110(32):12996–13001

    MathSciNet  Article  Google Scholar 

  32. Terada A, Tsuda K, Sese J (2013b) Fast westfall-young permutation procedure for combinatorial regulation discovery. In: 2013 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 153–158

  33. Terada A, Kim H, Sese J (2015) High-speed westfall-young permutation procedure for genome-wide association studies. In: Proceedings of the 6th ACM conference on bioinformatics, computational biology and health informatics. ACM, pp 17–26

  34. Terada A, Tsuda K et al (2016) Significant pattern mining with confounding variables. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 277–289

  35. Uno T, Kiyomi M, Arimura H (2005) Lcm ver. 3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations. ACM, pp 77–86

  36. van der Laan MJ, Dudoit S, Pollard KS (2004) Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Stat Appl Genet Mol Biol 3(1):1–25

    MathSciNet  MATH  Google Scholar 

  37. Webb GI (2006) Discovering significant rules. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 434–443

  38. Webb GI (2007) Discovering significant patterns. Mach Learn 68(1):1–33

    MathSciNet  Article  Google Scholar 

  39. Webb GI (2008) Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach Learn 71(2–3):307–323

    Article  Google Scholar 

  40. Westfall PH, Young SS (1993) Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley Series in Probability and Statistics, Hoboknen

    Google Scholar 

  41. Wörlein M, Meinl T, Fischer I, Philippsen M (2005) A quantitative comparison of the subgraph miners mofa, gspan, ffsm, and gaston. In: European conference on principles of data mining and knowledge discovery. Springer, pp 392–403

  42. Zandolin D, Pietracaprina A (2003) Mining frequent itemsets using patricia tries. In: Proceedings of FIMI03, vol 90

Download references

Acknowledgements

This work is supported, in part by the National Science Foundation grant IIS-1247581 (https://www.nsf.gov/awardsearch/showAward?AWD_ID=1247581), by the University of Padova grants SID2017 and STARS: Algorithms for Inferential Data Mining, and by MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data).

Author information

Affiliations

Authors

Corresponding author

Correspondence to Fabio Vandin.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this work appeared in the proceedings of ACM KDD’18 as (Pellegrina and Vandin 2018).

Responsible editor: M. J. Zaki.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Pellegrina, L., Vandin, F. Efficient mining of the most significant patterns with permutation testing. Data Min Knowl Disc 34, 1201–1234 (2020). https://doi.org/10.1007/s10618-020-00687-8

Download citation

Keywords

  • Statistical pattern mining
  • Hypothesis testing
  • Top-k patterns