Items2Data: Generating Synthetic Boolean Datasets from Itemsets

  • Ian Shane Wong
  • Gillian Dobbie
  • Yun Sing Koh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11393)

Abstract

Boolean data is a core data type in machine learning, used to represent categorical and transactional data. Unlike real-valued data, Boolean datasets that satisfy particular constraints are notoriously difficult to design efficiently. Inverse Frequent Itemset Mining (IFM) is the problem of constructing a Boolean dataset that satisfies given support constraints for a set of itemsets. Previous work focuses mainly on the theoretical complexity of IFM; practical solutions either scale poorly or do not satisfy all the constraints. We propose Items2Data, a practical algorithm for generating Boolean datasets that is efficient under specific conditions. We introduce global closure to describe the condition under which a dataset can be constructed efficiently. We evaluate Items2Data on real-world datasets to assess its use in designing synthetic datasets and to analyze its accuracy, scalability, and speed. The results indicate that Items2Data is practical and efficient for generating synthetic Boolean data when the pre-defined itemsets are globally closed.
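
For illustration only, and not taken from the paper: the following minimal Python sketch shows the IFM setting the abstract describes, in which a Boolean dataset (a list of transactions) is sought that satisfies exact support constraints for pre-defined itemsets. The function names (support, satisfies_constraints) and the toy data are hypothetical and stand in for whatever an IFM algorithm such as Items2Data would construct.

from typing import Dict, FrozenSet, List, Set

def support(dataset: List[Set[str]], itemset: FrozenSet[str]) -> int:
    # Number of transactions that contain every item of `itemset`.
    return sum(1 for transaction in dataset if itemset <= transaction)

def satisfies_constraints(dataset: List[Set[str]],
                          constraints: Dict[FrozenSet[str], int]) -> bool:
    # True iff the dataset meets every exact support constraint.
    return all(support(dataset, s) == target
               for s, target in constraints.items())

# IFM input: target supports for a few itemsets.
constraints = {
    frozenset({"a"}): 3,
    frozenset({"b"}): 2,
    frozenset({"a", "b"}): 2,
}

# IFM output: a candidate Boolean dataset; each transaction is the set of
# items that are "true" in that row.
dataset = [{"a", "b"}, {"a", "b"}, {"a"}]

print(satisfies_constraints(dataset, constraints))  # prints: True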

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ian Shane Wong (1)
  • Gillian Dobbie (1)
  • Yun Sing Koh (1)

  1. Department of Computer Science, University of Auckland, Auckland, New Zealand
