Journal of Intelligent Information Systems, Volume 50, Issue 1, pp 63–96

Redescription mining augmented with random forest of multi-target predictive clustering trees

  • Matej Mihelčić
  • Sašo Džeroski
  • Nada Lavrač
  • Tomislav Šmuc
Article

Abstract

In this work, we present a redescription mining algorithm that uses a Random Forest of Predictive Clustering Trees (RFPCTs) to generate and iteratively improve a set of redescriptions. The approach uses information about element membership in different queries, generated from a single constructed PCT, to explore the redescription space, while queries obtained from the Random Forest of PCTs increase candidate diversity. The approach is able to produce highly accurate, statistically significant redescriptions described by Boolean, nominal or numerical attributes. In contrast to current tree-based approaches, which use multi-class or binary classification, we explore the benefits of using multi-label classification and multi-target regression to create redescriptions. A major benefit of the approach, compared to other state-of-the-art solutions, is that it does not require specifying a minimal threshold on redescription accuracy to obtain a highly accurate, optimized set of redescriptions. The process of Random Forest based augmentation and the different modes of redescription set creation are evaluated on three datasets with different properties. We use the same datasets to compare the performance of our algorithm to state-of-the-art redescription mining approaches.
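
The abstract refers to redescription accuracy without defining it. In the redescription mining literature, accuracy is commonly measured as the Jaccard index between the sets of entities selected by two queries, one over each view of the data. The following is a minimal sketch under that assumption; the attribute names, the toy entities and both queries are hypothetical and are not taken from the datasets used in the article.

```python
from typing import Callable, Dict, List, Set

Entity = Dict[str, float]

# Hypothetical toy data: "temp" and "rain" stand in for one view,
# "species_a" for the other. None of these values come from the article.
ENTITIES: List[Entity] = [
    {"id": 0, "temp": 12.0, "rain": 800.0, "species_a": 1},
    {"id": 1, "temp": 15.0, "rain": 900.0, "species_a": 1},
    {"id": 2, "temp": 5.0, "rain": 1000.0, "species_a": 0},
    {"id": 3, "temp": 11.0, "rain": 750.0, "species_a": 0},
    {"id": 4, "temp": 8.0, "rain": 600.0, "species_a": 1},
]


def support(query: Callable[[Entity], bool], entities: List[Entity]) -> Set[int]:
    """Return the ids of all entities for which the query holds."""
    return {int(e["id"]) for e in entities if query(e)}


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard index |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def q1(e: Entity) -> bool:
    """Hypothetical query over the first view (numerical attributes)."""
    return e["temp"] >= 10.0 and e["rain"] >= 700.0


def q2(e: Entity) -> bool:
    """Hypothetical query over the second view (Boolean attribute)."""
    return e["species_a"] == 1


s1 = support(q1, ENTITIES)
s2 = support(q2, ENTITIES)
accuracy = jaccard(s1, s2)
print(f"support(q1)={sorted(s1)} support(q2)={sorted(s2)} accuracy={accuracy:.2f}")
```

A pair of queries whose supports coincide exactly would score 1.0; the closer the score is to 1.0, the more closely the two queries describe the same set of entities.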

Keywords

Knowledge discovery, Redescription mining, Random forest, Predictive clustering trees, World countries, Computer science bibliography, Bioclimatic niches

Notes

Acknowledgments

The authors would like to acknowledge the European Commission’s support through the MAESTRA project (Gr. no. 612944), the MULTIPLEX project (Gr. no. 317532), the InnoMol project (Gr. no. 316289), and the support of the Croatian Science Foundation (Pr. no. 9623: Machine Learning Algorithms for Insightful Analysis of Complex Data Structures).

Supplementary material

10844_2017_448_MOESM1_ESM.pdf (285 kb)

Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Matej Mihelčić (1, 3; corresponding author)
  • Sašo Džeroski (2, 3)
  • Nada Lavrač (2, 3)
  • Tomislav Šmuc (1)

  1. Ruđer Bošković Institute, Zagreb, Croatia
  2. Jožef Stefan Institute, Ljubljana, Slovenia
  3. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
