A peek into the black box: exploring classifiers by randomization

Abstract

Classifiers are often opaque and cannot easily be inspected to gain an understanding of which factors are important. We propose an efficient iterative algorithm to find the attributes and dependencies used by any classifier when making predictions. The performance and utility of the algorithm are demonstrated on two synthetic and 26 real-world datasets, using 15 commonly used learning algorithms to generate the classifiers. The empirical investigation shows that the novel algorithm is indeed able to find groupings of interacting attributes exploited by the different classifiers. These groupings allow for finding similarities among classifiers for a single dataset as well as for determining the extent to which different classifiers exploit such interactions in general.
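To make the high-level description above concrete, the following R snippet is a minimal sketch of the general randomization idea, under stated assumptions: it is not the authors' exact algorithm, and the function name group_fidelity, the model, data and group arguments, and the usage example are all hypothetical placeholders. The sketch permutes a candidate group of attributes jointly and measures how much of the classifier's original predictions survive; a low value suggests the classifier relies on that group.

    # Minimal sketch: permute one group of attributes jointly and measure how
    # often the classifier's predictions stay unchanged (prediction fidelity).
    group_fidelity <- function(model, data, group, n_perm = 100) {
      baseline <- predict(model, data)               # predictions on intact data
      mean(replicate(n_perm, {
        permuted <- data
        idx <- sample(nrow(data))                    # one random permutation of the rows
        permuted[, group] <- data[idx, group]        # shuffle the whole group together
        mean(predict(model, permuted) == baseline)   # fraction of predictions preserved
      }))
    }

    # Hypothetical usage with a random forest (Liaw and Wiener 2002):
    # library(randomForest)
    # rf <- randomForest(Species ~ ., data = iris)
    # group_fidelity(rf, iris, c("Petal.Length", "Petal.Width"))

Comparing such fidelities across different candidate groupings is, in spirit, how interacting attribute groups can be distinguished from independent ones; the exact iterative procedure is described in the full article.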

Notes

  1. The algorithm can be downloaded from https://bitbucket.org/aheneliu/goldeneye/ (accessed 7 July 2014) or installed directly using the install_bitbucket command from the devtools package (Wickham and Chang 2014) as follows: install_bitbucket(repo = "goldeneye", username = "aheneliu"); see also the snippet below.
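For reference, the note's installation instructions can be run as the following R commands. The final library() call assumes the installed package is also named goldeneye, which is an assumption; consult the repository for the actual package name.

    # Install devtools (Wickham and Chang 2014) and then the goldeneye code from Bitbucket.
    install.packages("devtools")
    library(devtools)
    install_bitbucket(repo = "goldeneye", username = "aheneliu")
    library(goldeneye)  # assumed package name; check the repository documentation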

References

  • Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowl Based Syst 8(6):373–389

  • Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont

  • Chanda P, Cho YR, Zhang A, Ramanathan M (2009) Mining of attribute interactions using information theoretic metrics. In: IEEE International Conference on Data Mining Workshops, pp 350–355

  • Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3(4):261–283

  • Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–293

  • De Bie T (2011a) An information theoretic framework for data mining. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’11, pp 564–572

  • De Bie T (2011b) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446

  • Domingos P, Pazzani MJ (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Mach Learn 29(2–3):103–130

  • Freitas AA (2001) Understanding the crucial role of attribute interaction in data mining. Artif Intell Rev 16(3):177–199

  • Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18

  • Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, New York, NY, KDD ’09, pp 379–388

  • Henelius A, Korpela J, Puolamäki K (2013) Explaining interval sequences by randomization. In: Blockeel H, Kersting K, Nijssen S, Železný F (eds) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 8188, pp 337–352

  • Hornik K, Buchta C, Zeileis A (2009) Open-source machine learning: R meets Weka. Comput Stat 24(2):225–232. doi:10.1007/s00180-008-0119-7

  • Ishibuchi H, Nojima Y (2007) Analysis of interpretability-accuracy tradeoff of fuzzy systems by multiobjective fuzzy genetics-based machine learning. Int J Approx Reason 44(1):4–31

  • Jakulin A, Bratko I, Smrke D, Demsar J, Zupan B (2003) Attribute interactions in medical data analysis. In: 9th Conference on Artificial Intelligence in Medicine in Europe, pp 229–238

  • Janitza S, Strobl C, Boulesteix AL (2013) An AUC-based permutation variable importance measure for random forests. BMC Bioinform 14:119

  • Johansson U, König R, Niklasson L (2003) Rule extraction from trained neural networks using genetic programming. In: 13th International Conference on Artificial Neural Networks, pp 13–16

  • Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22. http://CRAN.R-project.org/doc/Rnews/

  • Lijffijt J, Papapetrou P, Puolamäki K (2014) A statistical significance testing approach to mining the most informative set of patterns. Data Min Knowl Discov 28:238–263. doi:10.1007/s10618-012-0298-2

  • Misra G, Golshan B, Terzi E (2012) A framework for evaluating the smoothness of data-mining results. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, vol II, pp 660–675

  • Ojala M, Garriga GC (2010) Permutation tests for studying classifier performance. J Mach Learn Res 11:1833–1863

  • Plate T (1999) Accuracy versus interpretability in flexible modeling: implementing a tradeoff using Gaussian process models. Behaviormetrika 26:29–50

  • Pulkkinen P, Koivisto H (2008) Fuzzy classifier identification using decision tree and multiobjective evolutionary algorithms. Int J Approx Reason 48(2):526–543

  • Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106

  • R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org/

  • Segal MR, Cummings MP, Hubbard AE (2001) Relating amino acid sequence to phenotype: analysis of peptide-binding data. Biometrics 57(2):632–643

  • Strobl C, Boulesteix AL, Zeileis A, Hothorn T (2007) Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform 8:25

  • Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307

  • Wickham H, Chang W (2014) devtools: tools to make developing R code easier. R package version 1.5. http://CRAN.R-project.org/package=devtools

  • Zacarias OP, Boström H (2013) Comparing support vector regression and random forests for predicting malaria incidence in Mozambique. In: International Conference on Advances in ICT for Emerging Regions, IEEE, pp 217–221

  • Zhao Z, Liu H (2007) Searching for interacting features. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp 1156–1161

  • Zhao Z, Liu H (2009) Searching for interacting features in subset selection. Intell Data Anal 13(2):207–228

Acknowledgments

AH and KP were partly supported by the Revolution of Knowledge Work project, funded by Tekes. HB and LA were partly supported by the project High-Performance Data Mining for Drug Effect Detection at Stockholm University, funded by the Swedish Foundation for Strategic Research under grant IIS11-0053.

Author information

Corresponding author

Correspondence to Andreas Henelius.

Additional information

Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

Cite this article

Henelius, A., Puolamäki, K., Boström, H. et al. A peek into the black box: exploring classifiers by randomization. Data Min Knowl Disc 28, 1503–1529 (2014). https://doi.org/10.1007/s10618-014-0368-8
