Variable Transformation for Granularity Change in Hierarchical Databases in Actual Data Mining Solutions

  • Paulo J. L. Adeodato
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9375)


This paper presents a variable transformation strategy for enriching the information content of variables and for defining the project target in real-world data mining applications built on relational databases whose data sit at different grains. In a deployed solution for assessing school quality from official school survey data and student test data, variables at the student and teacher grains had to become features of the schools to which they belonged. The formal problem was how to summarize the relevant information content of the attribute distributions in a few summarizing concepts (features). Instead of the typical lowest-order moments of the distribution, the proposed transformations, based on the distribution histogram, produced a weighted score for the input variables. Following the CRISP-DM method, the problem was precisely defined as a binary decision problem on a granularly transformed student grade. The proposed granular transformation embedded additional human expert knowledge in the input variables at the school level. Logistic regression produced a classification score for good schools, and the AUC_ROC and Max_KS metrics assessed the performance of that score on statistically independent datasets. A 10-fold cross-validation experimental procedure showed that this domain-driven data mining approach produced a statistically significant improvement at a 0.99 confidence level over the usual approach based on the central tendency of the distribution.
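The contrast between a central-tendency summary and a histogram-based weighted score can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the bin boundaries or the weights, so the function names (`histogram_score`, `to_school_grain`), the bin edges, and the weights below are all hypothetical stand-ins for values a domain expert would supply.

```python
from collections import defaultdict


def histogram_score(values, bin_edges, bin_weights):
    """Weighted score of a distribution.

    The values are binned by `bin_edges` (ascending), and the score is the
    sum over bins of (relative frequency in bin) * (weight of bin).
    ASSUMPTION: edges and weights are hypothetical, expert-chosen inputs.
    """
    if len(bin_weights) != len(bin_edges) + 1:
        raise ValueError("need exactly one weight per bin (edges + 1)")
    if not values:
        raise ValueError("cannot score an empty distribution")
    counts = [0] * len(bin_weights)
    for v in values:
        # Bin index = number of edges that the value reaches or exceeds.
        idx = sum(1 for edge in bin_edges if v >= edge)
        counts[idx] += 1
    total = len(values)
    return sum(w * c / total for w, c in zip(bin_weights, counts))


def to_school_grain(student_rows, bin_edges, bin_weights):
    """Roll (school_id, grade) rows at the student grain up to one record
    per school, keeping both the mean and the histogram-based score."""
    grades_by_school = defaultdict(list)
    for school_id, grade in student_rows:
        grades_by_school[school_id].append(grade)
    return {
        school_id: {
            "mean": sum(grades) / len(grades),
            "hist_score": histogram_score(grades, bin_edges, bin_weights),
        }
        for school_id, grades in grades_by_school.items()
    }


# Hypothetical data: two schools with the same mean grade but different spread.
rows = [
    ("A", 2.0), ("A", 5.0), ("A", 5.0), ("A", 8.0),
    ("B", 5.0), ("B", 5.0), ("B", 5.0), ("B", 5.0),
]
# Hypothetical bins: below 4, from 4 up to 7, and 7 or above; the weights
# penalize low grades more heavily than they reward high ones.
features = to_school_grain(rows, bin_edges=[4.0, 7.0], bin_weights=[-2.0, 0.0, 1.0])
```

In this toy data both schools have the same mean grade, yet the histogram-based score separates them because the weights treat the low and high tails asymmetrically. That is the kind of information a mean alone discards when student-grain data are rolled up to the school grain.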


Keywords: Granularity transformation; Relational data mining; School quality assessment; Educational decision support system; CRISP-DM; Domain-driven data mining; Logistic regression; Ten-fold cross-validation



The author would like to thank Mr. Fábio C. Pereira for running the experiments.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
