Variable Transformation for Granularity Change in Hierarchical Databases in Actual Data Mining Solutions
This paper presents a variable transformation strategy for enriching the information content of variables and defining the project target in real-world data mining applications based on relational databases with data at different grains. In a real solution for assessing school quality based on official school survey and student test data, variables at the student and teacher grains had to become features of the schools to which they belonged. The formal problem was how to summarize the relevant information content of the attribute distributions in a few summarizing concepts (features). Instead of the typical lowest-order moments of the distribution, the proposed transformations, based on the distribution histogram, produced a weighted score for the input variables. Following the CRISP-DM method, the problem was precisely defined as a binary decision problem on a granularly transformed student grade. The proposed granular transformation embedded additional human expert knowledge in the input variables at the school level. Logistic regression produced a classification score for good schools, and the AUC_ROC and Max_KS metrics assessed the performance of that score on statistically independent datasets. A 10-fold cross-validation experimental procedure showed that this domain-driven data mining approach produced a statistically significant improvement, at a 0.99 confidence level, over the usual approach based on the central tendency of the distribution.
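The granularity change described above can be illustrated with a minimal sketch, which is not the paper's implementation. It assumes that each student-level value falls into one of a few bins and that a domain expert has assigned one weight per bin; the school-level feature is then the weighted sum of the bin proportions. The cut points, the weights, and the function names `histogram_score` and `school_features` are hypothetical, since the abstract does not specify them.

```python
import bisect
from collections import defaultdict


def histogram_score(values, cut_points, weights):
    """Weighted score of the histogram of `values`.

    cut_points: ascending inner bin boundaries (hypothetical choice).
    weights: one expert-assigned weight per bin, len(cut_points) + 1 items.
    """
    if not values:
        raise ValueError("cannot summarize an empty group")
    if len(weights) != len(cut_points) + 1:
        raise ValueError("need exactly one weight per bin")
    counts = [0] * len(weights)
    for v in values:
        # bisect_right maps a value to the index of the bin it falls into
        counts[bisect.bisect_right(cut_points, v)] += 1
    n = len(values)
    # weighted sum of bin proportions
    return sum(w * c / n for w, c in zip(weights, counts))


def school_features(student_rows, cut_points, weights):
    """Roll (school_id, value) rows at the student grain up to the school grain."""
    by_school = defaultdict(list)
    for school_id, value in student_rows:
        by_school[school_id].append(value)
    return {
        school: {
            # central tendency baseline
            "mean": sum(vals) / len(vals),
            # histogram-based weighted score
            "hist_score": histogram_score(vals, cut_points, weights),
        }
        for school, vals in by_school.items()
    }


if __name__ == "__main__":
    # Toy data: both schools have the same mean grade, but school A has
    # one student in the lowest bin, which the assumed weights penalize.
    rows = [("A", 2), ("A", 9), ("B", 5), ("B", 6)]
    print(school_features(rows, cut_points=[4, 7], weights=[-2.0, 0.0, 1.0]))
```

In this toy example the two schools are indistinguishable by their mean, while the histogram-based score separates them. This is the kind of additional information that a summary of central tendency alone would discard.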
Keywords: Granularity transformation; Relational data mining; School quality assessment; Educational decision support system; CRISP-DM; Domain-driven data mining; Logistic regression; Ten-fold cross-validation
The author would like to thank Mr. Fábio C. Pereira for running the experiments.