Variable Transformation for Granularity Change in Hierarchical Databases in Actual Data Mining Solutions

  • Conference paper
Intelligent Data Engineering and Automated Learning – IDEAL 2015 (IDEAL 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9375)

Abstract

This paper presents a variable transformation strategy for enriching the information content of variables and for defining the project target in real-world data mining applications built on relational databases that hold data at different grains. In a real solution for assessing school quality from official school survey data and student test data, variables at the student and teacher grains had to become features of the schools to which they belonged. The formal problem was how to summarize the relevant information content of the attribute distributions in a few summarizing concepts (features). Instead of the typical lowest-order moments of the distribution, the proposed transformations are based on the distribution histogram and produce a weighted score for each input variable. Following the CRISP-DM method, the problem was precisely defined as a binary decision problem on a granularly transformed student grade. The proposed granular transformation embedded additional human expert knowledge in the input variables at the school level. Logistic regression produced a classification score for good schools, and the AUC_ROC and Max_KS metrics assessed the performance of that score on statistically independent datasets. A 10-fold cross-validation experimental procedure showed that this domain-driven data mining approach produced a statistically significant improvement, at a 0.99 confidence level, over the usual approach based on the central tendency of the distribution.
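As a rough illustration of the idea summarized in the abstract, the sketch below rolls student-level grades up to the school grain in two ways: the usual mean, and a histogram-based weighted score. It also includes a small Max_KS helper for scoring a binary classifier. The function names, bin edges, and bin weights are hypothetical choices made for this example only; the abstract does not give the paper's actual bins, weights, or implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def histogram_score(values, bin_edges, bin_weights):
    """Weighted score of a distribution: sum over bins of (fraction in bin) * (bin weight).

    bin_edges must be sorted; k edges define k + 1 bins, so k + 1 weights are expected.
    The weights stand in for expert knowledge (hypothetical values in this sketch).
    """
    if len(bin_weights) != len(bin_edges) + 1:
        raise ValueError("need exactly one weight per bin")
    if not values:
        raise ValueError("cannot score an empty distribution")
    counts = [0] * len(bin_weights)
    for v in values:
        counts[bisect_right(bin_edges, v)] += 1
    n = len(values)
    return sum(w * c / n for w, c in zip(bin_weights, counts))


def school_features(student_rows, bin_edges, bin_weights):
    """Aggregate (school_id, grade) rows from the student grain to the school grain."""
    by_school = defaultdict(list)
    for school_id, grade in student_rows:
        by_school[school_id].append(grade)
    return {
        school: {
            "mean": sum(grades) / len(grades),
            "hist_score": histogram_score(grades, bin_edges, bin_weights),
        }
        for school, grades in by_school.items()
    }


def max_ks(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (s, y) in enumerate(pairs):
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Compare the two cumulative fractions only at the end of a run of tied scores.
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best


# Hypothetical data: two schools with the same mean grade but different spreads.
rows = [("A", 2.0), ("A", 9.0), ("B", 5.0), ("B", 6.0)]
edges = [4.0, 7.0]            # bins: below 4, from 4 up to 7, 7 and above
weights = [-2.0, 0.0, 1.0]    # assumed expert weighting that penalizes low grades
features = school_features(rows, edges, weights)
```

In this toy data both schools have a mean grade of 5.5, so the central-tendency feature cannot separate them, while the histogram-based score differs because school A has a student in the lowest bin. This only illustrates the kind of information a histogram-based feature can retain; it is not a reproduction of the paper's experiments.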

Acknowledgments

The author would like to thank Mr. Fábio C. Pereira for running the experiments.

Author information

Corresponding author

Correspondence to Paulo J. L. Adeodato.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Adeodato, P.J.L. (2015). Variable Transformation for Granularity Change in Hierarchical Databases in Actual Data Mining Solutions. In: Jackowski, K., Burduk, R., Walkowiak, K., Wozniak, M., Yin, H. (eds) Intelligent Data Engineering and Automated Learning – IDEAL 2015. IDEAL 2015. Lecture Notes in Computer Science, vol 9375. Springer, Cham. https://doi.org/10.1007/978-3-319-24834-9_18

  • DOI: https://doi.org/10.1007/978-3-319-24834-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24833-2

  • Online ISBN: 978-3-319-24834-9

  • eBook Packages: Computer Science (R0)
