Abstract
Preprocessing is often the most time-consuming phase of data analysis, and interdependent data quality issues are a cause of suboptimal modelling results. The design problem addressed in this paper is: what kind of framework can support the visualization of interdependencies between data quality issues for faster and more effective preprocessing? An object framework was designed that uses constructed features as the basis of visualizations. Six real datasets from the business performance measurement system domain were acquired to demonstrate the implementation. The framework was found to be a viable supplement for preprocessing analysis to both the industry practice of exploratory data analysis and the research benchmark of preprocessing combinations.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Vattulainen, M. (2016). Data Quality Visualization for Preprocessing. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2016. Lecture Notes in Computer Science, vol 9728. Springer, Cham. https://doi.org/10.1007/978-3-319-41561-1_32
DOI: https://doi.org/10.1007/978-3-319-41561-1_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41560-4
Online ISBN: 978-3-319-41561-1
eBook Packages: Computer Science, Computer Science (R0)