Abstract
Predictive power is an important objective for current business performance measurement systems. It depends on metrics design, data collection and preprocessing, and predictive modeling. A promising but less studied preprocessing activity is to construct additional features that can be interpreted as expressing the quality of the data, thus giving predictive models not only data points but also their quality characteristics. The research problem addressed in this study is: can we improve the predictive power of business performance measurement systems by constructing additional data quality features? Unsupervised, supervised and domain knowledge approaches were used to operationalize eight features based on elementary data quality dimensions. In case studies of five corporate datasets (Toyota Material Handling Finland, Innolink Group, 3StepIt, Papua Merchandising and Lempesti), the constructed data quality features performed better than minimally processed datasets in 29/38 tests and equally well in 9/38 tests. A comparison with a competing method, preprocessing combinations, on the first two datasets showed that the constructed features had slightly lower prediction performance but were clearly better in execution time and ease of use. Additionally, the constructed data quality features helped to visually explore high-dimensional data quality patterns. Further research is needed to expand the range of constructed features and to map the findings systematically to data quality concepts and practices.
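The idea of handing a model quality characteristics alongside the data points can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the function name `add_quality_features` is hypothetical, missing values are marked with `None`, and every column has at least one observed value. The two features shown (share of missing values per row and the largest robust z-score per row) are illustrative stand-ins for unsupervised completeness and outlier indicators, not the eight features operationalized in the study.

```python
from statistics import median


def add_quality_features(rows):
    """Append two illustrative data quality features to each row.

    rows: list of equal-length lists of floats, with None marking a
    missing value (an assumption of this sketch).
    Returns new rows: original values + [missing_share, max_robust_z].
    """
    n_cols = len(rows[0])

    # Per-column median and median absolute deviation (MAD),
    # computed over observed (non-missing) values only.
    stats = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        med = median(observed)
        mad = median(abs(v - med) for v in observed)
        stats.append((med, mad))

    out = []
    for r in rows:
        # Completeness-style feature: fraction of missing cells in the row.
        missing_share = sum(v is None for v in r) / n_cols
        # Outlier-style feature: largest robust z-score among observed
        # cells; columns with zero MAD are skipped, default is 0.0.
        zs = [
            abs(v - med) / mad
            for v, (med, mad) in zip(r, stats)
            if v is not None and mad > 0
        ]
        out.append(list(r) + [missing_share, max(zs, default=0.0)])
    return out


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, None], [3.0, 10.0], [100.0, 10.0]]
    for row in add_quality_features(data):
        print(row)
```

The augmented rows can then be passed to any predictive model in place of the original ones, so that the model sees both the values and these per-row quality indicators.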
Acknowledgements
Professor emeritus Pertti Järvinen, Professor Martti Juhola and Dr. Kati Iltanen, University of Tampere, Finland. After-sales director Jarmo Laamanen, Toyota Material Handling Finland; managing director Marko Kukkola, Innolink Group; sales director Mika Karjalainen, 3StepIt; managing director Olli Vaaranen, Papua Merchandising; and managing director Sirpa Kauppila, Lempesti.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Vattulainen, M. (2015). Improving the Predictive Power of Business Performance Measurement Systems by Constructed Data Quality Features? Five Cases. In: Perner, P. (ed.) Advances in Data Mining: Applications and Theoretical Aspects. ICDM 2015. Lecture Notes in Computer Science, vol 9165. Springer, Cham. https://doi.org/10.1007/978-3-319-20910-4_1
DOI: https://doi.org/10.1007/978-3-319-20910-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-20909-8
Online ISBN: 978-3-319-20910-4
eBook Packages: Computer Science (R0)