Abstract
We live in a world of information abundance, surplus, and access. We have technologies to acquire any type of information, but we still face the challenge of extracting the valuable knowledge underlying it. Data analysis and mining processes may be severely impaired whenever data are corrupted by noise, ambiguity, and distortions.
This paper aims to provide a systematic procedure for cleaning data in single-file data sources without a schema, which may be corrupted by the most common data problems. The methodology is guided by the dimensions of data quality standards and focuses on the goal of performing reasonable subsequent statistical analyses.
Acknowledgements
Luís Paulo Reis and Alexandra Oliveira were partially funded by the European Regional Development Fund through the programme COMPETE by FCT (Portugal) in the scope of the project PEst-UID/CEC/00027/2015 and QVida+: Estimação Contínua de Qualidade de Vida para Auxílio Eficaz à Decisão Clínica, NORTE010247FEDER003446, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement.
Rita Gaio was partially supported by CMUP (UID/MAT/00144/2019), which is funded by FCT with national (MCTES) and European structural funds through the programs FEDER, under the partnership agreement PT2020.
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Oliveira, A., Gaio, R., Baylina, P., Rebelo, C., Reis, L.P. (2019). Data Quality Mining. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) New Knowledge in Information Systems and Technologies. WorldCIST'19 2019. Advances in Intelligent Systems and Computing, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-16181-1_34
DOI: https://doi.org/10.1007/978-3-030-16181-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-16180-4
Online ISBN: 978-3-030-16181-1
eBook Packages: Intelligent Technologies and Robotics (R0)