Data Quality Mining

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 930)

Abstract

We are living in a world of information abundance, surplus, and access. We have technologies to acquire any type of information but we still face the challenge of extracting the underlying valuable knowledge. Data analyses and mining processes may be severely impaired whenever data are corrupted by noise, ambiguity and distortions.

This paper aims to provide a systematic procedure for cleaning schema-less, single-file data sources that may be corrupted by the most common data problems. The methodology is guided by the dimensions of data quality standards and focuses on enabling sound subsequent statistical analyses.
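The procedure itself is not reproduced on this page. As a loose illustration only, the sketch below shows the kind of first-pass checks one might run on a single delimited file with no schema: counting exact duplicate rows, rows whose field count differs from the header, and the mix of value types per column. The function names, the set of missing-value tokens, and the sample data are all assumptions made for this example, not the authors' method.

```python
import csv
import io
from collections import Counter

# Assumed spellings of "missing"; a real file may use others.
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none", "?"}


def infer_type(value):
    """Classify a raw string as 'missing', 'number', or 'text'."""
    v = value.strip()
    if v.lower() in MISSING_TOKENS:
        return "missing"
    try:
        float(v)
        return "number"
    except ValueError:
        return "text"


def profile(csv_text):
    """Return a small quality report for CSV text with a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)

    report = {"n_rows": len(rows), "columns": {}}

    # Exact duplicates: every repeat beyond the first occurrence counts once.
    counts = Counter(tuple(r) for r in rows)
    report["duplicate_rows"] = sum(c - 1 for c in counts.values() if c > 1)

    # Rows whose number of fields does not match the header.
    report["malformed_rows"] = sum(1 for r in rows if len(r) != len(header))

    # Per-column tally of inferred value types; a column mixing 'number'
    # and 'text' hints at inconsistent encoding.
    for i, name in enumerate(header):
        types = Counter(infer_type(r[i]) for r in rows if i < len(r))
        report["columns"][name] = dict(types)

    return report


# Hypothetical sample data, invented for this illustration.
SAMPLE = "id,age,city\n1,34,Porto\n2,NA,Lisbon\n2,NA,Lisbon\n3,forty,Braga\n"

if __name__ == "__main__":
    print(profile(SAMPLE))
```

A report like this only flags candidate problems; deciding whether a repeated row is a true duplicate, or how to treat a missing value, still depends on the analysis the data are meant to support.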

Acknowledgements

Luís Paulo Reis and Alexandra Oliveira were partially funded by the European Regional Development Fund through the programme COMPETE by FCT (Portugal) in the scope of the project PEst-UID/CEC/00027/2015 and QVida+: Estimação Contínua de Qualidade de Vida para Auxílio Eficaz à Decisão Clínica, NORTE010247FEDER003446, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement.

Rita Gaio was partially supported by CMUP (UID/MAT/00144/2019), which is funded by FCT with national (MCTES) and European structural funds through the programs FEDER, under the partnership agreement PT2020.

Corresponding author

Correspondence to Alexandra Oliveira.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this paper

Oliveira, A., Gaio, R., Baylina, P., Rebelo, C., Reis, L.P. (2019). Data Quality Mining. In: Rocha, Á., Adeli, H., Reis, L., Costanzo, S. (eds) New Knowledge in Information Systems and Technologies. WorldCIST'19 2019. Advances in Intelligent Systems and Computing, vol 930. Springer, Cham. https://doi.org/10.1007/978-3-030-16181-1_34
