Multiple Data Quality Evaluation and Data Cleaning on Imprecise Temporal Data

  • Xiaoou Ding
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11158)

Abstract

As data currency issues draw the attention of both researchers and engineers, temporal data, which describes real-world events with time tags in databases, is playing a key role in data warehousing, data mining, and related areas. At the same time, the 4V features of big data make comprehensive data quality management and data cleaning difficult. On the one hand, entity resolution methods face challenges when dealing with temporal data. On the other hand, multiple problems that coexist in data records are hard to capture and repair. Motivated by this, we address data quality evaluation and data cleaning on imprecise temporal data. This project aims to solve three key problems in temporal data quality improvement and cleaning: (1) determining currency on imprecise temporal data, (2) entity resolution on temporal data with incomplete timestamps, and (3) improving data quality in terms of consistency and completeness with data currency. The purpose of this paper is to present the problem definitions and to discuss the procedural framework and the solutions for improving the effectiveness of cleaning temporal data with multiple errors.

Keywords

Temporal data, Data currency, Multiple data cleaning, Data quality

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Harbin Institute of Technology, Harbin, China
