Abstract
With data currency issues draw the attentions of both researchers and engineers, temporal data, which describes real world events with time tags in database, is playing a key role in data warehouse, data mining, and etc. At the same time, 4V features of big data give rise to the difficulties in comprehensive data quality management and data cleaning. On one hand, entity resolution methods are faced with challenges when dealing with temporal data. On another hand, multiple problems existing in data records are hard to be captured and repaired. Motivated by this, we address data quality evaluation and data cleaning issues in imprecise temporal data. This project aims to solve three key problems in temporal data quality improvement and cleaning: (1) Determining currency on imprecise temporal data, (2) Entity resolution on temporal data with incomplete timestamps, and (3) Data quality improvement on consistency and completeness with data currency. The purpose of this paper is to address the problem definitions and discuss the procedure framework and the solutions of improving the effectiveness of temporal data cleaning with multiple errors.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
UNIMATCH: a record linkage system: users manual. In: Bureau of the Census, Washington DC (1976)
Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: International Conference on Very Large Data Bases, pp. 586–597 (2002)
Batini, C., Cappiello, C., Francalanci, C., Maurino, A.: Methodologies for data quality assessment and improvement. ACM Comput. Surv. 41(3), 16 (2009)
Bertiequille, L., Sarma, A.D., Dong, Marian, A., Srivastava, D.: Sailing the information ocean with awareness of currents: discovery and application of source dependence. Computer. Science 26(8), 1881–3 (2009)
Cappiello, C., Francalanci, C., Pernici, B.: Time related factors of data accuracy, completeness, and currency in multi-channel information systems. In: The Conference on Advanced Information Systems Engineering, pp. 145–153 (2008)
Chiang, Y.H., Doan, A.H., Naughton, J.F.: Tracking entities in the dynamic world: a fast algorithm for matching temporal records. Proc. VLDB Endow. 7, 469–480 (2014)
Chu, X., Ilyas, I.F., Papotti, P., Ye, Y.: Ruleminer: data quality rules discovery. In: IEEE International Conference on Data Engineering, pp. 1222–1225 (2014)
Cohen, W.W.: Integration of heterogeneous databases without common domains using queries based on textual similarity. In: ACM SIGMOD International Conference on Management of Data, pp. 201–212 (1998)
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: consistency and accuracy. In: International Conference on Very Large Data Bases, pp. 315–326 (2007)
Deng, T., Fan, W., Geerts, F.: Capturing missing tuples and missing values. In: Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2010, Indianapolis, Indiana, USA, 6–11 June 2010, pp. 169–178 (2010)
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE Trans. Knowl. Data Eng. 19(1), 1–16 (2007)
Fan, W., Geerts, F.: Foundations of Data Quality Management (2012)
Fan, W., Geerts, F., Jia, X.: Conditional dependencies: a principled approach to improving data quality. In: Sexton, A.P. (ed.) BNCOD 2009. LNCS, vol. 5588, pp. 8–20. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02843-4_4
Fan, W., Geerts, F., Ma, S., Tang, N., Yu, W.: Data quality problems beyond consistency and deduplication. In: Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W.-C., Fourman, M. (eds.) In Search of Elegance in the Theory and Practice of Computation. LNCS, vol. 8000, pp. 237–249. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41660-6_12
Fan, W., Geerts, F., Wijsen, J.: Determining the currency of data. ACM Trans. Database Syst. 37(4), 71–82 (2012)
Fan, W., Jia, X., Li, J., Ma, S.: Reasoning about record matching rules. Proc. VLDB Endow. 2(1), 407–418 (2009)
Fei, C., Miller, R.J.: A unified model for data and constraint repair. In: IEEE International Conference on Data Engineering, pp. 446–457 (2011)
Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
Koudas, N., Marathe, A., Srivastava, D.: Flexible string matching against large databases in practice. In: Thirtieth International Conference on Very Large Data Bases, pp. 1078–1086 (2004)
Li, L., Li, J., Gao, H.: Rule-based method for entity resolution. IEEE Trans. Knowl. Data Eng. 27(1), 250–263 (2015)
Pei, L.I., Dong, X.L., Maurino, A., Srivastava, D.: Linking temporal records. PVLDB 4(11), 956–967 (2011)
Richman, J., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 475–480 (2002)
Sidi, F., Panahy, P.H.S., Affendey, L.S., Jabar, M.A., Ibrahim, H., Mustapha, A.: Data quality: a survey of data quality dimensions. In: International Conference on Information Retrieval and Knowledge Management, pp. 300–304 (2012)
Ullmann, J.R.: A binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Comput. J. 20(2), 141–147 (1977)
Verykios, V.S., Moustakides, G.V., Elfeky, M.G.: A bayesian decision model for cost optimal record matching. VLDB J. 12(1), 28–40 (2003)
Verykios, V.S., Elmagarmid, A.K., Houstis, E.N.: Automating the approximate record-matching process. Inf. Sci. 126(1–4), 83–98 (2002)
Wang, R.Y., Strong, D.M.: Beyond accuracy: what data quality means to data consumers. J. Manag. Inf. Syst. 12(4), 5–33 (1996)
Whang, S.E., Garcia-Molina, H.: Entity resolution with evolving rules. Proc. VLDB Endow. 3(1–2), 1326–1337 (2010)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Ding, X. (2018). Multiple Data Quality Evaluation and Data Cleaning on Imprecise Temporal Data. In: Woo, C., Lu, J., Li, Z., Ling, T., Li, G., Lee, M. (eds) Advances in Conceptual Modeling. ER 2018. Lecture Notes in Computer Science(), vol 11158. Springer, Cham. https://doi.org/10.1007/978-3-030-01391-2_14
Download citation
DOI: https://doi.org/10.1007/978-3-030-01391-2_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-01390-5
Online ISBN: 978-3-030-01391-2
eBook Packages: Computer ScienceComputer Science (R0)