Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Berti-Équille, L. (2007). Measuring and Modelling Data Quality for Quality-Awareness in Data Mining. In: Guillet, F.J., Hamilton, H.J. (eds) Quality Measures in Data Mining. Studies in Computational Intelligence, vol 43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44918-8_5
DOI: https://doi.org/10.1007/978-3-540-44918-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44911-9
Online ISBN: 978-3-540-44918-8
eBook Packages: Engineering, Engineering (R0)