Measuring and Modelling Data Quality for Quality-Awareness in Data Mining

Berti-Équille, Laure

doi:10.1007/978-3-540-44918-8_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 43)

Author information

Authors and Affiliations

IRISA, University of Rennes I, France
Laure Berti-Équille

Editor information

Editors and Affiliations

LINA-CNRS FRE 2729, Ecole polytechnique de l'université de Nantes, Rue Christian-Pauc-La Chantrerie, 60601, 44306, NANTES Cedex 3, France
Fabrice J. Guillet
Department of Computer Science, University of Regina, SK S4S 0A2, Regina, Canada
Howard J. Hamilton

About this chapter

Cite this chapter

Berti-Équille, L. (2007). Measuring and Modelling Data Quality for Quality-Awareness in Data Mining. In: Guillet, F.J., Hamilton, H.J. (eds) Quality Measures in Data Mining. Studies in Computational Intelligence, vol 43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44918-8_5

DOI: https://doi.org/10.1007/978-3-540-44918-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44911-9
Online ISBN: 978-3-540-44918-8
eBook Packages: Engineering, Engineering (R0)
