Abstract
Dirty data is recognized as a top challenge for data mining. Textual data is one type of data whose quality deserves closer study, so that the knowledge discovered from it is itself reliable. In this paper, we focus on textual data quality (TDQ) in data mining. Drawing on several years of data mining experience, we highlight three typical TDQ dimensions and their related problems: representation granularity, representation consistency, and completeness. To provide a real-world example of how to enhance TDQ in data mining, we then present a detailed case study set in the context of data mining in traditional Chinese medicine, covering three typical TDQ problems and their corresponding solutions. The case study is expected to help data analysts and miners attach more importance to TDQ issues and enhance TDQ for more reliable data mining.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feng, Y., Ju, C. (2013). Enhancing Textual Data Quality in Data Mining: Case Study and Experiences. In: Li, J., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_34
DOI: https://doi.org/10.1007/978-3-642-40319-4_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40318-7
Online ISBN: 978-3-642-40319-4
eBook Packages: Computer Science, Computer Science (R0)