Enhancing Textual Data Quality in Data Mining: Case Study and Experiences

Feng, Yi; Ju, Chunhua

doi:10.1007/978-3-642-40319-4_34


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7867)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Dirty data is recognized as a top challenge for data mining. Textual data is one type of data whose quality deserves more attention, so that the knowledge discovered from it is reliable. In this paper, we focus on textual data quality (TDQ) in data mining. Based on our years of data mining experience, we highlight three typical TDQ dimensions and their related problems: representation granularity, representation consistency, and completeness. Then, to provide a real-world example of how to enhance TDQ in data mining, we present a detailed case study set in data mining for traditional Chinese medicine, covering three typical TDQ problems and their corresponding solutions. The case study is expected to help data analysts and miners attach more importance to TDQ issues and enhance TDQ for more reliable data mining.

Author information

Authors and Affiliations

School of Computer Science & Information Engineering, Zhejiang Gongshang University, Hangzhou, 310018, P.R. China
Yi Feng & Chunhua Ju
Contemporary Business and Trade Research Center, Zhejiang Gongshang University, Hangzhou, 310018, P.R. China
Chunhua Ju

Editor information

Editors and Affiliations

School of Information Technology and Mathematical Sciences, University of South Australia, 1 Mawson Lakes Boulevard, 5095, Adelaide, SA, Australia
Jiuyong Li
Advanced Analytics Institute, University of Technology, 2-12 Blackfriars Street, Chippendale, Blackfriars Campus, 2008, Sydney, NSW, Australia
Longbing Cao & Can Wang
Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, 117576, Singapore, Singapore
Kay Chen Tan
School of Automation, Guangdong University of Technology, No. 100 Waihuan Xi Road, Panyu District, 510006, Guangzhou, China
Bo Liu
School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, 701, Tainan, Taiwan
Vincent S. Tseng

About this paper

Cite this paper

Feng, Y., Ju, C. (2013). Enhancing Textual Data Quality in Data Mining: Case Study and Experiences. In: Li, J., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_34

DOI: https://doi.org/10.1007/978-3-642-40319-4_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40318-7
Online ISBN: 978-3-642-40319-4
eBook Packages: Computer Science, Computer Science (R0)
