Enhancing Textual Data Quality in Data Mining: Case Study and Experiences

  • Conference paper
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7867)

Abstract

Dirty data is recognized as a top challenge for data mining. Textual data is one type of data whose quality deserves closer study, so that the knowledge discovered from it can be trusted. In this paper, we focus on textual data quality (TDQ) in data mining. Based on our years of data mining experience, we highlight three typical TDQ dimensions and their related problems: representation granularity, representation consistency, and completeness. Then, to provide a real-world example of how to enhance TDQ in data mining, we present a detailed case study set in data mining for traditional Chinese medicine, covering three typical TDQ problems and their corresponding solutions. The case study is expected to help data analysts and miners attach more importance to TDQ issues and enhance TDQ for more reliable data mining.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Feng, Y., Ju, C. (2013). Enhancing Textual Data Quality in Data Mining: Case Study and Experiences. In: Li, J., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7867. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40319-4_34

  • DOI: https://doi.org/10.1007/978-3-642-40319-4_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40318-7

  • Online ISBN: 978-3-642-40319-4

  • eBook Packages: Computer Science, Computer Science (R0)
