Generating Incomplete Data with DataZapper

  • Conference paper
Agents and Artificial Intelligence (ICAART 2009)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 67)

Abstract

A nearly universal problem with real data is that they are incomplete, with some values missing. Furthermore, the ways in which values can go missing are quite varied, with arbitrary interdependencies between variables and their values leading to missing values. In order to test and compare data mining algorithms it is necessary to generate artificial data which have the same characteristics. We introduce DataZapper, a tool for uncreating data. Given a dataset containing joint samples over variables, DataZapper will make a specified percentage of observed values disappear, replaced by an indication that the measurement failed. DataZapper also supports any kind of dependence, and any degree of dependence, in its generation of missing values. We illustrate its use in a machine learning experiment and offer it to the data mining and machine learning communities.
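The two behaviours described in the abstract, removing a specified percentage of observed values and making that removal depend on other variables, can be illustrated with a minimal Python sketch. This is not DataZapper's interface or algorithm, which this page does not describe. The function names `zap_independent` and `zap_dependent`, their parameters, and the use of `None` as the missing-value marker are all hypothetical choices made for illustration.

```python
import random

MISSING = None  # hypothetical marker for "the measurement failed"


def zap_independent(rows, fraction, seed=0):
    """Blank out round(fraction * n_cells) cells chosen uniformly at random.

    Missingness here does not depend on any value in the data.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    k = round(fraction * len(cells))
    out = [list(row) for row in rows]  # copy so the input stays intact
    for i, j in rng.sample(cells, k):
        out[i][j] = MISSING
    return out


def zap_dependent(rows, target, parent, trigger,
                  p_if_trigger, p_otherwise, seed=0):
    """Blank out column `target` with a probability that depends on the
    value observed in column `parent` of the same row.

    The gap between the two probabilities sets the degree of dependence.
    """
    rng = random.Random(seed)
    out = [list(row) for row in rows]
    for row in out:
        p = p_if_trigger if row[parent] == trigger else p_otherwise
        if rng.random() < p:
            row[target] = MISSING
    return out


# Toy dataset: 100 joint samples over three discrete variables.
data = [[i % 2, i % 3, i % 5] for i in range(100)]

# Remove 10% of all cells, independently of the data.
uniform = zap_independent(data, 0.10)

# Remove variable 2 exactly when variable 0 equals 1 (full dependence).
dependent = zap_dependent(data, target=2, parent=0, trigger=1,
                          p_if_trigger=1.0, p_otherwise=0.0)
```

Setting `p_if_trigger` and `p_otherwise` to the same value in this sketch removes the dependence, while moving them apart strengthens it. A tool supporting "any kind of dependence" would need a richer specification than a single parent variable and trigger value.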

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wen, Y., Korb, K.B., Nicholson, A.E. (2010). Generating Incomplete Data with DataZapper. In: Filipe, J., Fred, A., Sharp, B. (eds) Agents and Artificial Intelligence. ICAART 2009. Communications in Computer and Information Science, vol 67. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11819-7_9

  • DOI: https://doi.org/10.1007/978-3-642-11819-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11818-0

  • Online ISBN: 978-3-642-11819-7

  • eBook Packages: Computer Science, Computer Science (R0)
