
Part of the book series: Massive Computing (MACO, volume 6)

Abstract

A typical data mining project uses data collected for various purposes, ranging from routinely gathered data to data collected for process improvement projects and data retained for archival purposes. In some cases the set of considered features may be large (a wide data set) and sufficient for the extraction of knowledge. In other cases the data set may be narrow and insufficient for extracting meaningful knowledge, or the data may not exist at all.

Mining wide data sets has received attention in the literature, and many feature selection models and algorithms have been developed for them.
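
To make the idea concrete, the following minimal Python sketch ranks the columns of a wide data set with a simple filter-style score (a Fisher-score-like ratio of between-class separation to within-class spread) and keeps the k highest-scoring features. This is an illustrative example only; the chapter does not prescribe this particular score, and the data and names are hypothetical.

    # Minimal filter-style feature selection for a wide data set.
    # The scoring rule below is a common, generic choice, not the chapter's method.
    from statistics import mean, pstdev

    def fisher_like_score(values, labels):
        """Score one feature: separation of the two class means
        relative to the pooled within-class spread."""
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        spread = pstdev(pos) + pstdev(neg)
        if spread == 0:
            return 0.0
        return abs(mean(pos) - mean(neg)) / spread

    def select_features(rows, labels, k):
        """Rank the columns of a wide data set and keep the k best."""
        n_features = len(rows[0])
        scored = [
            (fisher_like_score([row[j] for row in rows], labels), j)
            for j in range(n_features)
        ]
        scored.sort(reverse=True)
        return [j for _, j in scored[:k]]

    # Toy data set: 4 observations, 5 features, binary labels.
    rows = [
        [0.9, 10.2, 5.0, 0.1, 3.3],
        [1.1,  9.8, 5.1, 0.2, 3.1],
        [4.0, 10.1, 5.0, 0.9, 3.2],
        [4.2,  9.9, 5.2, 1.0, 3.4],
    ]
    labels = [0, 0, 1, 1]
    print(select_features(rows, labels, k=2))  # -> [0, 3]

Filter methods of this kind scale well to very wide data sets because each feature is scored independently of the others.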

Determining the features for which data should be collected, either in the absence of an existing data set or when a data set is only partially available, has not been sufficiently addressed in the literature. Yet this issue is of paramount importance as interest in data mining grows. The methods and process for defining the most appropriate features for data collection, data transformation, data quality assessment, and data analysis are referred to as data farming. This chapter outlines the elements of a data farming discipline.
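
As a rough illustration of how these elements fit together, the short Python sketch below walks through planning which features to collect, checking the quality of what was actually collected, and transforming the usable features before analysis. The thresholds, field names, and helper functions are assumptions made for the example, not methods taken from the chapter.

    def missing_rate(column):
        """Quality check: fraction of missing (None) entries in a feature."""
        return sum(v is None for v in column) / len(column)

    def min_max_scale(column):
        """Transformation: rescale a numeric feature to the [0, 1] range."""
        observed = [v for v in column if v is not None]
        lo, hi = min(observed), max(observed)
        span = (hi - lo) or 1.0
        return [None if v is None else (v - lo) / span for v in column]

    # Step 1: decide which features are worth collecting at all (the "farming" step).
    planned = ["temperature", "pressure", "operator_shift"]

    # Step 2: what was actually collected ("operator_shift" never materialized).
    collected = {
        "temperature": [20.5, 21.0, None, 22.5],
        "pressure": [1.0, 1.1, 1.2, 1.1],
    }

    # Steps 3-4: assess data quality, then transform the usable features for analysis.
    usable = {}
    for name in planned:
        column = collected.get(name)
        if column is None or missing_rate(column) > 0.5:
            continue  # flag this feature for future collection or repair
        usable[name] = min_max_scale(column)

    print(usable)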

Triantaphyllou, E. and G. Felici (Eds.), Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques, Massive Computing Series, Springer, Heidelberg, Germany, pp. 279–304, 2006.

Copyright information

© 2006 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Kusiak, A. (2006). Data Farming: Concepts and Methods. In: Triantaphyllou, E., Felici, G. (eds) Data Mining and Knowledge Discovery Approaches Based on Rule Induction Techniques. Massive Computing, vol 6. Springer, Boston, MA. https://doi.org/10.1007/0-387-34296-6_8

  • DOI: https://doi.org/10.1007/0-387-34296-6_8

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-34294-8

  • Online ISBN: 978-0-387-34296-2

  • eBook Packages: Computer Science, Computer Science (R0)
