
Incremental Classification Using Tree-Based Sampling for Large Data

Chapter in: Instance Selection and Construction for Data Mining

Abstract

We present ICE, an efficient method for incremental classification that employs tree-based sampling and is independent of the data distribution. The basic idea is to represent the class distribution in the dataset with weighted samples, which are extracted from the nodes of intermediate decision trees using a clustering technique. As the data grows, an intermediate classifier is built only on the incremental portion of the data. The weighted samples from the intermediate classifier are then combined with the previously generated samples to obtain an up-to-date classifier for the current data in an efficient, incremental fashion.






Copyright information

© 2001 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Yoon, H., Alsabti, K., Ranka, S. (2001). Incremental Classification Using Tree-Based Sampling for Large Data. In: Liu, H., Motoda, H. (eds) Instance Selection and Construction for Data Mining. The Springer International Series in Engineering and Computer Science, vol 608. Springer, Boston, MA. https://doi.org/10.1007/978-1-4757-3359-4_11


  • DOI: https://doi.org/10.1007/978-1-4757-3359-4_11

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-1-4419-4861-8

  • Online ISBN: 978-1-4757-3359-4

  • eBook Packages: Springer Book Archive
