
Text Categorization with Diversity Random Forests

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8836)

Abstract

Text categorization (TC) has several characteristic traits, such as large and difficult category taxonomies, noisy data, and incremental data. Random Forests, one of the most important yet simple state-of-the-art ensemble methods, has been applied to such problems with good performance. Most current Random Forests approaches that address diversity focus on maximizing tree diversity while producing and training the component trees. In TC, component trees trained on noisy data with huge numbers of categories and features exhibit highly diverse characteristics. Consequently, given the numerous component trees of an original Random Forest, we propose a novel method, Diversity Random Forests, which adaptively selects and combines tree classifiers through diversity learning and sample weighting. Diversity Random Forests involves two key issues. First, by designing a novel matrix over the data distribution, we formulate a unified optimization model for learning and selecting diverse trees, in which tree weights are learned through a convex quadratic programming problem with given sample weights. Second, we propose a new self-training algorithm that iteratively runs the convex optimization and automatically learns the sample weights. Extensive experiments on a variety of text categorization benchmark data sets show that the proposed approach consistently outperforms state-of-the-art methods.
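As a concrete illustration of the two-step scheme, here is a minimal sketch in Python. Since the abstract does not specify the paper's actual data-distribution matrix or weight-update rule, the surrogate objective below (sample-weighted accuracy plus a pairwise-agreement diversity penalty), the regularization strength lam, the exponential reweighting rule, and the iteration count are all assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: alternate a convex QP over tree weights with a
# sample-weight update, mimicking the two steps described in the abstract.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Toy stand-in for a TC data set (bag-of-words features would be used in practice).
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 0: grow an ordinary Random Forest to obtain the component trees.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
T, n = len(rf.estimators_), len(y_val)

# P[i, j] = 1 if tree i classifies validation sample j correctly, else 0.
P = np.array([(tree.predict(X_val) == y_val).astype(float)
              for tree in rf.estimators_])
A = P @ P.T / n   # pairwise tree agreement; a Gram matrix, hence PSD
lam = 0.5         # diversity regularization strength (assumed value)

def solve_tree_weights(s):
    """Convex QP: maximize sample-weighted accuracy minus a redundancy
    penalty, over the probability simplex of tree weights."""
    obj = lambda w: -w @ (P @ s) + lam * (w @ A @ w)
    cons = [{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}]
    res = minimize(obj, np.full(T, 1.0 / T), method='SLSQP',
                   bounds=[(0.0, None)] * T, constraints=cons)
    return res.x

# Self-training loop: alternate the QP with a sample-weight update that
# upweights the samples the current weighted ensemble gets wrong.
s = np.full(n, 1.0 / n)
for _ in range(5):                       # iteration count is illustrative
    w = solve_tree_weights(s)
    correctness = w @ P                  # weighted per-sample correctness in [0, 1]
    s = np.exp(-correctness)
    s /= s.sum()

selected = w > 1e-3                      # near-zero weights drop a tree entirely
print(f"kept {selected.sum()} of {T} trees")
```

Because the agreement matrix A = P Pᵀ / n is a Gram matrix and therefore positive semidefinite, the tree-weight subproblem remains a convex quadratic program, consistent with the convexity claim in the abstract; trees whose learned weights shrink to near zero are effectively deselected, which realizes the tree-selection step.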




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Yang, C., Yin, XC., Huang, K. (2014). Text Categorization with Diversity Random Forests. In: Loo, C.K., Yap, K.S., Wong, K.W., Beng Jin, A.T., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8836. Springer, Cham. https://doi.org/10.1007/978-3-319-12643-2_39

  • DOI: https://doi.org/10.1007/978-3-319-12643-2_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12642-5

  • Online ISBN: 978-3-319-12643-2

  • eBook Packages: Computer Science, Computer Science (R0)
