Abstract
Text categorization (TC) has several characteristic difficulties, such as large and complex category taxonomies, noisy data, and incrementally arriving data. Random Forests, a simple yet important state-of-the-art ensemble method, has been applied to such problems with good performance. Most current Random Forests approaches that address diversity focus on maximizing tree diversity while generating and training the component trees. In TC, component trees trained on noisy data with very many categories and features already exhibit highly diverse characteristics. Consequently, given the numerous component trees of the original Random Forests, we propose a novel method, Diversity Random Forests, which adaptively selects and combines tree classifiers through diversity learning and sample weighting. Diversity Random Forests has two key components. First, by designing a matrix that captures the data distribution, we formulate a unified optimization model for learning and selecting diverse trees, in which tree weights are learned by solving a convex quadratic programming problem given the sample weights. Second, we propose a new self-training algorithm that iteratively runs the convex optimization and automatically learns the sample weights. Extensive experiments on a variety of text categorization benchmark data sets show that the proposed approach consistently outperforms state-of-the-art methods.
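The abstract does not give the optimization model, so the following is only a rough sketch of the general pattern it describes: tree weights are fitted by a convex quadratic problem for fixed sample weights, and the two sets of weights are then updated in alternation. Everything specific here is an assumption made for illustration and is not taken from the paper: the margin matrix `margins` (entry +1 if a tree classifies a sample correctly, -1 otherwise), the squared-margin objective, the simplex constraint, the exponential reweighting rule, and all function names. In particular, the sketch omits any explicit diversity term.

```python
import math

# Illustrative sketch only: objective, constraints, and update rule are
# assumptions, not the model from the paper.

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the last index that passes
    return [max(x - theta, 0.0) for x in v]

def learn_tree_weights(margins, sample_weights, steps=500):
    """Projected gradient descent on the convex quadratic
    sum_i s_i * (1 - sum_t w_t * margins[i][t])**2 over the simplex.
    Assumes sample_weights sum to 1 and margins entries are +1 or -1."""
    n_trees = len(margins[0])
    lr = 1.0 / (2.0 * n_trees)  # 1 / (upper bound on the gradient's Lipschitz constant)
    w = [1.0 / n_trees] * n_trees
    for _ in range(steps):
        grad = [0.0] * n_trees
        for row, s in zip(margins, sample_weights):
            residual = 1.0 - sum(w_t * m for w_t, m in zip(w, row))
            for t, m in enumerate(row):
                grad[t] -= 2.0 * s * residual * m
        w = project_to_simplex([w_t - lr * g for w_t, g in zip(w, grad)])
    return w

def update_sample_weights(margins, w):
    """Assumed rule: give more weight to samples with a low weighted-vote margin."""
    raw = [math.exp(-sum(w_t * m for w_t, m in zip(w, row))) for row in margins]
    total = sum(raw)
    return [r / total for r in raw]

def fit(margins, rounds=5):
    """Alternate between the tree-weight problem and sample reweighting."""
    n_samples = len(margins)
    s = [1.0 / n_samples] * n_samples
    w = None
    for _ in range(rounds):
        w = learn_tree_weights(margins, s)
        s = update_sample_weights(margins, w)
    return w, s
```

Calling `fit(margins)` on a small margin matrix returns one weight per tree and one weight per sample. Trees whose weight is driven to zero are effectively dropped from the ensemble, which is one simple way to read "selecting" trees in this setting.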
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Yang, C., Yin, XC., Huang, K. (2014). Text Categorization with Diversity Random Forests. In: Loo, C.K., Yap, K.S., Wong, K.W., Beng Jin, A.T., Huang, K. (eds) Neural Information Processing. ICONIP 2014. Lecture Notes in Computer Science, vol 8836. Springer, Cham. https://doi.org/10.1007/978-3-319-12643-2_39
DOI: https://doi.org/10.1007/978-3-319-12643-2_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12642-5
Online ISBN: 978-3-319-12643-2
eBook Packages: Computer Science, Computer Science (R0)