Abstract
As textual data have grown exponentially, attention has focused on the need to automatically classify relevant data into one of a set of pre-defined classes. Many practical applications assume that training data are evenly distributed among all classes, but in practice they suffer from the class imbalance problem. Several algorithms and re-sampling methods have been proposed to overcome this problem, but they still suffer from overfitting and information loss. This paper proposes the Decomposed K-Nearest Neighbor (DCM-KNN) algorithm. In the training step, DCM-KNN decomposes the training data into a misclassified set and a correctly classified set based on the result of traditional KNN, and finds the appropriate KNN for each set. In the test step, DCM-KNN estimates whether a test instance is more similar to the misclassified or the correctly classified set, and applies the corresponding KNN. Experimental results show that the proposed algorithm achieves more accurate results under imbalanced conditions.
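The training and test steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the "appropriate KNN" is found or how similarity to each set is estimated, so the sketch assumes Euclidean distance, leave-one-out evaluation, choosing k per set by leave-one-out accuracy, and routing a query by the set membership of its single nearest training point. The names `fit_dcm_knn`, `predict_dcm_knn`, `best_k`, `base_k`, and `candidates`, and the toy data, are all hypothetical.

```python
from collections import Counter
import math

def knn_predict(X, y, query, k, skip=None):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted((math.dist(query, p), i) for i, p in enumerate(X) if i != skip)
    votes = Counter(y[i] for _, i in order[:k])
    return votes.most_common(1)[0][0]

def best_k(X, y, subset, candidates):
    """Assumed criterion: the candidate k with the best leave-one-out accuracy on subset."""
    def hits(k):
        return sum(knn_predict(X, y, X[i], k, skip=i) == y[i] for i in subset)
    return max(candidates, key=hits)

def fit_dcm_knn(X, y, base_k=3, candidates=(1, 3, 5)):
    # Training step: decompose by whether plain KNN (leave-one-out) gets each point right.
    wrong = {i for i in range(len(X))
             if knn_predict(X, y, X[i], base_k, skip=i) != y[i]}
    right = [i for i in range(len(X)) if i not in wrong]
    # Find a separate k for each of the two sets.
    k_right = best_k(X, y, right, candidates) if right else base_k
    k_wrong = best_k(X, y, sorted(wrong), candidates) if wrong else base_k
    return {"X": X, "y": y, "wrong": wrong, "k_right": k_right, "k_wrong": k_wrong}

def predict_dcm_knn(model, query):
    # Test step (assumed rule): the query inherits the set of its nearest training point.
    X, y = model["X"], model["y"]
    nearest = min(range(len(X)), key=lambda i: math.dist(query, X[i]))
    k = model["k_wrong"] if nearest in model["wrong"] else model["k_right"]
    return knn_predict(X, y, query, k)

# Toy imbalanced data: six "a" points, three "b" points (one inside the "a" cluster).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2, 0), (0.5, 0.4), (5, 5), (5, 6)]
y = ["a", "a", "a", "a", "a", "a", "b", "b", "b"]

model = fit_dcm_knn(X, y)
plain = knn_predict(X, y, (3.6, 2.8), 3)          # traditional 3-NN
decomposed = predict_dcm_knn(model, (3.6, 2.8))   # routed through the misclassified set
```

On this toy data, plain 3-NN misclassifies all three minority points during training, so they form the misclassified set and receive a smaller k than the majority points. The query `(3.6, 2.8)` is then labeled "a" by plain 3-NN but "b" by the sketch, because its nearest training point belongs to the misclassified set.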
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, HS., Nam, K., Kim, Si. (2012). The Decomposed K-Nearest Neighbor Algorithm for Imbalanced Text Classification. In: Kim, Th., Lee, Yh., Fang, Wc. (eds) Future Generation Information Technology. FGIT 2012. Lecture Notes in Computer Science, vol 7709. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35585-1_12
DOI: https://doi.org/10.1007/978-3-642-35585-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35584-4
Online ISBN: 978-3-642-35585-1
eBook Packages: Computer Science (R0)