The Decomposed K-Nearest Neighbor Algorithm for Imbalanced Text Classification

Kang, Hyung-Seok; Nam, Kihyo; Kim, Seong-in

doi:10.1007/978-3-642-35585-1_12

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7709)

Abstract

As the volume of textual data has grown exponentially, the need to automatically classify documents into one of a set of predefined classes has drawn increasing attention. Many practical applications assume that training data are evenly distributed among all classes, but in practice they suffer from class imbalance. Several algorithms and re-sampling methods have been proposed to overcome the imbalance problem, but they still face overfitting and information loss. This paper proposes the Decomposed K-Nearest Neighbor (DCM-KNN) algorithm. In the training step, DCM-KNN decomposes the training data into a misclassified set and a correctly classified set based on the result of traditional KNN, and finds an appropriate KNN for each set. In the test step, DCM-KNN estimates whether a test document is more similar to the misclassified set or to the correctly classified set, and applies the corresponding KNN. Experimental results show that the proposed algorithm can achieve more accurate results under imbalanced conditions.
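The two steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the "appropriate KNN" is found for each set or how similarity to a set is estimated, so everything below is an assumption made for illustration, not the authors' implementation: the helper names (knn_predict, fit_dcm_knn, predict_dcm_knn), the Euclidean distance on toy 2-D points, the leave-one-out selection of k per set, the tie-break toward larger k, and the routing of a test point by the set membership of its single nearest training point.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(x, data, labels, k, skip=None):
    """Plain KNN: majority vote among the k nearest points (optionally skipping one index)."""
    order = sorted((i for i in range(len(data)) if i != skip),
                   key=lambda i: euclidean(x, data[i]))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def fit_dcm_knn(data, labels, base_k=3, candidate_ks=(1, 3, 5)):
    n = len(data)
    # Training, part 1: leave-one-out predictions of plain KNN split the
    # training data into a misclassified set and a correctly classified set.
    wrong = [knn_predict(data[i], data, labels, base_k, skip=i) != labels[i]
             for i in range(n)]

    # Training, part 2 (assumed criterion): for each set, keep the k with the
    # most leave-one-out hits on that set; ties go to the larger k.
    def best_k(indices):
        if not indices:
            return base_k
        def hits(k):
            return sum(knn_predict(data[i], data, labels, k, skip=i) == labels[i]
                       for i in indices)
        return max(candidate_ks, key=lambda k: (hits(k), k))

    return {
        "data": data, "labels": labels, "wrong": wrong,
        "k_wrong": best_k([i for i in range(n) if wrong[i]]),
        "k_right": best_k([i for i in range(n) if not wrong[i]]),
    }

def predict_dcm_knn(model, x):
    # Test step (assumed similarity estimate): the set membership of the
    # nearest training point decides which k is applied.
    data = model["data"]
    nearest = min(range(len(data)), key=lambda i: euclidean(x, data[i]))
    k = model["k_wrong"] if model["wrong"][nearest] else model["k_right"]
    return knn_predict(x, data, model["labels"], k)

# Toy imbalanced data: six "neg" points near the origin, two "pos" points far away.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8), (3, 3), (3.2, 3.1)]
labels = ["neg"] * 6 + ["pos"] * 2
model = fit_dcm_knn(data, labels)
print(model["k_wrong"], model["k_right"], predict_dcm_knn(model, (2.9, 3.0)))
```

In this toy setting the two minority points end up in the misclassified set, so a test point near them is handled with the k selected for that set rather than with a single global k. Real text classification would replace the 2-D points with document vectors and, most likely, a different similarity measure.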

Author information

Authors and Affiliations

Division of Industrial Management Engineering, Korea University, 5-1 Anam-dong, Seongbuk-gu, Seoul, 136-701, Republic of Korea
Hyung-Seok Kang & Seong-in Kim
UMLogics Co., Ltd., E-420 Pangyo Inovalley, 622 Sampyung-dong, Bundang-gu, Seongnam-city, Kyungki-do, 463-400, Republic of Korea
Kihyo Nam

Editor information

Editors and Affiliations

GVSA and University of Tasmania, Hobart, TAS, Australia
Tai-hoon Kim
Hannam University, Daejeon, South Korea
Young-hoon Lee
National Chiao Tung University, Hsinchu, Taiwan, ROC
Wai-chi Fang

About this paper

Cite this paper

Kang, HS., Nam, K., Kim, Si. (2012). The Decomposed K-Nearest Neighbor Algorithm for Imbalanced Text Classification. In: Kim, Th., Lee, Yh., Fang, Wc. (eds) Future Generation Information Technology. FGIT 2012. Lecture Notes in Computer Science, vol 7709. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35585-1_12

DOI: https://doi.org/10.1007/978-3-642-35585-1_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35584-4
Online ISBN: 978-3-642-35585-1
eBook Packages: Computer Science, Computer Science (R0)
