Distance Metric Based Oversampling Method for Bioinformatics and Performance Evaluation

  • Meng-Fong Tsai
  • Shyr-Shen Yu
Systems-Level Quality Improvement
Part of the following topical collections:
  1. Smart Living in Healthcare and Innovations


Imbalanced classification refers to the situation in which a dataset has an unequal class distribution among its population. On such datasets, regardless of the balancing issue, the predictions made by most classification methods are highly accurate for the majority class but significantly less accurate for the minority class. To overcome this problem, this study took several imbalanced datasets from the well-known UCI repository and designed and implemented an efficient algorithm that couples Top-N Reverse k-Nearest Neighbor (TRkNN) with the Synthetic Minority Over-sampling TEchnique (SMOTE). The proposed algorithm was evaluated with classification methods such as logistic regression (LR), C4.5, Support Vector Machine (SVM), and Back-Propagation Neural Network (BPNN). This research also adopted different distance metrics to classify the same UCI datasets. The empirical results show that the Euclidean and Manhattan distances are not only more accurate but also more computationally efficient than the Chebyshev and Cosine distances. The proposed algorithm based on TRkNN and SMOTE can therefore be widely used to handle imbalanced datasets, and our recommendations on choosing suitable distance metrics can serve as a reference for future studies.


Imbalanced classification · Synthetic minority oversampling technique · Distance metric · UCI dataset
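To illustrate the core idea, the following is a minimal sketch of plain SMOTE with a pluggable distance metric, as the abstract compares Euclidean, Manhattan, and Chebyshev distances. This is an assumption-laden illustration, not the paper's algorithm: the TRkNN seed-selection step described by the authors is not reproduced here, and the function name `smote` and its parameters are hypothetical.

```python
import numpy as np

def smote(minority, n_synthetic, k=5, metric="euclidean", seed=0):
    """Generate synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbors
    (plain SMOTE; the paper's TRkNN coupling is omitted)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)

    def dist(points, ref):
        d = points - ref
        if metric == "euclidean":
            return np.sqrt(np.sum(d * d, axis=-1))
        if metric == "manhattan":
            return np.sum(np.abs(d), axis=-1)
        if metric == "chebyshev":
            return np.max(np.abs(d), axis=-1)
        raise ValueError(f"unsupported metric: {metric}")

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        # k nearest minority neighbors, excluding the point itself
        nbrs = np.argsort(dist(X, X[i]))[1:k + 1]
        j = rng.choice(nbrs)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on a segment between two existing minority points, the oversampled set stays inside the minority class's bounding region; the choice of metric only changes which neighbors are eligible for interpolation.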



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Computer Science and Engineering, National Chung Hsing University, Taichung, Taiwan
