PDD Algorithm for Balancing Medical Data

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 905)

Abstract

Many factors can affect the performance of a machine learning classifier, of which an unbalanced dataset is the most prominent. An unbalanced dataset is one in which the classes are disproportionate, i.e. instances of one class heavily outnumber instances of all other classes. This problem is especially common in medical data, which is collected from the real world, where the number of people affected by a disease is usually smaller than the number of unaffected people. Because of this disproportion, classifiers have difficulty learning the concepts associated with the minority class. Most data balancing techniques are designed with general data in mind and are not viable for medical data. In this paper, a method is proposed that balances medical data more effectively while increasing classifier performance and decreasing the classifier's learning time.
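The class disproportion described in the abstract can be made concrete with a small sketch. The code below is not the PDD algorithm, which this page does not describe; it only measures imbalance on a hypothetical screening dataset (the labels, counts, and function names are assumptions chosen for illustration) and shows why a classifier that ignores the minority class can still look accurate.

```python
from collections import Counter


def class_distribution(labels):
    """Return per-class counts and the majority/minority imbalance ratio."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return counts, majority / minority


def majority_baseline_scores(labels, positive):
    """Score a trivial classifier that always predicts the most common class."""
    counts = Counter(labels)
    predicted = counts.most_common(1)[0][0]
    # Accuracy is simply the share of the predicted (majority) class.
    accuracy = counts[predicted] / len(labels)
    # Recall on the positive class is all-or-nothing for a constant predictor.
    recall = 1.0 if predicted == positive else 0.0
    return accuracy, recall


# Hypothetical data: 950 healthy records and 50 diseased records.
labels = ["healthy"] * 950 + ["diseased"] * 50
counts, ratio = class_distribution(labels)
accuracy, recall = majority_baseline_scores(labels, positive="diseased")
print(f"counts={dict(counts)} ratio={ratio:.1f} "
      f"accuracy={accuracy:.2f} minority_recall={recall:.2f}")
```

Under these assumed counts the constant predictor is right on 95% of records while finding none of the diseased cases, which is why imbalanced medical data is usually evaluated with minority-class recall or similar measures rather than accuracy alone.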

K. Kalra and R. Goyal—These authors contributed equally to this work.

Corresponding authors

Correspondence to Karan Kalra or Riya Goyal.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kalra, K., Goyal, R., Kaur, S., Kumar, P. (2018). PDD Algorithm for Balancing Medical Data. In: Singh, M., Gupta, P., Tyagi, V., Flusser, J., Ören, T. (eds) Advances in Computing and Data Sciences. ICACDS 2018. Communications in Computer and Information Science, vol 905. Springer, Singapore. https://doi.org/10.1007/978-981-13-1810-8_26

  • DOI: https://doi.org/10.1007/978-981-13-1810-8_26

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-1809-2

  • Online ISBN: 978-981-13-1810-8

  • eBook Packages: Computer Science, Computer Science (R0)
