Parameter-Free Imputation for Imbalance Datasets

  • Conference paper
The Emergence of Digital Libraries – Research and Practices (ICADL 2014)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 8839))

Abstract

Class imbalance is the problem of improving classification accuracy on a minority class, while imputation is the process of replacing missing values. Traditionally, the class imbalance and imputation problems have been considered independently. In addition, the minority-class values filled in by traditional methods are not adequate for imbalanced datasets. In this paper, we propose a new parameter-free imputation method for imbalanced datasets: each missing value is estimated as a random value between the mean of the attribute containing the missing value and the value of that attribute in the record closest to the record with the missing value. Our proposed algorithm ignores the mean of instances to avoid an over-fitting problem. Experimental results on imbalanced datasets show that our imputation method outperforms other techniques when class imbalance measures are used.
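The estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance function or the random distribution, so the Euclidean distance over shared observed attributes, the uniform draw, the restriction to numeric attributes, and the names `column_mean`, `distance`, and `impute` are all assumptions made for illustration.

```python
import math
import random


def column_mean(rows, j):
    """Mean of attribute j over the records where it is observed."""
    observed = [r[j] for r in rows if r[j] is not None]
    return sum(observed) / len(observed)


def distance(a, b, skip):
    """Euclidean distance over attributes observed in both records.

    Assumption: the abstract does not name a distance function.
    Attribute `skip` (the one being imputed) is left out.
    """
    shared = [(x, y) for k, (x, y) in enumerate(zip(a, b))
              if k != skip and x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def impute(rows, rng=None):
    """Fill each missing value (None) with a random value between the
    attribute mean and the value held by the closest record.

    Assumes every attribute is observed in at least one other record.
    The input is not modified; a filled copy is returned.
    """
    rng = rng or random.Random()
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            mean_j = column_mean(rows, j)
            # Candidate neighbours must have attribute j observed.
            donors = [r for k, r in enumerate(rows)
                      if k != i and r[j] is not None]
            nearest = min(donors, key=lambda r: distance(row, r, j))
            # Uniform draw is an assumption; uniform() accepts either order.
            filled[i][j] = rng.uniform(mean_j, nearest[j])
    return filled


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, None], [9.0, 30.0]]
    print(impute(data, random.Random(0)))
```

In the small example, the second record is closest to the first, so its missing value is drawn between the first record's value and the attribute mean. No tuning parameter such as a neighbour count appears in the sketch, which is consistent with the "parameter-free" description, although the paper itself may handle ties, nominal attributes, or class labels differently.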

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Takum, J., Bunkhumpornpat, C. (2014). Parameter-Free Imputation for Imbalance Datasets. In: Tuamsuk, K., Jatowt, A., Rasmussen, E. (eds) The Emergence of Digital Libraries – Research and Practices. ICADL 2014. Lecture Notes in Computer Science, vol 8839. Springer, Cham. https://doi.org/10.1007/978-3-319-12823-8_27

  • DOI: https://doi.org/10.1007/978-3-319-12823-8_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12822-1

  • Online ISBN: 978-3-319-12823-8

  • eBook Packages: Computer Science, Computer Science (R0)
