Data Analysis in Python: Anonymized Features and Imbalanced Data Target

  • Emanuel Rocha Woiski


The remaining useful life (RUL) of a piece of equipment or a system is a prognostic value that depends on data gathered from multiple, diverse sources. Moreover, framed in the present study as a binary classification problem, the probability that a system fails is usually much smaller than the probability that it is in normal operating condition. The imbalanced outcome (far more ‘normal’ than ‘failure’ states) at any time results from the combined values of a large set of features, some related to one another, some redundant, and most quite noisy. Anticipating the development and requirements of a robust framework, it is argued that these difficulties can be handled with Python libraries. In the present chapter, DOROTHEA, a dataset from the UCI repository with one hundred thousand sparse, anonymized (i.e., unrecognizably labeled) binary features and imbalanced binary classes, is analyzed. Working in an IPython (Jupyter) notebook, pandas is used to import the dataset, some exploratory analysis and feature engineering are performed, and several estimators (classifiers) from the scikit-learn library are applied. It is demonstrated that global accuracy does not work in this case, since the minority class is easily overlooked by the algorithms. Therefore, receiver operating characteristic (ROC) curves, precision-recall curves, and their respective areas under the curve (AUCs), evaluated for each estimator or ensemble, together with some simple statistics, are compared under three hybrid feature-selection strategies, each mixing filter, embedded, and wrapper methods.
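The failure of global accuracy on imbalanced classes can be illustrated with a short scikit-learn sketch. DOROTHEA itself is not reproduced here; a synthetic, binarized dataset with roughly 10% positives (built with `make_classification`, an assumption of this sketch, not the chapter's pipeline) stands in for it. A majority-class baseline reaches high accuracy while its ROC AUC stays at chance level:

```python
import numpy as np
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for DOROTHEA: ~10% minority class, features
# binarized and stored sparsely to mimic its binary sparse format.
X_dense, y = make_classification(n_samples=1000, n_features=200,
                                 n_informative=10, weights=[0.9],
                                 random_state=0)
X = sparse.csr_matrix((X_dense > 0).astype(float))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A classifier that always predicts the majority class, versus a real one.
dummy = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, clf in [("majority-class dummy", dummy), ("random forest", forest)]:
    scores = clf.predict_proba(X_te)[:, 1]
    print("%-20s accuracy=%.3f  ROC AUC=%.3f  avg precision=%.3f" % (
        name,
        accuracy_score(y_te, clf.predict(X_te)),
        roc_auc_score(y_te, scores),
        average_precision_score(y_te, scores)))
```

The dummy's accuracy is near the 90% majority prevalence even though it never detects a failure, which is why the chapter turns to ROC and precision-recall AUCs instead.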
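The three feature-selection families the chapter hybridizes can also be sketched with scikit-learn. The particular estimators and the choice of ten retained features below are illustrative assumptions, not the chapter's exact configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset with binary features, standing in for DOROTHEA.
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           weights=[0.9], random_state=0)
X = (X > 0).astype(int)

# Filter: rank features by a statistic computed independently of any model.
filt = SelectKBest(chi2, k=10).fit(X, y)

# Embedded: selection falls out of fitting a model with built-in importances.
emb = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0),
                      threshold=-np.inf, max_features=10).fit(X, y)

# Wrapper: repeatedly refit a model, discarding the weakest features each round.
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

for name, sel in [("filter", filt), ("embedded", emb), ("wrapper", wrap)]:
    print(name, "kept features:", np.flatnonzero(sel.get_support()))
```

In a hybrid strategy, a cheap filter typically prunes the hundred-thousand-feature space first, and the costlier embedded or wrapper stage refines the survivors.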


Keywords: Data analysis · Machine learning · Scikit-learn · Python · Imbalanced classes · ROC · Precision-recall



To approach DOROTHEA, Python, NumPy, matplotlib, pandas, SciPy sparse, and above all scikit-learn were employed throughout, and the author is very grateful to the developers of those wonderful open-source packages. The author must also acknowledge DuPont Pharmaceuticals Research Laboratories and the KDD Cup 2001 for graciously allowing the use of the data from which the DOROTHEA dataset was built. Finally, the author wishes to thank Dr. João Paulo Dias, of the Department of Mechanical Engineering of Texas Tech University, for his comments on the manuscript and his invaluable help organizing the references.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Department of Mechanical Engineering, São Paulo State University (UNESP), Ilha Solteira, Brazil
