# Data Analysis in Python: Anonymized Features and Imbalanced Data Target

## Abstract

The remaining useful life (RUL) of a piece of equipment or a system is a prognostic value that depends on data gathered from multiple and diverse sources. Moreover, treating failure prediction as a binary classification problem, as in the present study, the probability that a system fails is usually much smaller than the probability that it is in normal operating condition. This imbalanced outcome (far more ‘normal’ than ‘failure’ states) at any given time results from the combined values of a large set of features, some related to one another, some redundant, and most quite noisy. Anticipating the development and requirements of a robust framework, it is argued that Python libraries can deal with these difficulties. In the present chapter, DOROTHEA, a dataset from the UCI repository with one hundred thousand sparse, anonymized (i.e., unrecognizably labeled) binary features and imbalanced binary classes, is analyzed. In an IPython (Jupyter) notebook, pandas is used to import the dataset, some exploratory analysis and feature engineering are performed, and several estimators (classifiers) from the scikit-learn library are applied. It is demonstrated that global accuracy does not work for this case, since the minority class is easily overlooked by the algorithms. Therefore, receiver operating characteristic (ROC) curves, precision-recall curves, and the respective areas under the curve (AUCs) obtained from each estimator or ensemble, together with some simple statistics, are compared under three hybrid feature selection strategies, each mixing filter, embedded, and wrapper methods.
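The point about global accuracy can be sketched in a few lines of scikit-learn. This is an illustrative example on a synthetic imbalanced dataset, not the chapter's actual DOROTHEA pipeline: accuracy is inflated by the majority class, while ROC AUC and the area under the precision-recall curve (average precision) score the classifier's ranking of the rare positives.

```python
# Illustrative sketch (synthetic data, not the chapter's DOROTHEA pipeline):
# on an imbalanced binary problem, global accuracy can look high even when
# the minority class is largely ignored; ROC AUC and precision-recall AUC
# (average precision) are more informative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, average_precision_score

# Synthetic stand-in: ~5% positives among many noisy/redundant features.
X, y = make_classification(n_samples=2000, n_features=100, n_informative=10,
                           n_redundant=20, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]          # probability of the rare class

acc = accuracy_score(y_te, clf.predict(X_te))   # inflated by the majority class
roc = roc_auc_score(y_te, scores)               # threshold-independent ranking
ap = average_precision_score(y_te, scores)      # area under precision-recall
print(f"accuracy={acc:.3f}  ROC AUC={roc:.3f}  PR AUC={ap:.3f}")
```

Note that a classifier predicting "normal" for every sample would already reach about 95% accuracy here, which is why the threshold-free curve metrics are compared instead.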
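A hybrid feature selection strategy of the general kind the chapter compares can be chained in a scikit-learn `Pipeline`. The sketch below is hypothetical (the specific steps, thresholds, and synthetic data are this example's assumptions, not the chapter's actual methods): a filter step (chi-squared ranking) followed by an embedded step (tree-based importances via `SelectFromModel`) before the final classifier.

```python
# Hypothetical hybrid feature selection sketch: filter (chi2 ranking) followed
# by an embedded method (random-forest importances), chained before a
# classifier. The data is a synthetic stand-in for sparse binary features.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2, SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = (rng.random((500, 200)) < 0.05).astype(int)    # sparse-ish binary matrix
y = (X[:, :5].sum(axis=1) >= 1).astype(int)        # label driven by 5 features

pipe = Pipeline([
    ("filter", SelectKBest(chi2, k=50)),           # filter: keep top 50 by chi2
    ("embedded", SelectFromModel(                  # embedded: keep features with
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold="median")),                      #   above-median importance
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(X, y)

n_kept = int(pipe.named_steps["embedded"].get_support().sum())
print(f"features kept after hybrid selection: {n_kept}")
```

A wrapper step (e.g. recursive feature elimination) could be inserted in the same pipeline; chaining the stages this way keeps the whole selection inside the estimator, so it can be cross-validated without leaking test data into the selection.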

## Keywords

Data analysis · Machine learning · Scikit-learn · Python · Imbalanced classes · ROC · Precision-recall

## Notes

### Acknowledgements

In order to approach DOROTHEA, *Python, numpy, matplotlib, pandas, scipy.sparse,* and above all *scikit-learn* were employed throughout, and the author is very grateful to the developers of those wonderful open-source packages. The author must also acknowledge DuPont Pharmaceuticals Research Laboratories, as well as KDD Cup 2001, for graciously allowing the use of the data from which the DOROTHEA dataset was built. Finally, the author wishes to thank Dr. João Paulo Dias, from the Department of Mechanical Engineering of Texas Tech University, for his comments on the manuscript and for his invaluable help in organizing the references.
