Abstract
The main motivation for research in pattern discovery and machine learning has been the collection and warehousing of large amounts of data in many domains, such as the life sciences and industrial processes. One distinctive problem that has arisen is that of imbalanced data. The class imbalance problem refers to situations where the majority of cases belong to one class and only a small minority belong to the other, even though the minority class is in many cases equally or even more important. A number of approaches to this problem have been studied in the past. In this talk we provide an overview of some existing methods and present novel applications that are based on identifying the inherent characteristics of one class versus the other. We present the results of a number of studies focusing on real data from life science applications.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Famili, A.F. (2014). Searching for Patterns in Imbalanced Data. In: Bayro-Corrochano, E., Hancock, E. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2014. Lecture Notes in Computer Science, vol 8827. Springer, Cham. https://doi.org/10.1007/978-3-319-12568-8_20
DOI: https://doi.org/10.1007/978-3-319-12568-8_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12567-1
Online ISBN: 978-3-319-12568-8
eBook Packages: Computer Science, Computer Science (R0)