A study on unstable cuts and their application to sample selection

  • Sheng Xing
  • Zhong Ming
Original Article


An unstable cuts-based sample selection (UCBSS) method is proposed. It addresses the main drawbacks of traditional distance-based sample selection methods when compressing large datasets, namely their high running time and computational complexity. The core idea is that a convex function attains its extreme values at boundary points. The method measures how close each sample lies to the class boundary by marking unstable cuts, counting the number of unstable cuts per sample, and applying a threshold, thereby obtaining unstable subsets. Experimental results show that the method is well suited to compressing large datasets with a high imbalance ratio. Compared with the traditional condensed nearest neighbour (CNN) method, it achieves similar compression ratios and higher G-mean values on datasets with a high imbalance ratio. When the classifier's discriminant function is convex, it achieves similar accuracy and higher compression ratios on datasets with significant noise. In addition, its running time shows a clear advantage.
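The boundary-counting idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rule that each sample adjacent to a class-label change along a sorted feature receives one count, and the simple threshold selection are assumptions based only on the abstract.

```python
import numpy as np

def unstable_cut_scores(X, y):
    """For each sample, count how many features place it next to an
    'unstable cut': a cut point between two adjacent samples of
    different classes along that feature (a boundary indicator)."""
    n, d = X.shape
    scores = np.zeros(n, dtype=int)
    for j in range(d):
        order = np.argsort(X[:, j], kind="stable")
        labels = y[order]
        # positions where the class label changes between sorted neighbours
        for k in np.flatnonzero(labels[:-1] != labels[1:]):
            scores[order[k]] += 1      # sample just below the cut
            scores[order[k + 1]] += 1  # sample just above the cut
    return scores

def select_boundary_samples(X, y, threshold):
    """Keep samples whose unstable-cut count reaches the threshold."""
    return np.flatnonzero(unstable_cut_scores(X, y) >= threshold)

# Toy 1-D example: two classes separated near x = 0.5.
X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
print(select_boundary_samples(X, y, threshold=1))  # -> [2 3]
```

Only the two samples straddling the class boundary survive the threshold, which matches the intuition that boundary samples are the ones worth keeping when a convex discriminant function is decided at the boundary.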


Keywords: Sample selection; Discriminant function; Convex function; Large dataset; Imbalanced dataset



Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. College of Management, Hebei University, Baoding, China
  2. College of Computer Science and Engineering, Cangzhou Normal University, Cangzhou, China
  3. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
