
Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset

Multimedia Tools and Applications

Abstract

Data imbalance is common in real-world datasets, especially in massive video datasets. Traditional machine learning algorithms, however, assume a balanced data distribution and equal misclassification costs, so they struggle to describe the true data distribution and tend to misclassify. In this paper, the data imbalance problem in semantic extraction from massive video datasets is studied, and an enhanced and hierarchical structure (EHS) algorithm is proposed. In the proposed algorithm, data sampling, filtering, and model training are integrated compactly in a hierarchical structure, so that model performance improves step by step and remains robust and stable as features and datasets change. Experiments on TRECVID2010 Semantic Indexing demonstrate that the proposed algorithm performs much better than traditional machine learning algorithms and stays stable and robust when different kinds of features are employed. Extended experiments on TRECVID2010 Surveillance Event Detection also show that the EHS algorithm is efficient and effective, reaching top performance in four of seven events.
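The opening claim of the abstract, that assuming balanced classes and equal misclassification costs leads to misclassification, can be illustrated with a small sketch. This is not the EHS algorithm, whose details are not given on this page. The labels below are synthetic (10 positives among 1,000 samples), and the helper names and the cost value of 99 are assumptions chosen only for illustration.

```python
# Illustrative only: synthetic labels, not data or methods from the paper.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def total_cost(y_true, y_pred, fn_cost=1.0, fp_cost=1.0):
    """Sum of misclassification costs; equal costs by default."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return fn * fn_cost + fp * fp_cost


# Hypothetical imbalanced set: 10 positive samples, 990 negative samples.
y_true = [1] * 10 + [0] * 990

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)                      # high despite missing every positive
rec = recall(y_true, y_pred)                        # no positive sample is found
equal_cost = total_cost(y_true, y_pred)             # each error costs 1
skewed_cost = total_cost(y_true, y_pred, fn_cost=99.0)  # assumed: a miss costs 99

print(f"accuracy={acc:.2f} recall={rec:.2f}")
print(f"equal-cost total={equal_cost} skewed-cost total={skewed_cost}")
```

Under equal costs the always-negative classifier looks nearly perfect, while its recall on the rare class is zero. Weighting missed positives more heavily exposes the failure, which is the kind of gap that sampling and cost-aware training aim to close.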



Acknowledgements

This material is based in part upon work supported by the National Science Foundation under Grants No. 0624236 and 0751185. Zan Gao is partially supported by the NSFC (No. 90920001) and by the Key Project in the Science and Technology Pillar Program of Tianjin, P.R. China (10ZCKFGX00400). We also thank the anonymous reviewers for their valuable suggestions.

Author information

Corresponding author

Correspondence to Zan Gao.

About this article

Cite this article

Gao, Z., Zhang, Lf., Chen, My. et al. Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimed Tools Appl 68, 641–657 (2014). https://doi.org/10.1007/s11042-012-1071-7


  • DOI: https://doi.org/10.1007/s11042-012-1071-7
