Event detection in soccer videos using unsupervised learning of Spatio-temporal features based on pooled spatial pyramid model

  • Babak Fakhar
  • Hamidreza Rashidy Kanan (corresponding author)
  • Alireza Behrad


Most existing research on the semantic analysis of soccer videos relies on special-purpose approaches that bridge the semantic gap between low-level features and high-level events using a hierarchical structure. In this paper, we propose a novel data-driven model for the automatic recognition of important events in soccer broadcast videos based on the analysis of spatio-temporal local features of video frames. The presented algorithm explores the local visual content of video frames by focusing on learned spatial and temporal features in a low-dimensional transformed sparse space. Without using mid-level features, the proposed algorithm dynamically extracts the most informative semantic concepts/features and improves the generality of the system. The dictionary learning process plays an important role in sparse coding and in sparse representation-based event classification. We therefore present a novel dictionary learning method that computes several category-specific dictionaries by training on the detected shots of the various view categories. To evaluate the feasibility and effectiveness of the proposed algorithm, an extensive experimental investigation is conducted on the analysis, detection, and classification of soccer events over a large collection of video data. Experimental results indicate that our approach outperforms state-of-the-art methods, demonstrating its effectiveness.
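As a rough illustration of how category-specific dictionaries can drive classification, the sketch below codes a feature vector against each category's dictionary and picks the category with the smallest reconstruction residual. It is not the method of the paper: plain greedy matching pursuit stands in for the sparse coder, and the category names, atoms, feature vector, and function names are all hypothetical toy values chosen for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit_residual(x, atoms, n_nonzero=2):
    """Greedy matching pursuit over unit-norm atoms.

    Returns the Euclidean norm of what is left of x after
    n_nonzero greedy atom selections.
    """
    residual = list(x)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with the current residual.
        best = max(atoms, key=lambda d: abs(dot(residual, d)))
        coef = dot(residual, best)
        # Remove that atom's contribution from the residual.
        residual = [r - coef * b for r, b in zip(residual, best)]
    return math.sqrt(dot(residual, residual))


def classify(x, dictionaries, n_nonzero=2):
    """Assign x to the category whose dictionary reconstructs it best."""
    residuals = {
        label: matching_pursuit_residual(x, atoms, n_nonzero)
        for label, atoms in dictionaries.items()
    }
    return min(residuals, key=residuals.get), residuals


# Hypothetical toy dictionaries, one per (assumed) view category.
dictionaries = {
    "long_view": [normalize([1, 0, 0, 0]), normalize([1, 1, 0, 0])],
    "close_up": [normalize([0, 0, 1, 0]), normalize([0, 0, 1, 1])],
}

# Hypothetical feature vector that lies mostly in the "long_view" subspace.
label, residuals = classify([0.9, 0.4, 0.05, 0.0], dictionaries)
print(label, residuals)
```

In practice the dictionaries would be learned from training shots rather than written by hand, and a stronger coder (for example orthogonal matching pursuit or locality-constrained linear coding) would likely replace the greedy loop.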


Semantic video analysis; Pooled spatial pyramid feature based sparse representation (PSPFSR); Locality-constrained linear coding (LLC); Long short-term memory (LSTM)




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Faculty of Computer and Information Technology Engineering, Qazvin Branch, Islamic Azad University, Qazvin, Iran
  2. Department of Computer Engineering, Shahid Rajaee Teacher Training University, Tehran, Iran
  3. Department of Electrical Engineering, Shahed University, Tehran, Iran
