Correlation based feature fusion for the temporal video scene segmentation task

  • Rodrigo Mitsuo Kishi
  • Tiago Henrique Trojahn
  • Rudinei Goularte


The available automatic temporal video scene segmentation methods are still not effective enough to be employed in most practical multimedia systems. The ones showing better results are multimodal and based on late fusion. Early fusion, on the other hand, has not been sufficiently investigated for this task because of the well-known barriers of the approach: correlation identification, temporal synchronization and unique representation. This work presents a feature fusion method that deals with these difficulties and produces features that can enhance the efficacy of existing temporal video scene segmentation methods. The fusion process is performed on single-modal Bag of Features vectors and is intended to enrich previously captured latent semantics by performing temporal clustering of features, providing a unified representation of multiple temporally related features. The fusion process has been coupled with two off-the-shelf scene segmentation algorithms, presenting competitive results when compared with two other state-of-the-art multimodal temporal scene segmentation methods. The results indicate that the proposed early fusion feature representation method is a promising alternative for boosting video retrieval related tasks.
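The distinction between early and late fusion can be illustrated with a minimal sketch. It is not the method proposed in this work: it does not perform correlation identification or temporal clustering, and only shows the generic idea of building one unified vector per shot from two single-modal Bag of Features histograms. It assumes the histograms are already temporally aligned by shot index; the function names `l1_normalize` and `early_fusion` are hypothetical.

```python
from typing import List, Sequence


def l1_normalize(hist: Sequence[float]) -> List[float]:
    """Scale a histogram so its bins sum to 1 (all-zero input stays zero)."""
    total = sum(hist)
    if total == 0:
        return [0.0 for _ in hist]
    return [value / total for value in hist]


def early_fusion(
    visual_bof: Sequence[Sequence[float]],
    aural_bof: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Concatenate per-shot histograms of two modalities (assumed aligned).

    Each modality is normalized separately so that neither dominates the
    fused vector merely because it has more codewords or larger counts.
    """
    if len(visual_bof) != len(aural_bof):
        raise ValueError("both modalities must describe the same shots")
    return [
        l1_normalize(visual) + l1_normalize(aural)
        for visual, aural in zip(visual_bof, aural_bof)
    ]


# Two shots, each with a 2-bin visual and a 2-bin aural histogram.
fused = early_fusion([[2, 2], [0, 4]], [[1, 3], [5, 5]])
```

A late fusion approach would instead keep the two histograms apart, run a segmentation or similarity step on each modality, and combine only the resulting decisions. The fused vectors above can be passed unchanged to a segmentation algorithm that expects a single feature vector per shot.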


Multimedia, Video, Temporal scene segmentation, Early fusion



The authors would like to thank Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), Universidade Federal de Mato Grosso do Sul (UFMS), Universidade de São Paulo (USP) and Instituto Federal de São Paulo (IFSP) for financial support. This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. The authors would also like to thank Dr Lorenzo Baraldi for providing evaluation scripts. This research has been developed using computational resources from Centro de Ciências Matemáticas Aplicadas à Indústria (CeMEAI), financed by FAPESP.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. São Paulo University, São Carlos, Brazil
  2. Federal University of Mato Grosso do Sul, Três Lagoas, Brazil
  3. Federal Institute of São Paulo, São Carlos, Brazil
