GVSUM: generic video summarization using deep visual features


Video summarization is the process of producing a condensed summary of video content. This paper proposes a generic video summarization method named GVSUM. The generic summary is generated by selecting a keyframe whenever a major scene change occurs in the video. Every frame is assigned a cluster number based on its visual features, and a keyframe is extracted whenever the cluster number changes from one frame to the next. Visual features are extracted with a pre-trained Convolutional Neural Network (CNN); k-means clustering is then applied to these features, followed by a sequential keyframe generation technique. The optimum number of clusters can optionally be chosen before summarizing by applying the Average Silhouette Width method. Mean Opinion Scores (MOS) of the generated summaries show that GVSUM gives satisfactory results for generic video summarization, as it picks a frame wherever the visual content changes. The quantitative F1 measure also shows promising results.
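The two clustering-related steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-frame cluster labels have already been produced (for example by k-means on CNN features), and the function names `keyframes_from_labels` and `average_silhouette_width` are hypothetical. Treating the first frame as a keyframe and using Euclidean distance are also assumptions made here for illustration.

```python
from math import dist


def keyframes_from_labels(labels):
    """Return indices of frames whose cluster label differs from the previous frame.

    Assumption: frame 0 is always kept as the first keyframe.
    """
    keyframes = []
    previous = None
    for index, label in enumerate(labels):
        if index == 0 or label != previous:
            keyframes.append(index)
        previous = label
    return keyframes


def average_silhouette_width(points, labels):
    """Mean silhouette value over all points (Euclidean distance assumed)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Toy per-frame labels: a keyframe is emitted at every label change.
frame_labels = [0, 0, 0, 1, 1, 0, 0, 2, 2, 2]
print(keyframes_from_labels(frame_labels))  # [0, 3, 5, 7]

# Toy 2-D "features": the well-separated labelling scores higher.
features = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(features, [0, 0, 1, 1])
bad = average_silhouette_width(features, [0, 1, 0, 1])
print(good > bad)  # True
```

To choose the number of clusters in this spirit, one would cluster the features for several candidate values of k and keep the k with the highest average silhouette width. Note that a cluster revisited later in the video (label 0 at index 5 above) yields a new keyframe, which follows from selecting frames sequentially rather than once per cluster.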





Author information



Corresponding author

Correspondence to Madhushree Basavarajaiah.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This publication is an outcome of the research work supported by the Visvesvaraya Ph.D. Scheme, Ministry of Electronics & Information Technology, Government of India (MEITY-PHD-1369).


About this article


Cite this article

Basavarajaiah, M., Sharma, P. GVSUM: generic video summarization using deep visual features. Multimed Tools Appl (2021). https://doi.org/10.1007/s11042-020-10460-0



  • Video summarization
  • Convolutional neural networks
  • Compressed domain
  • Generic summary
  • Transfer learning