
A scalable summary generation method based on cross-modal consensus clustering and OLAP cube modeling


Abstract

Video summarization has become a core problem in managing the growing amount of content in multimedia databases. An efficient video summary should give an overview of the video content, and most existing approaches fulfill this goal. However, such an overview does not allow the user to reach all details of interest selectively and progressively. This paper proposes a novel scalable summary generation approach based on the On-Line Analytical Processing (OLAP) data cube. This structure provides tools such as the drill-down operation, which allows efficient browsing of multiple descriptions of a dataset at increasing levels of detail. We adapt this model to video summary generation by expressing a video within a cross-media feature space and by performing clusterings on particular subspaces of this space. Consensus clustering guides the subspace selection strategy at low dimensions, since the novelty brought by the least consensual subspaces is valuable for refining a summary. Our approach is designed for weakly-structured content such as cultural documentaries. We evaluate it on a corpus of cultural archives provided by the French Audiovisual National Institute (INA), using information retrieval metrics that handle single and multiple reference annotations. The results show an overall improvement over two baseline systems performing random and arbitrary segmentations, with a better balance between precision and recall.
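
As a rough, hypothetical illustration of the refinement idea in the abstract (not the authors' implementation), the sketch below clusters a set of video segments on several low-dimensional feature subspaces, measures how much each subspace's clustering agrees with the others, and flags the least consensual subspace as the candidate for refining the summary at the next level of detail. The feature matrix, subspace groups and cluster count are all invented for the example.

```python
# Minimal sketch (not the paper's implementation): rank feature subspaces by
# clustering consensus, as a proxy for the subspace selection step.
# The feature matrix X and its column groups are hypothetical stand-ins
# for the cross-media descriptors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))          # 200 segments, 12 toy cross-media features
subspaces = {                           # hypothetical column groups per descriptor
    "color": [0, 1, 2],
    "motion": [3, 4, 5],
    "audio": [6, 7, 8],
    "texture": [9, 10, 11],
}

# One clustering per subspace.
labels = {
    name: KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X[:, cols])
    for name, cols in subspaces.items()
}

# Consensus of a subspace = mean pairwise agreement (ARI) with the other subspaces.
consensus = {}
for name, lab in labels.items():
    others = [adjusted_rand_score(lab, other)
              for other_name, other in labels.items() if other_name != name]
    consensus[name] = float(np.mean(others))

ranking = sorted(consensus, key=consensus.get)
print("least consensual subspace (candidate for refinement):", ranking[0])
```

The intuition matches the abstract: the subspace whose clustering diverges most from the consensus carries the most novel structure, and is therefore the most informative direction in which to drill down.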



Notes

  1. http://trecvid.nist.gov/

  2. http://www.ina.fr

  3. These features are extracted using the MPEG-7 Feature Extraction Library proposed by the Bilkent University Multimedia Database Group, available at http://cs.bilkent.edu.tr/~bilmdg/bilvideo-7/Software.html.

  4. PHOG descriptors are computed with the MATLAB script by Anna Bosch and Andrew Zisserman, available at http://www.robots.ox.ac.uk/~vgg/research/caltech/phog.html.

  5. The extraction of HOG, HOF and MBH descriptors and the implementation of the bag-of-words approach are realized using the Dense Trajectories Video Description Toolbox by Wang et al., available at http://lear.inrialpes.fr/people/wang/dense_trajectories (a generic sketch of the bag-of-words encoding step is given after these notes).

  6. For the Chroma vectors, the analysis window size is set automatically by the toolbox. The other parameters for the extraction of MFCC and Chroma vectors are left at Yaafe's defaults.

  7. http://www-nlpir.nist.gov/projects/tv2008/tv2008.html#4.4

  8. The datasets used by Li and Merialdo [14] were not considered in this work for copyright reasons.
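
Note 5 above mentions a bag-of-words encoding applied on top of the dense-trajectory descriptors (HOG, HOF, MBH). As a generic sketch of that step only (the descriptor dimensionality, sampling rate and codebook size below are illustrative, not the values used in the paper), one can learn a codebook with k-means over a sample of local descriptors and represent each video as a normalized histogram of codeword assignments:

```python
# Generic bag-of-words encoding sketch (illustrative values, not the paper's settings).
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sample, n_words=256, seed=0):
    """Learn a visual vocabulary with k-means on a sample of local descriptors."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=seed).fit(descriptor_sample)

def encode_video(codebook, descriptors):
    """Encode one video as an L1-normalized histogram of codeword assignments."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy example: random stand-ins for per-video local descriptors (96-D here).
rng = np.random.default_rng(0)
videos = [rng.normal(size=(rng.integers(500, 1000), 96)) for _ in range(3)]
codebook = build_codebook(np.vstack(videos)[::10], n_words=64)
bow_features = np.stack([encode_video(codebook, d) for d in videos])
print(bow_features.shape)   # (3, 64): one histogram per video
```

In this representation, each video (or segment) becomes a fixed-length vector regardless of its duration, which is what makes it usable as one axis of the cross-media feature space.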

References

  1. Almeida J, Leite NJ, da S Torres R (2013) Online video summarization in compressed domain. J Vis Commun Image Represent 24:729–738

  2. Bartolini I, Patella M, Stromei G (2011) The windsurf library for the efficient retrieval of multimedia hierarchical data. In: Proceedings of ACM special interest group on multimedia (SIGMM), pp 139–148

  3. Bartsch MA, Wakefield GH (2005) Audio thumbnailing of popular music using chroma-based representations. IEEE Trans Multimed 7:96–104

  4. Ben Abdelali A, Nidhalkrifa M, Mtibaa A, Bourennane EB (2009) A study of color structure descriptor for shot boundary detection. Int J Sci Tech Autom Control Comput Eng 3(1):956–971

  5. Benini S, Bianchetti A, Leonardi R, Migliorati P (2006) Extraction of significant video summaries by dendrogram analysis. In: Proceedings of the international conference on image processing (ICIP), pp 133–136

  6. Benois-Pineau J, Dupuy W, Barba D (2001) Recovering of visual scenarios in movies by motion analysis and grouping spatio-temporal colour signatures of video shots. In: Proceedings of EUSFLAT’2001, pp 385–389

  7. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, pp 226–231

  8. Goder A, Filkov V (2008) Consensus clustering algorithms: Comparison and refinement. In: Proceedings of 9th workshop on algorithm engineering and experiments (ALENEX’08), pp 109–117

  9. Gong B, Chao WL, Grauman K, Sha F (2014) Diverse sequential subset selection for supervised video summarization. In: Proceedings of the neural information processing systems conference (NIPS), pp 1–9

  10. Gray J, Chaudhuri S, Bosworth A, Layman A, Reichart D, Venkatrao M, Pellow F, Pirahesh H (1997) Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min Knowl Disc 1(1):29–53

  11. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666

  12. Jin X, Han J, Cao L, Luo J, Ding B, Lin CK (2010) Visual cube and on-line analytical processing of images. In: Proceedings of the 19th ACM international conference on information and knowledge management (CIKM), pp 849–858

  13. Kompatsiaris Y, Merialdo B, Lian S (eds) (2012) TV content analysis. Techniques and applications. CRC Press

  14. Li Y, Merialdo B (2010) VERT: automatic evaluation of video summaries. In: Proceedings of ACM multimedia, pp 851–854

  15. Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) YAAFE, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th International society for music information retrieval (ISMIR), pp 441–446

  16. Messing D, van Beek P, Errico JH (2001) The MPEG-7 color structure descriptor: Image description using color and local spatial information. In: Proceedings of the international conference on image processing (ICIP), pp 670–673

  17. Naci U, Damnjanovic U, Mansencal B, Benois-Pineau J, Kaes C, Corvaglia M, Rossi E, Aginako N (2008) The COST292 experimental framework for rushes summarization task in TRECVID 2008. In: Proceedings of the 2nd ACM TRECVID video summarization workshop, pp 40–44

  18. Peltonen V, Tuomi J, Klapuri A, Huopaniemi J, Sorsa T (2002) Computational auditory scene recognition. In: Proceedings of the 2002 IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 1941–1944

  19. Pinquier J, Karaman S, Letoupin L, Guyot P, Mégret R, Benois-Pineau J, Gaëstel Y, Dartigues JF (2012) Strategies for multiple feature fusion with Hierarchical HMM: Application to activity recognition from wearable audiovisual sensors. In: Proceedings of the 21st international conference on pattern recognition (ICPR), pp 3192–3195

  20. Quénot G, Benois-Pineau J, Mansencal B, Rossi E, Cord M, Precioso F, Gorisse D, Lambert P, Augereau B, Granjon L, Pellerin D, Rombaut M, Ayache S (2008) Rushes summarization by IRIM consortium: redundancy removal and multi-feature fusion. In: Proceedings of the 2nd ACM TRECVID video summarization workshop, pp 80–84

  21. Perez-Daniel KR, Nakano-Miyatake M, Benois-Pineau J, Maabout S, Sargent G (2014) Scalable video summarization of cultural video documents in cross-media space based on data cube approach. In: Proceedings of the 12th international workshop on content-based multimedia indexing (CBMI), pp 1–6

  22. Wang H, Kläser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103:60–79

  23. Wang J, Liu P, She M, Kouzani A, Nahavandi S (2011) The MPEG-7 color structure descriptor: Image description using color and local spatial information. In: Proceedings of 2011 IEEE international conference on systems, man, and cybernetics (SMC), pp 2449–2454

  24. Yeung M, Yeo BL (1996) Time-constrained clustering for segmentation of video into story units. In: Proceedings of the 13th international conference on pattern recognition (ICPR), vol. 3, pp 375–380

  25. Yong-ge W, Sheng-ze P (2012) Research on image retrieval based on scalable color descriptor of MPEG-7. Adv Control Commun:91–98


Acknowledgments

This work is supported by the French National Research Agency under grant ANR-11-IS02-001, within the joint French-Mexican project Mex-Culture. We are grateful to the Institut National de l’Audiovisuel (INA, France) for providing the video content used in our evaluation. The authors thank Michel Crucianu and Marin Ferecatu for valuable discussions, and master’s student Elie Génard for his efficient help in conducting the computational experiments.

Author information

Corresponding author

Correspondence to Gabriel Sargent.


About this article


Cite this article

Sargent, G., Perez-Daniel, K.R., Stoian, A. et al. A scalable summary generation method based on cross-modal consensus clustering and OLAP cube modeling. Multimed Tools Appl 75, 9073–9094 (2016). https://doi.org/10.1007/s11042-015-2863-3
